OrthoDB

user guide

Go to OrthoDB >>

Terminology

Orthologs

Orthologs are genes in different species that evolved from a common ancestral gene by speciation. If one or both of these genes were duplicated after the speciation they are all termed co-orthologs, or just orthologs.

Orthologous group / level-of-orthology

If more than two species are considered, there is more than one speciation event, and we refer to all descendants of a particular single gene of the last common ancestor of these species as orthologs, or an orthologous group. Thus our operational definition refers to a specific radiation in the phylogeny of a set of species, termed the level-of-orthology.

Ortholog functions

It is a reasonable hypothesis that orthologs retain the functions of their ancestral gene ("by tradition"), although there are examples of gene functions being gained and lost. A statement of gene orthology, however, refers to the evolutionary relationship between genes, not to whether their functions were kept or altered.

Paralogs

Paralogs are genes that evolved by duplication inside a genome. The notions of orthologs and paralogs are not mutually exclusive: paralogs are co-orthologs if they were duplicated after the speciation, and are not if they were duplicated earlier.

User interface

OrthoDB can be queried using a gene name, identifier, annotation keywords, etc. We indexed many relevant identifiers of proteins and genes, including UniProtKB, Ensembl, InterPro, KEGG, GenBank, RefSeq, etc.

To query specifically for a numeric NCBI gene ID, switch from Text to NCBI ID on the left of the search input.

To query for EC numbers, use double quotes, e.g. "3.1.1.-".

Text query format

  • Use double quotation marks to match a phrase, e.g. "Cytochrome P450"
  • Take advantage of the autocomplete lookup feature
  • Logical operator NOT: use '-' or '!', e.g. kinase !tyrosine
  • Logical operator OR: use '|', e.g. protease | peptidase
  • Logical operator AND is implicit, e.g. sodium transporter means sodium AND transporter (if not quoted)

OrthoDB can be queried by homology to a protein sequence: switch from Text to Sequence on the left of the search input and paste the query protein sequence without a header line.

Advanced options

Phyloprofile

The result list of Orthologous Groups can be filtered for

  • universality, i.e. having member genes in all species of the selected taxonomic node, or a fraction of them, e.g. present in all, in >90% or >80% of the species.
  • gene copy-number (duplicability), requiring them to have only single-copy orthologs in all species of the selected taxonomic node, or a fraction of them, e.g. single-copy in all, in >90% or >80% of the species.

You can combine any presence filter with any copy-number filter to refine your results, e.g. present in >90% AND single-copy in >80% of species.
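
The two filters can be pictured as simple fractions over the species of the selected node. The following Python sketch is only an illustration of that idea and not OrthoDB's implementation: the function names are hypothetical, the input is assumed to be a mapping from species to gene copy number, and the single-copy fraction is assumed to be taken over all species of the node.

```python
def phyloprofile_fractions(copy_counts):
    """Return (presence fraction, single-copy fraction) for one group.

    copy_counts maps every species of the selected taxonomic node to the
    number of member genes it has in the group (0 means absent).
    """
    n_species = len(copy_counts)
    if n_species == 0:
        raise ValueError("no species in the selected node")
    present = sum(1 for c in copy_counts.values() if c >= 1)
    single = sum(1 for c in copy_counts.values() if c == 1)
    return present / n_species, single / n_species


def meets(fraction, threshold):
    """A threshold of 1.0 means 'all species'; lower thresholds are strict '>'."""
    if threshold >= 1.0:
        return fraction == 1.0
    return fraction > threshold


def passes_filters(copy_counts, universal=0.9, singlecopy=0.8):
    """Combine a presence filter with a copy-number filter (logical AND)."""
    present, single = phyloprofile_fractions(copy_counts)
    return meets(present, universal) and meets(single, singlecopy)
```

For example, a group present in all ten species of a node but duplicated in one of them passes "present in >90% AND single-copy in >80%", and fails "single-copy in all".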

Select species

You can tailor your search by using the expandable species tree to select a radiation point or particular sets of species.

  • Expand or collapse any node on the tree by clicking on the filled arrows or node names.
  • Select all species at a node by clicking on the unfilled box next to the node name, or
  • select specific species by clicking on the unfilled box next to the species name. You may also add species to the list of selected species to display by typing the species name in the search box and selecting from the autocompleted options. As you add or remove species from the expandable species tree, the Species to display box above it will automatically update to reflect your selections.

Search at (level-of-orthology)

OrthoDB Orthologous Groups are hierarchical, being delineated at the major radiations along the species phylogeny. This makes it possible to resolve orthologs at a particular level-of-orthology: considering many distantly related species delineates fewer, more general (inclusive) orthologous groups containing all the descendants of the ancestral gene, while examining only sets of more closely related species produces many fine-grained orthologous groups of mostly one-to-one relations.

The level-of-orthology can be adjusted after the species or clades of interest have been selected (see Select species).

Species to display

By default, only genes from model species are shown in detail for the returned Orthologous Groups in the Orthologs by organism section of the results. This can be changed to a custom set of species using Species to display.

Results

Results of an OrthoDB query are shown as a list of relevant Orthologous Groups in a condensed view; click on a group to expand it into a detailed view.

Each detailed record of an Orthologous Group has the following sections:

Functional descriptions

OrthoDB provides tentative functional annotations of groups of orthologs and mapping to functional categories by summarizing functional gene annotations, extensively collected from other public resources. Annotation of genes is complicated and contains errors. Although in many cases OrthoDB makes such errors in the underlying data apparent, discordant annotations should be considered with caution.

Evolutionary descriptions

The evolutionary annotations of the orthologs remain a distinguishing feature of OrthoDB.

Phyletic Profile

is a summary of the ortholog presence (from universal to species-specific) and copy-numbers (single/multi-copy counts).

Evolutionary Rate

is a measure of whether this Orthologous Group exhibits appreciably higher or lower levels of sequence divergence, derived from quantification of the relative divergence among its member genes. It is computed for each orthologous group as the average of inter-species identities normalized to the average identity of all inter-species best reciprocal hits, computed from pairwise alignments of protein sequences. The relative rate is indicated by the position of the black star along the scale from slow (blue) to fast (red) rates.
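
The normalization described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OrthoDB pipeline: the function name is hypothetical, identities are assumed to be given as fractions between 0 and 1, and the normalization is assumed to be a plain ratio of the two averages. How the resulting value is placed on the blue-to-red scale is not specified in this guide.

```python
from statistics import mean


def relative_identity(group_identities, background_identities):
    """Average inter-species identity of one group, normalized to the background.

    group_identities      -- pairwise identities between member genes of the
                             group that belong to different species
    background_identities -- identities of all inter-species best reciprocal hits
    """
    if not group_identities or not background_identities:
        raise ValueError("both identity lists must be non-empty")
    return mean(group_identities) / mean(background_identities)
```

Under these assumptions a value below 1 means the group is less conserved than a typical pair of best reciprocal hits (faster divergence), and a value above 1 means it is more conserved (slower divergence).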

Gene Architecture

shows median and standard deviation values of protein lengths and exon counts for each orthologous group, effectively describing a 'consensus' gene architecture (for those genes with available data).

Orthologs by organism

This section can be very long. Use the navigation arrows on the left to go to the beginning or the end of the record, or the cross to collapse the detailed view to the condensed view.

The condensed view for each gene includes gene/protein ID, UniProt ID, short description, number of amino acids (AAs), number of exons, and associated InterPro domains.

For the length (AAs) and exon counts (Exons) listed for each gene, the exclamation mark (!) indicates differences from consensus (left: shorter, right: longer, !: 1 stdev, !!: 2 stdev).
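
The marker logic can be sketched as follows. This is a minimal Python illustration, not the site's code: the function name is hypothetical, the consensus is taken to be the group median, and the thresholds are assumed to be inclusive (at least 1 or at least 2 standard deviations).

```python
def deviation_flag(value, median, stdev):
    """Return (side, marks) for a gene's length or exon count, or None.

    side  -- "left" if the value is below the consensus, "right" if above
    marks -- "!" for a difference of at least 1 stdev, "!!" for at least 2
    """
    if stdev <= 0 or value == median:
        # Without spread, or without any difference, nothing is flagged here.
        return None
    n_stdev = abs(value - median) / stdev
    if n_stdev >= 2:
        marks = "!!"
    elif n_stdev >= 1:
        marks = "!"
    else:
        return None
    side = "left" if value < median else "right"
    return side, marks
```

For a group with a median protein length of 400 AAs and a standard deviation of 50, a 300 AA protein would be flagged with "!!" on the left, and a 460 AA protein with "!" on the right.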

Double-arrow icon

expands the view, when clicked, to show the annotations available for a given gene, with links to the source databases.

Available InterPro domain annotations are displayed for each protein member, ordered from the N- to the C-terminus. Click on the grey magnifying glass icon to query OrthoDB for groups containing proteins with the same domains. To search for specific domain architectures, enter an ordered list of InterPro identifiers, separated by commas only, into the 'Text Search' field.

Get All Fasta / View Fasta

retrieves the corresponding protein sequences in Fasta format. Group ID, gene, organism, and other useful details are contained in the header of each sequence. This information can be saved as a file by right-clicking on the link followed by "save link as...".

Get All as Tab Delimited / View Tab Delimited

retrieves the corresponding ortholog information as tab delimited text. This information can be saved as a file by right-clicking on the link followed by "save link as...".

Note that retrieving data in Fasta or tab-delimited format is limited to a maximum of 5000 groups.

Sibling Groups

Related orthologous groups at the same level-of-orthology are defined according to their common InterPro domain annotations. The top 5 groups are listed with their percentage overlap in terms of common InterPro domains, and the complete list of related groups may be retrieved by clicking the 'Show all siblings' link.

Uploading and analyzing your own sequences

Register

In order to be able to upload sequences for a custom analysis, you need to register:

  • Click on the "Register" link on the top right part of the OrthoDB webpage.
  • Enter your login details in the form that appears.

Data upload

Upload a fasta file with the sequences to be analyzed

After logging in, you can upload your sequences using the "Own data mapping" link (next to "Help").

After clicking on "Own data mapping", click on the "Upload" button and select your fasta-formatted file. Be aware that the file should contain amino acid sequences only.

After uploading is finished you will have to enter a species name in the corresponding field.

Select up to 5 species from the right panel; the mapping node will change accordingly, so that it represents the most recent common ancestor of the selected species. It is possible to select a different mapping node, as long as it is an ancestor of all the selected species.

Click on "Run analysis" to add your job to the mapping queue. When the job starts, the status should change from "CREATED" to "STARTED".

When the mapping is done, the status will change again, to "DONE".

Retrieve results

Download the results in a plain text file

Click on the "Download" button to get the mapping results. The name of this file contains all the mapping information:

  • node_XXX: where "XXX" is the NCBI taxon ID of the mapping node.
  • subnode_AAA_BBB_CCC_DDD_EEE: where AAA, BBB, CCC, DDD, and EEE are the NCBI taxon IDs for the selected species.
  • taxid_YYY: where YYY is a temporary taxon ID for your species.
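
The three parts of the file name can be pulled out with regular expressions. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and the exact order, separators and extension of the real file name are not given in this guide, so the patterns only look for the node_, subnode_ and taxid_ parts wherever they occur.

```python
import re


def parse_mapping_filename(name):
    """Extract mapping node, selected species and temporary taxid from a file name."""
    # 'node_' must not be the tail of 'subnode_', hence the look-behind.
    node = re.search(r"(?<![A-Za-z])node_(\d+)", name)
    subnode = re.search(r"subnode_(\d+(?:_\d+)*)", name)
    taxid = re.search(r"taxid_(\d+)", name)
    return {
        "node": int(node.group(1)) if node else None,
        "species": [int(x) for x in subnode.group(1).split("_")] if subnode else [],
        "taxid": int(taxid.group(1)) if taxid else None,
    }
```

With a made-up name such as node_33208_subnode_7227_9606_taxid_1000001.txt, this returns node 33208, species [7227, 9606] and taxid 1000001.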

The mapping file contains 9 fields:

  • Ortholog group name
  • Gene name
  • Ortholog type; for mapped sequences this field is a number >=10 and <20.
  • Length of the matching region (in amino acids).
  • Start coordinate of the match.
  • End coordinate of the match.
  • Score of the match.
  • Normalized score of the match.
  • E-value of the match.
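
A mapping file with these 9 fields can be read into typed records. The Python sketch below assumes one hit per line with tab-separated fields and no header line, which this guide does not state explicitly; the class and function names are hypothetical, and the separator can be changed through the sep argument.

```python
from dataclasses import dataclass


@dataclass
class MappingHit:
    group: str               # Ortholog group name
    gene: str                # Gene name
    ortholog_type: int       # >= 10 and < 20 for mapped sequences
    match_length: int        # length of the matching region (amino acids)
    start: int               # start coordinate of the match
    end: int                 # end coordinate of the match
    score: float             # score of the match
    normalized_score: float  # normalized score of the match
    evalue: float            # E-value of the match

    @property
    def is_mapped(self):
        return 10 <= self.ortholog_type < 20


def parse_mapping_line(line, sep="\t"):
    """Parse one line of the mapping file (field separator is an assumption)."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return MappingHit(
        fields[0], fields[1],
        int(fields[2]), int(fields[3]), int(fields[4]), int(fields[5]),
        float(fields[6]), float(fields[7]), float(fields[8]),
    )
```

Typical use would be to keep only the records for which is_mapped is true and then group them by the group field.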

Comparative Charts

This OrthoDB online tool generates a comparative overview of the gene content across selected genomes. The total gene counts and the fractions of orthologs among these species show the level of relatedness among the genomes, highlighting the "universal" core of genes and those evolving under single-copy constraint [PMID:21148284].

You can select up to 20 species on the right panel to include in the comparative genomics chart. The colors, patterns, etc. can be customised from the "Configure chart" tab on the right panel. The fractions shown are hyperlinked to the corresponding Orthologous Groups from which the gene counts were made. The tailored chart can then be exported as publication-quality vector graphics.

Explore an example

Bookmarking

Search results can be saved by simply bookmarking the result page or saving the URL text.

You can also drag & drop the bookmarklet link under Bookmark OrthoDB at the right side under the search field to the browser toolbar for easy OrthoDB search next time with the same settings. You can later just highlight a keyword somewhere on a web page and click on the saved bookmarklet to search OrthoDB for this keyword.

API

The OrthoDB data can be accessed programmatically through a URL-based interface. In our implementation this means that the data can be retrieved using the following:

URL

https://www.orthodb.org/CMD?ARG1="value"&ARG2="value"&...

where CMD is a command and all ARGx are arguments to that specific command. Below follows a description of the available commands with arguments.

NOTE: the request rate is limited to 1 request/second for the following URLs:
/blast
/tab
/fasta
If the rate is too high, some of the requests will fail with a 503 error.
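
A client script can stay under this limit by spacing its requests and retrying when a 503 is returned. The Python sketch below is one possible approach, not an official client: the Throttle class, the fetch function, the number of retries and the back-off delay are all hypothetical choices.

```python
import time
import urllib.error
import urllib.request


class Throttle:
    """Allow at most one call per min_interval seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch(url, throttle, retries=3, backoff=2.0):
    """GET a URL, pausing between requests and retrying on HTTP 503."""
    for attempt in range(retries + 1):
        throttle.wait()
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == retries:
                raise
            time.sleep(backoff * (attempt + 1))  # wait longer after each failure
```

A single Throttle instance should be shared by all /blast, /tab and /fasta requests of a script, since the limit applies to the requests together.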

Data Formats

All data is returned in JSON format, except for /fasta and /tab. JSON data is widely supported by many languages. An overview with many examples can be found here.

The JSON returned is of the generic format:

          {
             "url"    : full url of request
             "message": message string if status is error
             "status" : "ok" or "error"
             "data"   : array of data
          }
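
Because every JSON reply shares this envelope, a script can check the status once and pass on only the data. A minimal Python sketch; the function and exception names are hypothetical, and only the four keys listed above are assumed.

```python
import json


class OrthoDBError(RuntimeError):
    """Raised when a reply reports a status other than 'ok'."""


def unwrap(payload):
    """Decode a JSON reply and return its 'data' part, or raise on error."""
    reply = json.loads(payload)
    if reply.get("status") != "ok":
        raise OrthoDBError(reply.get("message", "unknown error"))
    return reply.get("data")
```

The /fasta and /tab commands do not return JSON, so their output should not be passed through such a function.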

The clusters and genes have OrthoDB specific ids.

Cluster id
Generic form CLIDatCLADE
CLID is a numerical cluster id
CLADE is the NCBI taxid of the clade
Example: 124at33208

NOTE prior to OrthoDB 10 the cluster ids were of the form:
Generic form FFFVVCCCCII, where

  • FFF either EOG (eukaryota) or POG (prokaryota)
  • VV OrthoDB version ('09' for both v9 and v9.1)
  • CCCC unique identifier for each clade
  • II unique cluster identifier within the clade

Example: EOG091G06KN

Gene id
Generic form taxid:geneid
taxid is the NCBI taxonomy id
geneid is a unique zero-padded hexadecimal identifier
Example: 10090:000d08
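
Both id forms are easy to validate and split before they are sent to the API. The Python sketch below follows only the generic forms given above (CLIDatCLADE and taxid:geneid); the function names are hypothetical, and the older EOG/POG cluster ids are deliberately rejected.

```python
import re

_CLUSTER_ID = re.compile(r"^(\d+)at(\d+)$")
_GENE_ID = re.compile(r"^(\d+):([0-9a-f]+)$")


def parse_cluster_id(cluster_id):
    """Split 'CLIDatCLADE' into (numerical cluster id, NCBI taxid of the clade)."""
    match = _CLUSTER_ID.match(cluster_id)
    if match is None:
        raise ValueError(f"not a cluster id of the form CLIDatCLADE: {cluster_id!r}")
    return int(match.group(1)), int(match.group(2))


def parse_gene_id(gene_id):
    """Split 'taxid:geneid' into (NCBI taxid, zero-padded hexadecimal gene id)."""
    match = _GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not a gene id of the form taxid:geneid: {gene_id!r}")
    return int(match.group(1)), match.group(2)
```

The hexadecimal part is kept as a string so that its zero padding is preserved when the id is rebuilt.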

Using the API
Interacting with the API can be done using either any web browser or a command line tool like 'wget' or 'curl'.

Linux: normally both are installed by default
Windows: wget and curl
Mac: 'curl' is usually installed natively, otherwise look here

Example: download Fasta for a given query and save it in the file 'data.fs':

wget 'http://www.orthodb.org/fasta?query=doublesex&level=9604' -O data.fs
curl 'http://www.orthodb.org/fasta?query=doublesex&level=9604' -o data.fs

Note the difference in options for specifying output file.
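
The same download can be scripted without external tools. The Python sketch below uses only the standard library; the helper names build_url and download are hypothetical, and the base URL is taken from the URL pattern shown earlier in this section.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.orthodb.org"


def build_url(command, **params):
    """Build a request URL of the form BASE_URL/CMD?ARG1=value&ARG2=value."""
    return f"{BASE_URL}/{command}?{urlencode(params)}"


def download(url, path):
    """Save the body of the reply to a local file."""
    with urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling download(build_url("fasta", query="doublesex", level=9604), "data.fs") would correspond to the wget and curl commands above. urlencode also takes care of escaping spaces and quotes in the query string.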

API Commands

/tree

  • Arguments:
    NONE

  • Returns:
    full tree used in OrthoDB

  • Description:
    This retrieves the full tree.

Example

/search

  • Arguments:
    query - full query string
    ncbi - flag: if 0, then generic search, if 1 the query is assumed to be a NCBI gene id
    level - NCBI taxon id of the clade
    skip - number of hits to skip
    limit - maximum nr of hits (cluster ids) to return - default is 1000
    universal - phyloprofile filter, present in 1.0, 0.9, 0.8 of all species in the clade
    singlecopy - phyloprofile filter, singlecopy in 1.0, 0.9, 0.8 of all species in the clade

  • Returns:
    a list of clusters, the maximum number of clusters is defined by 'limit'

  • Description:
    This finds all cluster id's matching a given query.

Example
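
When a query matches more clusters than 'limit', the 'skip' and 'limit' arguments can be combined to page through the hits. The Python sketch below shows one way to do this; the names iter_clusters and fetch_page are hypothetical, and fetch_page stands for any caller-supplied function that sends the given arguments to /search and returns the list of cluster ids from the reply.

```python
def iter_clusters(fetch_page, query, level, page_size=1000, **filters):
    """Yield all cluster ids matching a query, one page at a time.

    fetch_page -- callable taking a dict of /search arguments and returning
                  a list of cluster ids (hypothetical, supplied by the caller)
    filters    -- optional extra arguments such as universal=0.9 or singlecopy=0.8
    """
    skip = 0
    while True:
        params = dict(query=query, level=level, skip=skip, limit=page_size, **filters)
        clusters = fetch_page(params)
        yield from clusters
        if len(clusters) < page_size:
            break  # a short page means there is nothing left to fetch
        skip += page_size
```

Keeping the network call outside the generator makes it straightforward to add request pacing or error handling in one place.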

/blast

  • Arguments:
    all arguments for /search except query and ncbi
    seq - sequence string, without fasta-header
    species - comma separated list of NCBI numerical taxonomy ids
    inclusive - flag: 0 - return clusters containing at least one of given species, 1 - return all matches ignoring the species list (default)

  • Returns:
    list of OrthoDB clusters

  • Description:
    This finds all cluster id's with genes matching the given sequence. The list is sorted with the best matching cluster first.

Example

/group

  • Arguments:
    id - OrthoDB cluster id

  • Returns:
    annotation details on the given cluster id

  • Description:
    Retrieve detailed annotation information on the given cluster.

Example

/orthologs

  • Arguments:
    id - OrthoDB cluster id
    species - optional comma-separated list of species taxid's

  • Returns:
    a dictionary of tax id's, each containing a list of OrthoDB gene id's

  • Description:
    Retrieve all genes in a given cluster, possibly filtered wrt species.

Example

/ogdetails

  • Arguments:
    id - OrthoDB gene id

  • Returns:
    detailed information on the given gene id

  • Description:
    Retrieve further details on a given gene id.

Example

/siblings

  • Arguments:
    id - OrthoDB cluster id
    limit - max nr of returned siblings

  • Returns:
    a list of OrthoDB cluster id's

  • Description:
    Retrieve all siblings to the given cluster.

Example

/fasta

  • Arguments 1:
    id - OrthoDB cluster id
    species - list of NCBI species taxonomy id's

  • Arguments 2:
    all arguments for /search
    species - list of NCBI species taxonomy id's

  • Returns:
    sequences in fasta format
    Note that this query is limited by a maximum of 5000 clusters. If the limit is exceeded, a page is given with basic instructions on how to retrieve the information.

/tab

  • Arguments:
    same arguments as for /fasta
    long - flag: 0 (default) -> without sequence ; 1 -> include sequence

  • Returns:
    tab-separated table of gene annotations

RDF

This SPARQL 1.1 endpoint serves OrthoDB data as RDF. The OrthoDB release 10.1 consists of 2'246'378'105 RDF triples describing evolutionary and functional properties of 40'614'194 genes from 15247 organisms clustered in 8'952'780 orthologous groups on 1004 taxonomic levels.

Downloads

Use the API (Application Programming Interface) to download data if the data set is not too large.

OrthoDB data is also available as Flat files for download from here. This is recommended if you intend to process large parts of the data, or if a /fasta or /tab request exceeds the maximum number of clusters (5000).

OrthoDB Software is available for download from here (MD5: 9fd0d54ff575508be04d12965b14adeb).

  1. Untar the file (`tar -xzf OrthoDB_soft_X.tgz`). This will create a directory OrthoDB_soft_X with all files.
  2. Then follow the instructions in OrthoDB_soft_X/README.

FAQ

How can I ..?

..will come soon..

Contact

Email: support[at]orthodb.org

Join the OrthoDB-News mailing list (low traffic).

Funding

  • UNIGE
  • SIB
  • SNSF

Cite us

OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs Kriventseva EK et al, NAR, Nov 2018, doi:10.1093/nar/gky1053. PMID:30395283

..more & stats
