Retrieve ortholog data and metadata

Quick overview

Gene orthologs can be retrieved by gene-id, accession or symbol using the --ortholog flag.

  • gene-id and accession are unique identifiers, so the associated taxon is implied.
  • symbol is not a unique identifier (human and cat share the symbol BRCA1), so it is necessary to specify a taxon. datasets uses human as the default species.

For example, these are the identifiers for the human BRCA1 DNA repair associated gene and its gene orthologs in cat and Florida manatee:

Species          gene-id    accession       symbol
Human            672        NM_007297.4     BRCA1
Cat              101081937  XM_019817934.2  BRCA1
Florida manatee  101356605  XM_023725233.1  LOC101356605

In the examples below, we will use the datasets download and datasets summary commands. In short, datasets summary returns only metadata in JSON or JSON-Lines format, while datasets download retrieves a gene data package that includes both metadata and sequence files.

The --ortholog flag

The --ortholog flag serves two purposes:

  1. It explicitly requests an ortholog set for a gene-id, accession or symbol.
  2. It defines the taxonomic scope of the ortholog set.

The --ortholog flag requires an argument after it. The options are:

  • --ortholog all: this option returns the complete ortholog set available for the requested gene, with no filter.
  • --ortholog <any taxon>: this option restricts the requested ortholog set to the given taxonomic range. An example below shows how to filter an ortholog set by taxon.

Simplest example: retrieve one gene ortholog set

All of the following commands will download the same gene ortholog set:

datasets download gene gene-id 672 --ortholog all
datasets download gene symbol brca1 --ortholog all
datasets download gene accession NM_007297.4 --ortholog all

Retrieve multiple gene ortholog sets based on a gene list

NCBI datasets can retrieve multiple ortholog sets based on a list of symbols, accessions or gene-ids. Currently, datasets does not separate each ortholog set into its own files; all sets are saved in a single data package.

For example: if we provide a list of gene-ids (one per line or comma-separated) using the flag --inputfile, datasets will iterate over those and save the results as a single data package.

$ cat genelist.txt
672
4157
3206

$ datasets download gene gene-id --inputfile genelist.txt --ortholog all --filename ort.zip
$ unzip ort.zip -d ort
$ tree ort
ort
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        |-- protein.faa
        `-- rna.fna

If we want each ortholog data package to be saved separately, we can use a loop instead:

Command:

while IFS= read -r GENE; do
    datasets download gene gene-id "${GENE}" --ortholog all --filename "${GENE}.zip"
done < genelist.txt

Result:

Collecting 319  records [===============================================>] 100% 318/319
Collecting 318  records [================================================] 100% 318/318
Downloading: 672.zip    4.11MB done
Collecting 271  records [================================================] 100% 271/271
Collecting 271  records [================================================] 100% 271/271
Downloading: 4157.zip    312kB done
Collecting 431  records [===============================================>] 100% 430/431
Collecting 430  records [================================================] 100% 430/430
Downloading: 3206.zip    587kB done

In this case, the list of genes must have one gene-id per line.
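Because the loop reads one gene-id per line, a comma-separated list has to be converted first. The following is a minimal sketch of that conversion; the file names genelist_csv.txt and genelist_one_per_line.txt are hypothetical and chosen only for illustration.

```shell
# Hypothetical comma-separated input; the gene-ids are the ones used above.
printf '672,4157,3206\n' > genelist_csv.txt

# Put each gene-id on its own line, strip stray whitespace, drop empty lines.
tr ',' '\n' < genelist_csv.txt | tr -d ' \t\r' | sed '/^$/d' > genelist_one_per_line.txt

cat genelist_one_per_line.txt
```

The resulting file can then be used as the input of the loop in place of genelist.txt.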

Filter an ortholog gene set by taxon

NCBI datasets can filter the ortholog set by taxon (at any rank) when the taxon is specified after the --ortholog flag. For example, you can restrict the BRCA1 (gene-id 672) ortholog set to members of Mustelidae, the family that includes otters, weasels and badgers.

To list the Mustelidae species for which gene orthologs of human BRCA1 have been calculated, use datasets summary with dataformat:

datasets summary gene gene-id 672 --ortholog mustelidae --as-json-lines | dataformat tsv gene --fields tax-name

Output:

Taxonomic Name
Enhydra lutris kenyoni
Mustela erminea
Lontra canadensis
Neogale vison
Mustela putorius furo
Meles meles
Lutra lutra
Mustela lutreola
Mustela nigripes

The full BRCA1 ortholog set includes 306 species, while the Mustelidae set has only 9 species.
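The species count can be checked from the TSV output itself. The sketch below assumes the output shown above was saved to a file; the file name mustelidae.tsv is hypothetical, and the rows are copied from that output rather than fetched again.

```shell
# Sample TSV copied from the output above; in practice this would be the
# saved output of the datasets summary | dataformat pipeline.
cat > mustelidae.tsv <<'EOF'
Taxonomic Name
Enhydra lutris kenyoni
Mustela erminea
Lontra canadensis
Neogale vison
Mustela putorius furo
Meles meles
Lutra lutra
Mustela lutreola
Mustela nigripes
EOF

# Skip the header line, then count the unique taxonomic names.
species_count=$(tail -n +2 mustelidae.tsv | sort -u | wc -l | tr -d ' ')
echo "Mustelidae species: ${species_count}"
```

For this sample the count is 9, matching the number of rows listed above.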

Alternatively, you can download a data package for these Mustelidae gene orthologs:

datasets download gene gene-id 672 --ortholog mustelidae --filename mustelidae.zip
Collecting 9 gene records [================================================] 100% 9/9
Downloading: mustelidae.zip    171kB valid zip archive
Validating package files [================================================] 100% 5/5

Retrieve an ortholog set by symbol using the --taxon flag

By default, datasets assumes the taxon to be human (Taxonomy ID: 9606) when an ortholog set is requested by symbol. If the symbol belongs to an ortholog set that contains no human gene, the query returns an empty result unless the --taxon flag is given. For example, when we query by the mouse gene symbol syna, we get the following result:

$ datasets summary gene symbol syna --ortholog all
{"total_count": 0}

If we specify mouse (TaxId: 10090) with the flag --taxon, then datasets will return the syna ortholog set:

$ datasets summary gene symbol syna --taxon 10090 --ortholog all
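The two queries above can be combined into a fallback: try the default taxon first and retry with an explicit taxon only when the result is empty. This is a sketch under assumptions, not part of datasets. The function names is_empty_summary and ortholog_summary_by_symbol are hypothetical, and the emptiness check assumes an empty result looks like the {"total_count": 0} output shown above.

```shell
# Return success (0) when a datasets summary JSON reports zero records.
# Assumes the empty result has the form shown above: {"total_count": 0}
is_empty_summary() {
    printf '%s\n' "$1" | grep -Eq '"total_count"[[:space:]]*:[[:space:]]*0([^0-9.]|$)'
}

# Hypothetical wrapper: query by symbol with the default taxon (human) and
# retry with an explicit taxon only if the first result is empty.
ortholog_summary_by_symbol() {
    symbol=$1
    fallback_taxon=$2
    result=$(datasets summary gene symbol "$symbol" --ortholog all)
    if is_empty_summary "$result"; then
        result=$(datasets summary gene symbol "$symbol" --taxon "$fallback_taxon" --ortholog all)
    fi
    printf '%s\n' "$result"
}

# Check the helper against the empty result shown above (no network needed).
if is_empty_summary '{"total_count": 0}'; then
    echo "empty result: retry with --taxon"
fi
```

With datasets installed, the wrapper would be called as ortholog_summary_by_symbol syna 10090.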

How to retrieve ortholog metadata

Using dataformat

In addition to datasets, the dataformat command-line tool can extract metadata from the gene data report that is included in the data packages or returned by the datasets summary command.

Create a tsv file from the datasets summary JSON-Lines output using dataformat:

datasets summary gene symbol brca1 --ortholog all --as-json-lines | \
dataformat tsv gene --fields tax-name,gene-id,symbol,group-id > brca1.tsv
head brca1.tsv

Result:

Taxonomic Name          NCBI GeneID     Symbol  Gene Group Identifier
Sus scrofa              100049662       BRCA1   672
Equus caballus          100051990       BRCA1   672
Taeniopygia guttata     100224649       BRCA1   672
Oryctolagus cuniculus   100347269       BRCA1   672
Callithrix jacchus      100388186       BRCA1   672
Pongo abelii            100439533       BRCA1   672
Ailuropoda melanoleuca  100480891       BRCA1   672
Anolis carolinensis     100553919       brca1   672
Nomascus leucogenys     100580360       BRCA1   672
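The result also shows why a symbol is not a reliable key across species: the Anolis carolinensis ortholog is recorded as brca1 rather than BRCA1. As a minimal sketch, awk can list the rows whose symbol differs from the human one. The file name brca1_sample.tsv is hypothetical, and its rows are a small subset of the result above.

```shell
# Small tab-separated sample taken from the result above.
{
    printf 'Taxonomic Name\tNCBI GeneID\tSymbol\tGene Group Identifier\n'
    printf 'Sus scrofa\t100049662\tBRCA1\t672\n'
    printf 'Anolis carolinensis\t100553919\tbrca1\t672\n'
    printf 'Nomascus leucogenys\t100580360\tBRCA1\t672\n'
} > brca1_sample.tsv

# Print taxon and GeneID for rows whose symbol is not exactly "BRCA1".
awk -F '\t' 'NR > 1 && $3 != "BRCA1" { print $1 "\t" $2 }' brca1_sample.tsv
```

Running the same awk command on the full brca1.tsv would list every ortholog with a different symbol, which is where the gene-id or the Gene Group Identifier column is the safer key.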
Generated May 16, 2024