FAQs

Questions and answers for common NCBI Datasets questions

Why are NCBI Datasets CLI v13.x and older and API v1 being deprecated and retired?

NCBI Datasets API v1 and the command-line tool (CLI) v13.x and older are being retired so that we can focus on improved features in the newer versions. Retiring them ensures that users have access to the latest advancements and a more efficient experience.

What are the benefits of migrating to CLI v16+ and API v2alpha?

Migrating to CLI v16+ and API v2alpha offers several advantages, including access to enhanced functionality, improved performance, and ongoing support, ensuring a better user experience.

When will API v1 and CLI v13 deprecation and retirement occur?

The deprecation is set for June 2024, with retirement planned for December 2024. During this period, users are advised to migrate to the more recent versions, CLI v16+ and API v2alpha. API v2alpha will transition to API v2 and reach stability by June 2024.

Will my existing scripts and workflows using API v1 and CLI v13 continue to work after retirement?

No. API v1 and CLI v13 will stop working once they are retired. To keep your scripts and workflows running without interruption, migrate to CLI v16+ and API v2alpha or API v2.

What will happen to the Python and R libraries with the deprecation and retirement of API v1 and CLI v13?

The NCBI Datasets Python and R libraries that rely on API v1 and CLI v13 will no longer function after their retirement. Users are encouraged to explore the updated documentation on using the Datasets API with various programming languages. The documentation will provide guidance on interacting with the API using your preferred language, ensuring a seamless transition to the latest versions of NCBI Datasets.

How do I download SRA data?

Unfortunately, retrieval of SRA data, including retrieval of SRA data by BioProject accession, is not supported by NCBI Datasets. To download SRA data for a given BioProject, we recommend using the SRA Toolkit. For example, to obtain SRA data in FASTQ format for the BioProject accession PRJNA648656, try the following:

  1. Download and install the SRA Toolkit.
  2. Run prefetch PRJNA648656 to download SRA data in .sra format.
  3. Run fasterq-dump PRJNA648656 to extract FASTQ files from the downloaded .sra files.

For more information, see the SRA Tools Wiki pages, Downloading SRA Toolkit and How to use prefetch and fasterq-dump to extract FASTQ-files.
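The steps above can also be scripted. The following is a minimal sketch that builds the two SRA Toolkit commands from steps 2 and 3 and runs them only if both tools are found on the PATH. The helper names `sra_fastq_commands` and `run_if_installed` are hypothetical and exist only for this illustration.

```python
import shutil
import subprocess


def sra_fastq_commands(accession):
    """Return the commands from steps 2 and 3 as argument lists."""
    return [
        ["prefetch", accession],      # step 2: download data in .sra format
        ["fasterq-dump", accession],  # step 3: extract FASTQ files
    ]


def run_if_installed(commands):
    """Run the commands in order, or do nothing if a tool is missing."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            print(f"{cmd[0]} not found on PATH; install the SRA Toolkit first")
            return False
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises if a command fails
    return True


commands = sra_fastq_commands("PRJNA648656")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_if_installed(commands)` starts the actual download, so it is left out of the sketch on purpose.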

How is the datasets command-line tool (CLI) version 14+ (CLI v14+) different from version 13 (CLI v13.x) and previous versions?

The new datasets CLI v14+:

  • Provides easier access to metadata
  • Produces smaller data packages (faster downloads)
  • Offers expanded content for virus genomes
  • Delivers genome sequences as a single file by default
  • Uses simpler command syntax (data files are now included using the --include flag)

You can get more information about new features and other updates in our release notes on GitHub.

Easier access to metadata
All metadata can now be printed to the screen (using the command datasets summary), redirected to a file, or piped to dataformat to generate a customized table (using the command dataformat tsv). Additionally, metadata formats have been standardized across services, and all metadata schemas are now documented. Previously, some metadata was only available as part of a downloaded data package.
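As an illustration of piping metadata into dataformat, here is a minimal sketch that builds a `datasets summary` command and a `dataformat tsv` command and connects them with a pipe. The `--as-json-lines` flag and the field names `accession` and `organism-name` are assumptions based on the CLI v14+ tools and should be checked against `datasets summary genome --help` and `dataformat tsv genome --help`. The helper names are hypothetical.

```python
import shutil
import subprocess


def metadata_table_pipeline(taxon, fields):
    """Build the two halves of: datasets summary ... | dataformat tsv ..."""
    # Assumption: --as-json-lines emits one JSON record per line for dataformat.
    summary = ["datasets", "summary", "genome", "taxon", taxon, "--as-json-lines"]
    # Assumption: the field names passed in are valid for `dataformat tsv genome`.
    table = ["dataformat", "tsv", "genome", "--fields", ",".join(fields)]
    return summary, table


def run_pipeline(summary, table):
    """Pipe the summary output into dataformat; return the TSV text or None."""
    if shutil.which(summary[0]) is None or shutil.which(table[0]) is None:
        return None  # tools not installed, nothing to run
    producer = subprocess.Popen(summary, stdout=subprocess.PIPE)
    result = subprocess.run(
        table, stdin=producer.stdout, capture_output=True, text=True, check=True
    )
    producer.stdout.close()
    producer.wait()
    return result.stdout


summary, table = metadata_table_pipeline("human", ["accession", "organism-name"])
print(" ".join(summary), "|", " ".join(table))
```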

Smaller data packages
Data packages now include a smaller set of files by default, so downloads are faster and more reliable. For example, the default genome data package now includes only genome sequence and the data report file. You also have the option to include all other sequence, annotation and report files.

Expanded content for virus genomes
All genomes in NCBI Virus are now available through Datasets.

Genome sequences are now delivered as a single file
CLI v14+ now delivers genome sequences as a single file by default. You also have the option to request genome sequences as separate files by chromosome using --chromosomes.

Simpler command syntax
CLI v14+ offers a simpler way to request specific data files and data reports (metadata) compared to previous CLI versions. Data files can be specified using a single --include flag instead of multiple exclude flags. For example, genome and protein sequences for the current human reference genome can be downloaded using:
datasets download genome taxon human --reference --include genome,protein

You can also add data reports to the data package using the --include flag.

Why does --exclude not work in CLI v14+?

We have removed the multiple --exclude flags from CLI v14+ in favor of a single --include flag. Data package content can be customized by specifying the desired data or data reports (metadata) after the --include flag. Combined with changes to the contents of our default data packages, requesting the data you want is simpler and more intuitive. For example, to get genome and protein sequences for the human reference genome, try the following:

datasets download genome taxon human --reference --include genome,protein
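For scripted workflows, the same command can be assembled programmatically. This is a minimal sketch, assuming only the flags shown in the example above. The function name `genome_download_command` is hypothetical, and the sketch does not validate the values passed to --include.

```python
import shlex


def genome_download_command(taxon, include, reference=True):
    """Build a `datasets download genome taxon` command as an argument list."""
    cmd = ["datasets", "download", "genome", "taxon", taxon]
    if reference:
        cmd.append("--reference")
    if include:
        # A single --include flag takes a comma-separated list of file types.
        cmd += ["--include", ",".join(include)]
    return cmd


cmd = genome_download_command("human", ["genome", "protein"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` once the datasets CLI is installed.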

Which version of the documentation should I use?

With the release of datasets version 14, we now have two documentation versions. The first, CLI v16+ (API v2alpha), describes the latest version of the datasets CLI (v16.x) and the underlying API (v2alpha). The second, CLI v13.x (API v1), describes the previous version of datasets (v13.x) and the underlying API (v1). You can choose your preferred documentation version or toggle between the two using the drop-down options on the left side of each documentation page.

CLI v16+ (API v2alpha) documentation
CLI v16+ (API v2alpha) describes the latest version of datasets and the underlying API. Please refer to this latest documentation if you are using the latest version of datasets and dataformat v14+, or are using the latest version of the NCBI Datasets API (v2alpha).

CLI v13.x (API v1) documentation
CLI v13.x (API v1) describes the previous version of datasets and the underlying API. Please refer to this documentation version if you are using previous versions of datasets and dataformat v13.x or earlier, or are using the previous version of the NCBI Datasets API (v1).

We recommend that you upgrade to the latest version of the CLI. However, in some cases your workflow or code may stop working after the upgrade because of breaking changes in the CLI syntax, data report schemas, or default data package contents. In such cases, you may choose to continue using a previous version of the CLI.

Where is the data I requested?

Your data is in the subdirectory ncbi_dataset/data/ within the zip archive you downloaded.
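To check what a downloaded archive contains without extracting it, the files under ncbi_dataset/data/ can be listed directly. The sketch below uses only the Python standard library. The in-memory archive and the file names in it are stand-ins invented for the demonstration, not real package contents.

```python
import io
import zipfile

DATA_PREFIX = "ncbi_dataset/data/"


def list_data_files(archive):
    """Return the names of files stored under ncbi_dataset/data/ in the zip."""
    with zipfile.ZipFile(archive) as zf:
        return [
            name
            for name in zf.namelist()
            if name.startswith(DATA_PREFIX) and not name.endswith("/")
        ]


# Stand-in archive built in memory; the file names here are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("README.md", "placeholder")
    zf.writestr("ncbi_dataset/data/example_report.jsonl", "{}\n")

print(list_data_files(buffer))
```

For a real download, pass the path to the zip archive instead of `buffer`.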

I still can’t find my data, can you help?

We have identified a bug affecting Mac Safari users. When downloading data from the NCBI Datasets web interface, you may see only a README file after the download has completed (while other files appear to be missing). As a workaround to prevent this issue from recurring, we recommend disabling automatic zip archive extraction in Safari until Apple releases a bug fix. For more information, visit: Mac Safari zip archive bug

What file formats can be downloaded using NCBI Datasets?

Datasets offers the following file formats (if available for the requested query):

  • Sequence files in FASTA format: genomic, gene, transcript and protein sequences
  • Annotation files: GTF, GFF3, and GBFF
  • Metadata files: JSON and JSON Lines

What is a data package?

A “data package” is an NCBI Datasets zip archive that contains sequence, annotation, metadata and other biological data. For more information, see Data packages.

What is a dehydrated data package and what is rehydration?

A dehydrated data package is a zip archive that contains only metadata and the location of sequence and other data files on NCBI servers.
Rehydration is the process of downloading the data itself.
Downloading a dehydrated data package and then rehydrating it is the best way to download large genome data packages, meaning those that contain more than 1,000 genomes or more than 15 GB of data. For more information, see our How-to guide on downloading large genome data packages.
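The rule of thumb above can be expressed as a small check, followed by the three commands of a dehydrate and rehydrate workflow. This is a sketch under stated assumptions: the flags `--inputfile`, `--dehydrated`, `--filename` and `datasets rehydrate --directory` reflect the CLI v14+ tools as we understand them and are not described on this page, so confirm them with `--help`. The function names and file names are hypothetical.

```python
GENOME_LIMIT = 1000   # more than 1,000 genomes
SIZE_LIMIT_GB = 15    # more than 15 GB of data


def should_dehydrate(genome_count, size_gb):
    """Apply the thresholds from the text: either limit exceeded is enough."""
    return genome_count > GENOME_LIMIT or size_gb > SIZE_LIMIT_GB


def dehydrated_workflow(accessions_file, zip_name, directory):
    """Return the download, unzip and rehydrate commands as argument lists."""
    return [
        # Assumed flags: --inputfile, --dehydrated, --filename
        ["datasets", "download", "genome", "accession",
         "--inputfile", accessions_file, "--dehydrated", "--filename", zip_name],
        ["unzip", zip_name, "-d", directory],
        # Assumed subcommand and flag: rehydrate --directory
        ["datasets", "rehydrate", "--directory", directory],
    ]


if should_dehydrate(genome_count=2500, size_gb=4.0):
    for cmd in dehydrated_workflow("accessions.txt", "genomes.zip", "genomes"):
        print(" ".join(cmd))
```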

Why do gene counts differ when comparing taxonomy and species pages to the gene table?

Gene counts on the taxonomy and species pages are derived from the annotation report. The annotation report and other genome annotation files represent a snapshot of the genome at the time of genome annotation. In contrast, the gene table and gene data obtained from the datasets CLI (datasets download gene...) contain current gene data, including unannotated genes, genes created after the last annotation, and any updates made to existing genes after the last annotation. For some model organisms, particularly human, frequent manual curation means that current gene data is likely to differ from the most recent annotation.

What are atypical genomes?

[Image: genome warning message]

Atypical genomes are genomes with one or more problems that have been identified by NCBI relating to quality, unusual size, or other flaws in the genome assembly. See atypical assemblies for the list of problems that result in an affected genome being designated as atypical.

On individual genome pages, atypical genomes can be identified by the presence of a warning icon, consisting of a yellow triangle containing an exclamation point, and the type of genome problem at the top of the page (see image above).

On genome table pages, atypical genomes are identified with a warning icon, consisting of a yellow triangle containing an exclamation point, next to the genome assembly name. The type of genome problem(s) is shown when you hover over the warning icon. Atypical genomes can be excluded from the genome table display by selecting the “Exclude atypical genomes” checkbox in the Filters section of the page.

The assembly data report includes two fields that indicate whether a genome is atypical and the specific type of genome problem. When a genome problem has been identified, atypical.is_atypical is true, and atypical.warnings will include the type of genome problem(s).
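The two fields can be checked programmatically when reading a data report in JSON Lines format. The sketch below takes the field names in the paragraph above literally, as a nested `atypical` object with `is_atypical` and `warnings` keys. This is an assumption: the actual JSON key casing and nesting in the report may differ, so adjust the lookups to match the documented schema. The sample records and the warning text are invented for illustration.

```python
import json


def atypical_warnings(report_line):
    """Return the list of warnings for one JSON Lines record, or []."""
    record = json.loads(report_line)
    # Assumed layout: {"atypical": {"is_atypical": bool, "warnings": [...]}}
    atypical = record.get("atypical", {})
    if atypical.get("is_atypical"):
        return atypical.get("warnings", [])
    return []


# Invented sample records, one JSON object per line.
sample_lines = [
    '{"accession": "EXAMPLE_1", "atypical": {"is_atypical": true, "warnings": ["example warning"]}}',
    '{"accession": "EXAMPLE_2"}',
]

for line in sample_lines:
    print(json.loads(line)["accession"], atypical_warnings(line))
```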

How does NCBI decide which genomes to annotate?

Only genomes with assemblies that are publicly available in INSDC (DDBJ, ENA or GenBank) are considered for inclusion in RefSeq and processing by the eukaryotic genome annotation pipeline. NCBI makes this selection based on several factors. For more information, see Genomes Selected for RefSeq Annotation.

What is the difference between a GenBank (GCA) and RefSeq (GCF) genome assembly?

A GenBank (GCA) genome assembly contains assembled genome sequences submitted by investigators or sequencing centers to GenBank or another member of the International Nucleotide Sequence Database Collaboration (INSDC). The GenBank (GCA) assembly is an archival record that is owned by the submitter and may or may not include annotation. A RefSeq (GCF) genome assembly represents an NCBI-derived copy of a submitted GenBank (GCA) assembly. RefSeq (GCF) assembly records are maintained by NCBI. In some cases the RefSeq (GCF) assembly may not be completely identical to the GenBank (GCA) assembly due to assembly improvements made by NCBI staff. All RefSeq (GCF) genome assemblies include annotation.
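Because the two kinds of assembly are distinguished by their accession prefix, a script can tell them apart with a simple check. This is a minimal sketch; the function name `assembly_source` is hypothetical, and the check looks only at the prefix without validating the rest of the accession.

```python
def assembly_source(accession):
    """Classify an assembly accession as GenBank (GCA) or RefSeq (GCF)."""
    prefix = accession.strip().upper()[:4]
    if prefix == "GCA_":
        return "GenBank"
    if prefix == "GCF_":
        return "RefSeq"
    raise ValueError(f"not a GCA or GCF assembly accession: {accession!r}")


for acc in ["GCA_000001405.29", "GCF_000001405.40"]:
    print(acc, assembly_source(acc))
```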

[Image: GCA vs GCF table]

Generated May 21, 2024