NCBI Genomes FTP

The Genomes FTP site offers a consistent core set of files for the genome sequence and annotation products of all organisms and assembled genomes in scope. It supports download needs including:

  • Retrieving the genome sequence for a specific assembled genome

  • Retrieving GenBank or RefSeq gene, RNA, and protein annotation for a specific organism and a specific assembled genome, or a specific RefSeq annotation release

  • Retrieving annotation in GenBank flat file, GFF3, or GTF format

  • Retrieving RefSeq annotation for mitochondria, plastids, and plasmids

  • Obtaining assembly summary reports containing metadata for the latest and historical GenBank and RefSeq assembled genomes

  • Obtaining matching sets of sequence (FASTA) and annotation (GFF3 or GTF) files with identical sequence identifiers to facilitate reproducible analyses

  • Verifying with MD5 checksums that downloaded content is complete
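
As an illustration of the checksum use case, the following minimal sketch compares local files against the checksum file in a downloaded assembly directory. The default file name md5checksums.txt and the assumed line layout ("<md5 hex>  ./<file name>") are assumptions, not taken from this page; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Map file name -> expected MD5 from the text of a checksum file.

    Assumed line layout: "<md5 hex>  ./<file name>".
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            name = parts[1].strip()
            if name.startswith("./"):
                name = name[2:]
            expected[name] = parts[0].lower()
    return expected


def verify_directory(directory, checksum_name="md5checksums.txt"):
    """Return {file name: True/False} for listed files present locally."""
    directory = Path(directory)
    expected = parse_checksums((directory / checksum_name).read_text())
    results = {}
    for name, md5 in expected.items():
        target = directory / name
        if target.is_file():  # files that were not downloaded are skipped
            results[name] = md5_of_file(target) == md5
    return results
```

Calling verify_directory on a local copy of an assembly directory returns one entry per listed file that is present; any False value points to an incomplete or corrupted download.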

FAQs

What is the easiest way to download data for one or more assembled genomes?

  • Using NCBI Datasets. This is the most user-friendly way to download genome data. Please see NCBI Datasets Documentation for more details.
  • From the Genomes FTP site: Users interested in additional files that are currently not included in the Datasets package (see table) can browse the Genomes FTP site to download them piecemeal or download in bulk using command-line tools such as lftp and rsync.
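
For bulk download with rsync, a minimal sketch of how a command could be assembled from a directory URL is shown below. It assumes that the host also serves the same paths over the rsync protocol and that swapping the URL scheme is sufficient; the helper names are hypothetical, and the rsync options are standard rsync flags rather than settings prescribed by this page.

```python
import shlex
from urllib.parse import urlsplit


def to_rsync_url(url):
    """Swap the scheme of an https://, http://, or ftp:// URL for rsync://."""
    parts = urlsplit(url)
    if parts.scheme not in {"https", "http", "ftp", "rsync"}:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return f"rsync://{parts.netloc}{parts.path}"


def rsync_command(url, dest="."):
    """Build an rsync argument list that copies one remote directory."""
    # The trailing slash asks rsync to copy the directory contents.
    source = to_rsync_url(url).rstrip("/") + "/"
    return [
        "rsync",
        "--recursive",   # descend into sub-directories
        "--copy-links",  # copy the files that symbolic links point to
        "--times",       # preserve modification times
        "--verbose",
        source,
        dest,
    ]


if __name__ == "__main__":
    # Directory taken from the example URL elsewhere on this page.
    url = "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/"
    print(shlex.join(rsync_command(url, "Sulfolobus_islandicus/")))
```

The resulting list can be passed to subprocess.run, or printed and pasted into a shell.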

How can I download only the current annotation for an organism?

Most users will want to download data only for the latest annotation for the reference assembled genome of an organism. This data is available in NCBI Datasets.

How can I stay informed about changes to the NCBI Genomes FTP site?

Subscribe to the Genomes-announce mailing list or follow the NCBI Insights blog.

Are files on the FTP site updated following annotation updates?

All new annotation releases are published to the Genomes FTP site.

How can I download older annotation files?

NCBI Datasets delivers the latest annotation for any assembled genome version. When NCBI has annotated an assembled genome more than once and users need a specific older annotation release for that assembled genome, they can download it from the annotation_releases directory on the Genomes FTP site.

What is the file content within each specific assembled genome directory?

Directories for all current assembled genomes, and for many previous versions, include a core set of files, plus additional files relevant to the specific assembled genome. Directories for old assembled genome versions that predate the Genomes FTP site reorganization contain only the assembly report, assembly stats, and assembly status files (see table).

Table: Sequence and Annotation Files Available on Genomes FTP

File | Format | Description
*_ani_contam_ranges.tsv [G/R] | Tab-delimited text | Reports potentially contaminated regions in the assembly identified based on Average Nucleotide Identity (ANI).
*_ani_report.txt [G/R] | Tab-delimited text | Reports Average Nucleotide Identity (ANI) based evaluation of the taxonomic identity of the assembly.
*_assembly_regions.txt [R] | Tab-delimited text | Reports the location of genomic regions and lists the alt/patch scaffolds placed within those regions.
*_assembly_report.txt [G/R] | Tab-delimited text | Reports the name, role, and sequence accession.version for objects in the assembly.
*_assembly_stats.txt [G/R] | Tab-delimited text | Reports statistics for the assembly.
*_cds_from_genomic.fna.gz [D/G/R] | FASTA | Nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence.
*_feature_count.txt.gz [G/R] | Tab-delimited text | Reports counts of gene, RNA, CDS, and similar features based on data reported in the *_feature_table.txt.gz file.
*_feature_table.txt.gz [G/R] | Tab-delimited text | Reports locations and attributes for a subset of annotated features.
*_gene_expression_counts.txt.gz [R] | Tab-delimited text | Reports counts of RNA-seq reads mapped to each gene.
*_normalized_gene_expression_counts.txt.gz [R] | Tab-delimited text | Reports normalized counts (TPM) of RNA-seq reads mapped to each gene.
*_gene_ontology.gaf.gz [R] | GO Annotation File (GAF) | Gene Ontology (GO) annotation of the annotated genes.
*_genomic.fna.gz [D/G/R] | FASTA | Genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case.
*_genomic.gbff.gz [G/R] | GenBank flat file | Genomic sequence(s) in the assembly.
*_genomic.gff.gz [D/G/R] | GFF3 | Annotation of the genomic sequence(s).
*_genomic.gtf.gz [D/G/R] | GTF | Annotation of the genomic sequence(s).
*_genomic_gaps.txt.gz [G/R] | Tab-delimited text | Reports the coordinates of all gaps in the top-level genomic sequences.
*_protein.faa.gz [D/G/R] | FASTA | Sequences of accessioned protein products annotated on the genome assembly.
*_protein.gpff.gz [G/R] | GenPept flat file | Sequences of accessioned protein products annotated on the genome assembly.
*_pseudo_without_product.fna.gz [R] | FASTA | Genomic sequence corresponding to pseudogene and other gene regions which do not have any associated transcribed RNA products or translated protein products.
*_rm.out.gz [R] | Text | RepeatMasker output (provided for some eukaryotes).
*_rm.run [R] | Text | Documentation of the RepeatMasker version, parameters, and library (provided for some eukaryotes).
*_rna.fna.gz [D/R] | FASTA | Sequences of accessioned RNA products annotated on the genome assembly.
*_rna.gbff.gz [R] | GenBank flat file | RNA products annotated on the genome assembly (provided for RefSeq assemblies as relevant).
*_rna_from_genomic.fna.gz [G/R] | FASTA | Nucleotide sequences corresponding to all RNA features annotated on the assembly, based on the genome sequence.
*_rnaseq_alignment_summary.txt [R] | Tab-delimited text | Reports counts of alignments classified by Subread featureCounts.
*_rnaseq_runs.txt [R] | Tab-delimited text | Information about RNA-seq runs used for gene expression analyses.
*_translated_cds.faa.gz [G/R] | FASTA | Individual CDS features annotated on the genomic records, conceptually translated into protein sequences.
*_wgsmaster.gbff.gz [G] | GenBank flat file | WGS master for the assembly (present only if a WGS master record exists for the sequences in the assembly).
.annotation_hashes.txt [G/R] | Tab-delimited text | Reports hash values for different aspects of the annotation data.
.assembly_status.txt [G/R] | Text | Reports the current status of this assembly version.
.md5checksums.txt [G/R] | Text | File checksums are provided for all data files in the directory.
*_knownrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R] | BAM | Alignments of the annotated Known RefSeq transcripts (identified with accessions prefixed with NM_ and NR_) to the genome.
*_knownrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R] | BAM Index | Index of the BAM alignments of the annotated Known RefSeq transcripts to the genome.
*_modelrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R] | BAM | Alignments of the annotated Model RefSeq transcripts (identified by accessions prefixed with XM_ and XR_) to the genome.
*_modelrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R] | BAM Index | Index of the BAM alignments of the annotated Model RefSeq transcripts to the genome.
*_compare_prev.txt.gz (Annotation_comparison sub-directory) [R] | Tab-delimited text | Annotation comparison report.
*_cross_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R] | GFF3 | Alignments of cDNAs, ESTs, and TSAs from other species to the genomic sequence(s).
*_same_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R] | GFF3 | Alignments of same species cDNAs, ESTs, and TSAs to the genomic sequence(s).
*_gnomon_model.gff.gz (Gnomon_models sub-directory) [R] | GFF3 | Gnomon annotation of the genomic sequence(s).
*_gnomon_protein.faa.gz (Gnomon_models sub-directory) [R] | FASTA | Gnomon protein models annotated on the genome assembly.
*_gnomon_rna.fna.gz (Gnomon_models sub-directory) [R] | FASTA | Gnomon transcript models annotated on the genome assembly.
*_graph.bw (RNASeq_coverage_graphs directory) [R] | UCSC BigWig | RNA-seq read coverage graphs. Alternative style: subdir/*_file.txt

D: Datasets; G: GenBank; R: RefSeq

* Sample file path with either a GCA or GCF prefix, where each hash sign (#) represents a digit in the actual path: https://ftp.ncbi.nih.gov/genomes/all/GC[A/F]/###/###/###/GC[A/F]_#########_(assembly name)/GC[A/F]_#########_(assembly name)
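
The sample path pattern above can be sketched in code as follows. This is only an illustration: the function names are hypothetical, the accession and assembly name in the usage example are placeholders rather than real records, and the sketch assumes that an accession may carry a ".version" suffix that is kept as given in the directory name.

```python
import re

# Host and root directory as written in the sample path above.
BASE_URL = "https://ftp.ncbi.nih.gov/genomes/all"


def assembly_directory_url(accession, assembly_name):
    """Build the directory URL for an assembly from the path pattern."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})(\.\d+)?", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits = match.group(1), match.group(2)
    # The nine digits are split into three directory levels of three digits.
    triplets = "/".join(digits[i:i + 3] for i in range(0, 9, 3))
    return f"{BASE_URL}/{prefix}/{triplets}/{accession}_{assembly_name}"


def assembly_file_url(accession, assembly_name, suffix):
    """Build the URL of one file, e.g. suffix='_genomic.fna.gz'."""
    directory = assembly_directory_url(accession, assembly_name)
    return f"{directory}/{accession}_{assembly_name}{suffix}"


if __name__ == "__main__":
    # Placeholder values, not a real assembly.
    print(assembly_file_url("GCF_123456789.1", "ExampleAsm", "_genomic.fna.gz"))
```

The suffix argument corresponds to the part of each file name that follows the asterisk in the table.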

Where can I find information to help me choose between the many different assembled genomes for a species?

Many different assembled genomes are available for species with medical, agricultural, or scientific relevance. The Genus_species directories under the “genbank” and “refseq” directory trees each contain an assembly_summary.txt file that provides general information on all assembled genome versions included in the directory such as release date, submitter organization, assembly level, and annotation status. For example, see ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt.

After assembled genomes of interest have been identified using the data from the species-specific assembly_summary.txt file, they can be accessed via the “all_assembly_versions” directory for that species. Alternatively, any assembled genome that the NCBI Reference Sequence Database (RefSeq) has selected as a reference or representative genome can be accessed via the directories named “reference” or “representative” in the Genus_species directories under the “genbank” and “refseq” directory trees.
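
Filtering an assembly_summary.txt file can be sketched as below. The sketch assumes the file is tab-delimited, that comment lines start with "#", and that the last comment line before the data holds the column names; the column names used in the usage note (assembly_level, assembly_accession) are assumptions and should be checked against the header of the actual file.

```python
def read_assembly_summary(text):
    """Parse assembly_summary.txt content into a list of dicts.

    Assumes tab-delimited rows and a '#'-prefixed header line that is
    the last comment line before the first data row.
    """
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            if not rows:  # only comment lines before the data can be the header
                header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def select(rows, **criteria):
    """Keep rows whose columns exactly match every given value."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]
```

For example, select(rows, assembly_level="Complete Genome") would keep only the rows with that value in the assumed assembly_level column, and the accession of each remaining row could then be read from the assumed assembly_accession column.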

Do you provide assembled genome data formatted for use by sequence read alignment pipelines?

Genomic FASTA with modified sequence identifiers and index files convenient for analysis with Next Generation Sequencing tools are currently provided for the Genome Reference Consortium’s human and mouse assembled genomes GRCh38, GRCm38.p3, and GRCm39. RefSeq annotation in GFF3 and GTF formats, with sequence identifiers matching those in the FASTA files, is also provided to facilitate use in RNA-Seq analysis pipelines.

The four analysis sets provided for GRCh38 (no_alt_analysis_set, full_analysis_set, full_plus_hs38d1_analysis_set, and no_alt_plus_hs38d1_analysis_set) and the two analysis sets provided for GRCm38.p3 (no_alt_analysis_set and full_analysis_set) differ from the corresponding full assembled genomes by one or more of the following:

  • omission of alternate locus and patch scaffolds that cause complications for sequence read alignment programs that are not alternate contig aware (alt-aware)
  • hard masking of duplicate copies of pseudo-autosomal regions and centromeric arrays
  • addition of “decoy” sequences

Index files generated by BWA, Samtools, Bowtie, and HISAT2 are also provided. See the GRCh38 README, GRCm38 README, or GRCm39 README for a full description.

Are repetitive sequences in eukaryotic genomes masked?

Yes. All eukaryotic genome sequences are soft-masked (repetitive sequence is shown in lower-case) using WindowMasker or RepeatMasker. For genomes that are masked using RepeatMasker, an additional file with information about the masked regions is also provided (see table).
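
Because soft-masking is expressed as lower-case letters in the FASTA sequence, the masked fraction of a downloaded file can be estimated with a short script. The following is a minimal sketch; the function name is hypothetical, and it assumes a plain or gzip-compressed FASTA file in which header lines start with ">".

```python
import gzip


def softmasked_fraction(fasta_path):
    """Return the fraction of sequence characters that are lower-case."""
    masked = 0
    total = 0
    # Choose the opener from the file extension (.gz means gzip-compressed).
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip FASTA header lines
            sequence = line.strip()
            total += len(sequence)
            masked += sum(1 for base in sequence if base.islower())
    return masked / total if total else 0.0
```

Tools that ignore case treat soft-masked sequence like any other sequence, so the masking can be removed by converting the sequence lines to upper-case if it is not wanted.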

Generated May 1, 2024