The NCBI Genome Assembly Data Model

NCBI provides stable accessions and data tracking for genome assemblies. It stores the names and identifiers of the sequences in each genome assembly, along with the associated metadata (such as assembly name, date of submission, name of submitter, and details of the sequenced organism), assembly statistics, and the organization of the component sequences into scaffolds and chromosomes. NCBI uses the Genome Reference Consortium (GRC) data model, which reflects the complexity of modern genome assemblies (see Figure 1 below). The data model accounts for all sequences known to represent an organism’s genome, including those that are not yet assigned to chromosome assemblies. Assemblies are classified by level: contig, scaffold, chromosome, or complete genome.

All assemblies contain a unit termed the “primary assembly.” This unit consists of non-redundant sequences (chromosomes and/or scaffolds) that represent an organism’s haploid genome. Additional assembly units that may be included are organelle genomes (mitochondria, plastids), alternate loci (sequences aligned to the primary assembly that provide alternate representations of the corresponding loci in the primary assembly), and genome patches (sequences that represent assembly updates).

Figure 1: A graphical representation of the NCBI assembly model showing the genome of a eukaryote with three assembly units: a primary assembly containing two nuclear chromosomes, a mitochondrial genome, and an alternate locus group containing sequences from two chromosomes.
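
The arrangement shown in Figure 1 can be sketched as a small data structure. This is an illustration only, not NCBI's schema: the class names (UnitType, AssemblyUnit, Assembly), the primary() helper, and the sequence names are all hypothetical, chosen to mirror the three assembly units described in the caption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class UnitType(Enum):
    """Kinds of assembly unit named in the text (labels are illustrative)."""
    PRIMARY = "primary assembly"
    ORGANELLE = "organelle genome"
    ALT_LOCI = "alternate loci"
    PATCHES = "genome patches"


@dataclass
class AssemblyUnit:
    name: str
    unit_type: UnitType
    sequences: List[str] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    units: List[AssemblyUnit] = field(default_factory=list)

    def primary(self) -> AssemblyUnit:
        """Return the primary assembly unit; every assembly must have one."""
        for unit in self.units:
            if unit.unit_type is UnitType.PRIMARY:
                return unit
        raise ValueError(f"assembly {self.name!r} has no primary assembly unit")


# Hypothetical eukaryote matching Figure 1: two nuclear chromosomes,
# a mitochondrial genome, and alternate loci from both chromosomes.
example = Assembly(
    name="example-eukaryote",
    units=[
        AssemblyUnit("Primary Assembly", UnitType.PRIMARY, ["chr1", "chr2"]),
        AssemblyUnit("Mitochondrion", UnitType.ORGANELLE, ["chrMT"]),
        AssemblyUnit("ALT_LOCI_1", UnitType.ALT_LOCI, ["chr1_alt1", "chr2_alt1"]),
    ],
)

print(example.primary().sequences)
```

Keeping each unit as its own object makes it easy to work with the haploid representation alone (the primary unit) while still tracking every other sequence that belongs to the assembly.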

Details about the assembly model are available in the publication “Assembly: a resource for assembled genomes at NCBI.” For more information on terms used in the model and descriptions of additional categories of genome assemblies, please visit the Glossary.

Referencing Genome Assembly Data

When an assembly is submitted to NCBI’s GenBank or another member database of the International Nucleotide Sequence Database Collaboration (INSDC), each individual sequence associated with the assembly is assigned a unique accession number followed by a dot and a version (e.g., CM000663.2). If a sequence is updated, the accession remains the same but the version increases, which allows updates to be tracked. Metadata updates do not result in a version increment.
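
As a minimal sketch of how the accession.version convention can be handled in code, the following splits an identifier such as CM000663.2 into its two parts and checks whether one identifier is a later version of the same sequence. The function names and the regular expression are assumptions made for illustration; they are not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class SeqAccession(NamedTuple):
    accession: str
    version: int


# Assumed shape: letters, digits, or underscores, then a dot, then an
# integer version. This is a permissive illustration, not an official grammar.
_ACC_VER = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def parse_accession_version(text: str) -> SeqAccession:
    """Split 'ACCESSION.VERSION' into its accession and integer version."""
    match = _ACC_VER.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return SeqAccession(match.group(1), int(match.group(2)))


def is_sequence_update(old: str, new: str) -> bool:
    """True if `new` is a later version of the same accession as `old`."""
    a, b = parse_accession_version(old), parse_accession_version(new)
    return a.accession == b.accession and b.version > a.version


print(parse_accession_version("CM000663.2"))
print(is_sequence_update("CM000663.1", "CM000663.2"))
```

Because metadata updates do not change the version, a check like is_sequence_update only detects changes to the sequence itself.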

In much the same way, NCBI assigns an unambiguous accession.version at the assembly level, which is specific to the precise set of individual sequences that make up the genome assembly. A change in version at the assembly level indicates sequence updates to one or more component sequences. To learn more, see the Assembly Versioning and Status documentation. Assembly accessions also indicate their source: GenBank assemblies carry a ‘GCA_’ prefix, and RefSeq assemblies carry a ‘GCF_’ prefix. The latter are copies of GenBank assemblies used by RefSeq to annotate genome assemblies. To learn more about the RefSeq annotation process, see Eukaryotic Genome Annotation at NCBI.
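
The prefix convention lends itself to a simple classifier. The sketch below is hypothetical: the function name, the regular expression, and the example accessions (GCA_000000000.1 and GCF_000000000.3) are placeholders invented for illustration, and the pattern assumes only that a numeric identifier and a version follow the prefix.

```python
import re
from typing import Tuple

# Assumed shape: GCA_ or GCF_, a run of digits, a dot, and an integer version.
_ASSEMBLY = re.compile(r"^(GCA|GCF)_(\d+)\.(\d+)$")

# Prefix-to-database mapping as described in the text.
_SOURCE = {"GCA": "GenBank", "GCF": "RefSeq"}


def classify_assembly(accession: str) -> Tuple[str, int]:
    """Return (source database, assembly version) for an assembly accession."""
    match = _ASSEMBLY.match(accession.strip())
    if match is None:
        raise ValueError(f"not a GCA_/GCF_ assembly accession: {accession!r}")
    prefix, _digits, version = match.groups()
    return _SOURCE[prefix], int(version)


# Placeholder accessions, not real records.
print(classify_assembly("GCA_000000000.1"))
print(classify_assembly("GCF_000000000.3"))
```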

Generated May 7, 2024