Frequently Asked Questions for Genomes

Can I submit an assembly and have it held back until I publish my paper?
Where do I submit my viral genome assemblies?
Do I have to register a separate BioProject for each genome I am sequencing?
Do I need to submit my genome assembly with annotation?
Does NCBI have an annotation pipeline that can be used to annotate my assembly?
If I do have my own annotation, in what format should I provide this data?
Can I submit annotation as a GenBank flatfile?
I'm using next generation sequencing technology. Can I still submit an assembly?
Do I need to split the sequences at the Ns that were inserted by the assembler?
What should I use for the gap sizes?
I concatenated the sequences into the correct order with Ns between each sequence and annotated this pseudomolecule. Can I submit this annotated pseudomolecule?
I concatenated the sequences in a random order with Ns between each sequence and annotated the pseudomolecule. Can I submit the annotated pseudomolecule?
Can I annotate across gaps?
My genome assembly has contigs and scaffolds. Should I submit the annotation on the contigs or the scaffolds?
I want all of the WGS contigs in my assembly available to users. Should I put unlinked WGS contigs into the AGP?
How do I submit the separate haplotypes that were created from the reads of a diploid/polyploid genome?
How do I submit a prokaryotic or eukaryotic genome assembled from metagenomic reads (a MAG)?
Can I submit RAST annotation?

Can I submit an assembly and have it held back until I publish my paper?

Yes, you may submit your assembly and have it held until publication. You will select a release date, and your genome will be released on that day or when it is publicly available, whichever is first. If needed, you can write to genomes@ncbi.nlm.nih.gov to request a change of the release date.

Note that release of the genome will automatically trigger the release of its BioProject and BioSample. However, the reverse is not true; the release of a BioProject or BioSample will not automatically trigger the release of associated data.

Where do I submit my viral genome assemblies?

Virus sequences are submitted to GenBank via the appropriate option on the BankIt page.

Do I have to register a separate BioProject for each genome I am sequencing?

If multiple genomes are part of the same research effort, then they should belong to the same BioProject. However, each sample must be registered as a separate BioSample.

Be sure to use the same BioProject and BioSample for the assembled genome and for the sequence reads that were used to assemble it.

Do I need to submit my genome assembly with annotation?

No, you can submit the genome without any annotation. However, during the genome submission you may request that a prokaryotic genome assembly be annotated by NCBI's Prokaryotic Genome Annotation Pipeline before its release into GenBank.

Does NCBI have an annotation pipeline that can be used to annotate my assembly?

You can request that NCBI annotate prokaryotic genomes using our Prokaryotic Genome Annotation Pipeline (PGAP) during the submission process.

In addition, you can download and run PGAP yourself before submission, if desired.

The NCBI Eukaryotic Genome Annotation pipeline is not available as a GenBank submitter resource. See its annotation policy for details.

If I do have my own annotation, in what format should I provide this data?

To submit the annotation, you need to create a .sqn file in ASN.1 format that combines the annotation and the sequence, so that validation can check that the two are consistent with each other. The basic description is at https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/#sqn.

As described there, the annotation input can be in either the 5-column feature table (.tbl) format OR as a GenBank-specific GFF3 file. You then run table2asn to create the .sqn file and validate it.

Prokaryotic Annotation Guidelines with .tbl examples
Eukaryotic Annotation Guidelines with .tbl examples
Using GFF3 file as input

Note that our set of RAST conversion scripts is able to convert some .gb flatfile formats into a GenBank submission.

Can I submit annotation as a GenBank flatfile?

In general, we cannot accept annotation as a GenBank, EMBL or DDBJ flat file. To submit annotation, see the FAQ "If I do have my own annotation, in what format should I provide this data?" above.

However, you might be able to use the RAST conversion scripts to make the correct file for submission from a .gb file, although there may still be problems that need to be fixed to create a GenBank submission.

I'm using next generation sequencing technology. Can I still submit an assembly?

Yes, you may submit assemblies using second or third generation sequencing technology. The primary reads should be submitted to the Sequence Read Archive. The reads should be assembled into contigs and submitted as described in the submission instructions. These WGS contigs can be used to assemble higher order molecules and submitted to GenBank genomes either as gapped scaffold sequences or as contigs plus an AGP file, as described in the submission instructions.

Do I need to split the sequences at the Ns that were inserted by the assembler?

No, you do not need to split properly assembled sequences. However, sequences that have been concatenated in random, unknown order are not allowed. For example, we cannot accept a single sequence of all the unplaced sequences (eg, a chromosome Un).

During the submission process you will be asked to indicate what the Ns in the genome sequence represent. The default answers are that 10 or more Ns in a row represent a gap and that "paired-ends" is the evidence that the sequences on either side of each gap are linked. If those answers are not correct, then you need to provide the correct answers in the submission form. During processing of the submission, those runs of Ns will be converted to assembly_gap features. Note that NCBI's Assembly resource counts runs of 10 or more Ns as a gap, regardless of whether they have been converted to a gap during processing of the genome.
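
The default rule above (10 or more Ns in a row represent a gap) can be previewed on your own sequences before submission. The following is a minimal sketch, not an NCBI tool; the function name `find_gap_runs` and the example sequence are hypothetical.

```python
import re

def find_gap_runs(sequence, min_run=10):
    """Return 1-based inclusive (start, end) coordinates of every run of Ns
    that is at least min_run bases long."""
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]

# Hypothetical example: a 12-N run (treated as a gap) and a 5-N run (not a gap).
example = "ACGT" + "N" * 12 + "ACGT" + "N" * 5 + "GG"
print(find_gap_runs(example))
```

Runs shorter than `min_run` are ignored here, so they would remain ordinary ambiguous bases under the default answers.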

The original/traditional submission format, of splitting the sequences at the runs of Ns into contigs and rebuilding the scaffolds with an AGP file, remains a submission option.

What should I use for the gap sizes?

If you have estimates of the gap sizes, then use those values for the gaps in the AGP file. We prefer that you use 10 as the minimum gap size, to be more of a signal to database users. If you do not have an estimate of the gap size, then the preference is to use 100 as the value and the 'U' in column five of the AGP file, indicating that the gap size is unknown.
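
As an illustration of the two cases above, the sketch below builds one AGP gap line. It assumes the nine-column, tab-delimited gap-line layout of the AGP 2.x specification. The function name `agp_gap_line` is hypothetical, and the default gap type, linkage and evidence values are placeholders that you should replace with what is true for your assembly.

```python
def agp_gap_line(obj, start, part_number, gap_length=None,
                 gap_type="scaffold", linkage="yes", evidence="paired-ends"):
    """Build one tab-delimited AGP gap line.

    Estimated gap size: column five is 'N' and the estimate is the length.
    Unknown gap size:   column five is 'U' and the length is 100.
    """
    if gap_length is None:
        component_type, length = "U", 100
    else:
        component_type, length = "N", gap_length
    end = start + length - 1  # AGP coordinates are 1-based and inclusive
    fields = [obj, start, end, part_number, component_type,
              length, gap_type, linkage, evidence]
    return "\t".join(str(field) for field in fields)

# Hypothetical scaffold: a gap of unknown size starting at base 1001.
print(agp_gap_line("scaffold1", 1001, 2))
```

Passing `gap_length=50`, for example, would emit an 'N' line spanning 50 bases instead.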

If there is no annotation, then you can submit the fasta file and answer the questions about the Ns in the sequence. The default answers are that 10 or more Ns in a row represent a gap and that "paired-ends" is the evidence that the sequences on either side of each gap are linked. If those answers are not correct, then you provide the correct answers in the submission form. During processing those runs of Ns will be converted to assembly_gap features. Note that NCBI's Assembly resource counts runs of 10 or more Ns as a gap, regardless of whether they have been converted to a gap during processing of the genome.

For more complicated submissions with annotation on gapped sequence, follow the Gapped submission cases and instructions.

I concatenated the sequences into the correct order with the Ns between each sequence and annotated the pseudomolecule. Can I submit this annotated pseudomolecule?

Yes, you can submit this as a gapped submission. However, you will need to include the correct gap and linkage evidence for each run of Ns that represents a gap. You can make the appropriate gaps with table2asn, as described, and use a .tbl or GenBank-specific .gff file as the annotation input.

I concatenated the sequences in a random order with Ns between each sequence and annotated this pseudomolecule. Can I submit the annotated pseudomolecule?

No, you cannot. Since the sequence does not correspond to a biological molecule, you need to split the pseudomolecule into the contig sequences and submit those as the pieces of a wgs project. You will need to map the annotation down to the contig level, but can use an offset in the .tbl file to avoid recalculating locations, if desired.
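
If you prefer to recalculate the locations rather than use an offset, the mapping is a simple subtraction. The sketch below is only an illustration under stated assumptions: the function name `to_contig_coords` and the `layout` list (contig name plus its 1-based start and end on the pseudomolecule) are hypothetical, and features that are not wholly contained in one contig are returned as `None` for manual review.

```python
def to_contig_coords(feature_start, feature_end, layout):
    """Map a 1-based feature interval on the pseudomolecule to contig coordinates.

    layout is a list of (contig_name, start_on_pseudomolecule, end_on_pseudomolecule).
    Returns (contig_name, start, end), or None if the feature is not fully
    contained in a single contig.
    """
    for name, contig_start, contig_end in layout:
        if contig_start <= feature_start and feature_end <= contig_end:
            offset = contig_start - 1
            return name, feature_start - offset, feature_end - offset
    return None

# Hypothetical layout: two contigs separated by 100 Ns.
layout = [("contig1", 1, 1000), ("contig2", 1101, 2500)]
print(to_contig_coords(1201, 1500, layout))
```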

Can I annotate across gaps?

Protein translations are allowed to cross gaps of estimated size, but not those of unknown sizes. That is, introns can be in gaps of unknown size, but not exons. However, annotation across gaps is discouraged unless there is evidence that the translation on the other side of the gap is in the correct frame. In addition, if >50% of the translation is Xs (i.e. in the gap) then the CDS should be made partial at the gap, or split into two partial CDSs, as described for genes split across two contigs, depending upon the confidence of the translation on both sides of the gap.
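
The 50% rule above can be checked directly on a translated protein. This is a minimal sketch with hypothetical function names; it only flags candidates and does not decide between making the CDS partial and splitting it.

```python
def fraction_unknown(protein):
    """Fraction of residues that are X, ignoring any trailing stop symbol (*)."""
    residues = protein.rstrip("*")
    if not residues:
        return 0.0
    return residues.upper().count("X") / len(residues)

def needs_partial_or_split(protein, threshold=0.5):
    """True when more than `threshold` of the translation is Xs."""
    return fraction_unknown(protein) > threshold

# Hypothetical translations: the first is mostly Xs, the second is not.
print(needs_partial_or_split("MKXXXXXXA"))
print(needs_partial_or_split("MKXXAAAA"))
```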

My genome assembly has contigs and scaffolds. Should I submit the annotation on the contigs or the scaffolds?

Eukaryotic genomes, which usually have thousands of contigs and hundreds or thousands of scaffolds, should be annotated at the scaffold level.

I want all of the WGS contigs in my assembly available to users. Should I put singleton WGS contigs into the AGP?

When a genome submission includes an AGP file, that file defines the assembly. Therefore, typically we do want all of the WGS contigs in the AGP file. However, contigs that are not considered to be part of the assembly, perhaps because they are degenerate or duplicates, should not be included in the AGP file. In addition, remove from the submission any sequences that are shorter than 200 bp and are not part of multi-component scaffolds.
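
The length rule in the last sentence can be sketched as a filter. The function name `keep_sequence` and the example data are hypothetical; the set of contigs that belong to multi-component scaffolds is assumed to have been collected from your own AGP file.

```python
def keep_sequence(name, length, multi_component_members, min_length=200):
    """Keep a sequence unless it is shorter than min_length AND is not part
    of a multi-component scaffold."""
    return length >= min_length or name in multi_component_members

# Hypothetical contig lengths and scaffold membership.
lengths = {"contig1": 5000, "contig2": 150, "contig3": 120}
in_multi_component_scaffold = {"contig2"}
kept = [name for name, length in lengths.items()
        if keep_sequence(name, length, in_multi_component_scaffold)]
print(kept)
```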

How do I submit the separate haplotypes that were created from the reads of a diploid/polyploid genome?

When the assembly methods were able to generate separate assemblies of the haplotypes of a diploid/polyploid genome, submit them according to the instructions at "Submitting multiple haplotype assemblies".

How do I submit a prokaryotic or eukaryotic genome assembled from metagenomic reads (a MAG)?

Description: You isolated DNA from an environmental or mixed sample and then binned and assembled the sequences to create individual prokaryotic or eukaryotic metagenome-assembled genomes (MAGs). Each assembly must:

represent the genome from a single prokaryotic or eukaryotic organism reconstructed from the metagenomic mix
include all the identified genome sequence (ie, you have not intentionally removed noncoding regions or included only the sequences for a single kind of gene)
have a CheckM or CheckM2 completeness of at least 90%
have a total size of at least 100,000 nucleotides
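
The two numeric requirements in this list can be pre-checked before you start a submission. The sketch below assumes that the 90% figure refers to the completeness estimate reported by CheckM or CheckM2, and that you already have that number. The function name `mag_passes_size_and_quality` is hypothetical, and the check does not cover the other two requirements.

```python
def mag_passes_size_and_quality(sequences, checkm_percent,
                                min_percent=90.0, min_total_length=100_000):
    """Check the CheckM/CheckM2 percentage and the total assembly size.

    sequences maps each contig name to its nucleotide sequence.
    checkm_percent is assumed to be the completeness estimate (0 to 100).
    """
    total_length = sum(len(seq) for seq in sequences.values())
    return checkm_percent >= min_percent and total_length >= min_total_length

# Hypothetical MAG with two contigs totalling 110,000 nucleotides.
mag = {"contig1": "A" * 60_000, "contig2": "C" * 50_000}
print(mag_passes_size_and_quality(mag, 95.2))
```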

Note that you should only use sequences that you have determined yourself. Do not include sequences you have only downloaded from a public depository. The raw reads should be submitted to the Sequence Read Archive (SRA) and the contigs made from overlapping reads can be submitted as a genome assembly.

(1) You will need to register a BioProject for this research effort. You can use this one BioProject for all the data associated with this study.

(2) You will need the SRA run accessions for the reads used to create the MAGs. SRA data is organized into 4 levels:

STUDY: accessions begin with SRP,ERP,DRP
SAMPLE: accessions begin with SRS,ERS,DRS
EXPERIMENT: accessions begin with SRX,ERX,DRX
RUN: accessions begin with SRR,ERR,DRR

Please provide the 'run' accessions for the individual reads that were used for each MAG. The accessions should start with SRR, ERR, or DRR. You will need the SRA accessions when you create the MAG BioSamples in step (5).
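
Because the four levels differ only in their prefix, it is easy to confirm that a list contains run accessions and nothing else. This is a minimal sketch based on the prefixes listed above; the function names are hypothetical, and the pattern checks only the prefix followed by digits, not whether the accession exists.

```python
import re

# Third letter of the prefix identifies the SRA level.
LEVELS = {"P": "STUDY", "S": "SAMPLE", "X": "EXPERIMENT", "R": "RUN"}
ACCESSION = re.compile(r"[SED]R([PSXR])\d+")

def sra_level(accession):
    """Return STUDY, SAMPLE, EXPERIMENT or RUN, or None if the prefix is not recognized."""
    match = ACCESSION.fullmatch(accession)
    return LEVELS[match.group(1)] if match else None

def run_accessions(accessions):
    """Keep only the run-level accessions (SRR, ERR, DRR)."""
    return [acc for acc in accessions if sra_level(acc) == "RUN"]

# Hypothetical accessions at different levels.
print(run_accessions(["SRR1234567", "SRX7654321", "ERR000111", "SRP000001"]))
```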

Alternatively, if you sequenced the data but did not submit the reads to SRA, instead of providing the SRA accessions, you can use a BioSample that represents the mixed sample from which the DNA was isolated. Register this physical sample in the BioSample database. Select either the NCBI "Metagenome or environmental" package or the GSC MIxS "MIMS Environmental/Metagenome" package. Use a metagenome organism name that describes the sample from which the DNA was isolated (eg soil metagenome or gut metagenome). Choose one of the metagenome names that is already present in the NCBI Taxonomy database. If you did not submit the reads to SRA, you will need this physical metagenome BioSample when you create the MAG BioSamples in step (5).

(3) Please provide a unique alpha-numeric code to distinguish each MAG assembly. We will add the identifier as an isolate but we realize each organism was metagenomically binned and not isolated. The isolate will be a stable identifier used only for a single MAG that will not change over time. We do not recommend including the organism name or an abbreviation of the organism name in the isolate, because the organism name may be updated if additional work is done to characterize the MAG in the future, but the isolate will not change. Do not include SRA accessions as part of the isolate. The isolate should be a series of letters or numbers that serve as an identifier for your organism assembly. For example, how do you identify this assembly in your laboratory notebook? If you don't have another identifier, you could use something like MAG1, MAG2. You will need the isolate when you register the MAG BioSample in step (5).

(4) We will need an organism name for each MAG. The organism names should be taxonomically meaningful, at the lowest rank that is reliable (division, phylum, class, order, family, genus or species) and in the NCBI Taxonomy database. Note that NCBI does not utilize unpublished ad hoc taxonomic names from other databases such as Silva or GTDB. Therefore, before registering the BioSamples, please email to genomes@ncbi.nlm.nih.gov a list of organism names you plan to use and we will verify that those names are in the NCBI taxonomy database or are appropriate to be added to the database. Please provide a table with:

Column 1: the unique isolate name from step (3)
Column 2: the organism names you would like to use OR the GTDB lineage in the original unmodified format

We can use the unmodified GTDB lineage to determine the best NCBI tax name, but our tool will not work if the format from GTDB has been modified. Here is an example of the correct format:

Isolate	GTDB lineage
MAG1	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli_D
MAG2	d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Monoglobales_A;f__UBA1381;g__12844;s__
MAG3	d__Bacteria;p__Myxococcota;c__Polyangia;o__Haliangiales;f__Haliangiaceae;g__;s__
MAG4	d__Bacteria;p__Patescibacteria;c__ABY1;o__SG8-24;f__2-12-FULL-60-25;g__;s__

We will confer with the NCBI taxonomists and return to you the organism names to use when you create the MAG BioSamples in step (5).
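
To see which rank each lineage in your table actually resolves to, you can read the rank prefixes in the lineage string. The sketch below is only a convenience for reviewing your own table and is not the NCBI tool mentioned above; the function name `lowest_assigned_rank` is hypothetical. It only reads the lineage and does not modify it, so the string you send should remain in the original format.

```python
RANKS = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}

def lowest_assigned_rank(lineage):
    """Return (rank, name) for the lowest rank that has a name in a GTDB lineage."""
    lowest = None
    for part in lineage.split(";"):
        prefix, _, name = part.partition("__")
        if prefix not in RANKS:
            raise ValueError("unexpected rank prefix: %r" % part)
        if name:
            lowest = (RANKS[prefix], name)
    return lowest

# Lineage copied from the MAG3 row of the example table above.
print(lowest_assigned_rank(
    "d__Bacteria;p__Myxococcota;c__Polyangia;o__Haliangiales;f__Haliangiaceae;g__;s__"))
```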

(5) Use the names that we return to you in step (4) to create organism-specific MAG BioSamples. You should create one BioSample for each MAG. When you create the MAG BioSamples:

choose the “MIMAG Metagenome-assembled Genome” package
include the BioProject ID PRJNAxxxx you created in step (1)
include as much source information as you can (eg, geo_loc_name, collection_date, lat_lon, isolation_source, etc.). The information should agree with the corresponding SRA data or physical BioSample(s). If multiple samples were used, include the common information. For example, if samples were collected on 3 different dates in the same year, just use the year as the collection_date of the MAG.
include a unique isolate name from step (3)
include sample_type=metagenomic assembly
in the derived_from attribute, list the SRA accessions for the reads used to create the MAGs (see step (2)). If there is more than one read accession for a MAG, list all the accessions separated by commas. Do not hyphenate the list. Alternatively, if you did not submit the reads or if your protocol precluded knowing which reads were used to assemble the MAG, you can provide the SAMN id for the physical BioSample(s) that represent the mixed sample from which the DNA was isolated, as described in step (2).
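
The derived_from value described in the last item is a plain comma-separated list. A minimal sketch of building it follows; the function name `derived_from_value` is hypothetical, and the hyphen check is only a guard against accidentally entering a range.

```python
def derived_from_value(accessions):
    """Join accessions with commas, rejecting hyphenated ranges."""
    cleaned = [acc.strip() for acc in accessions if acc.strip()]
    for acc in cleaned:
        if "-" in acc:
            raise ValueError("list every accession separately, not as a range: %r" % acc)
    return ",".join(cleaned)

# Hypothetical run accessions for one MAG.
print(derived_from_value(["SRR1000001", "SRR1000002", "SRR1000003"]))
```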

If you have several MAG BioSamples, you can use a table to upload all the BioSample information. From the BioSample registration page select "Download batch template". Choose the “MIMAG Metagenome-assembled Genome” package and select "download". Fill in this template and then upload it using the "Batch/Multiple BioSamples" option when you create a new BioSample submission. Alternatively, you can provide this information in the embedded table within the BioSample submission form. Note that you can only create 1000 BioSamples in a single table. If you have more than 1000 MAGs, you will need to divide the table into separate BioSample submissions. If you are planning to submit 5000 or more MAGs, please write to genomes@ncbi.nlm.nih.gov so we can review your BioSamples before you begin submitting the genome files.

(6) Prepare the genome sequences. In the fasta header of each sequence, include the SRA read accessions (SRR,ERR,DRR) of the reads that were used to assemble the MAG (see step (2)). For example:

>contig1 [SRA=SRRxxxxxx,SRRxxxxxy]

(7) After you have created the BioProject and the BioSamples, you are ready to submit the data using the genome submission portal.
Submit each MAG assembly as a separate row in a batch submission using the BioProject ID PRJNAxxxxx from step (1) and the BioSample ID SAMNxxxxxxxx for the individual MAG from step (5). Note that because we run several validation checks on each genome assembly, a single batch submission cannot contain more than 400 assemblies. If you have more than 400 MAGs, you will need to divide them into separate batches.
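
Dividing a long list of MAGs into batches of at most 400 can be sketched as follows. The function name `split_into_batches` and the MAG identifiers are hypothetical.

```python
def split_into_batches(items, max_per_batch=400):
    """Split a list into consecutive batches of at most max_per_batch items."""
    return [items[i:i + max_per_batch]
            for i in range(0, len(items), max_per_batch)]

# Hypothetical study with 950 MAGs.
mags = ["MAG%d" % n for n in range(1, 951)]
batches = split_into_batches(mags)
print([len(batch) for batch in batches])
```

The same helper with `max_per_batch=1000` would match the BioSample table limit given in step (5).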

(8) Annotation is not required; however, you may be interested to know that NCBI has a publicly available Prokaryotic Genome Annotation Pipeline (PGAP). PGAP can be used for prokaryotic MAGs but not for sequences that are identified only as metagenomes. You can request PGAP annotation during submission of the MAG to GenBank, or you can run PGAP yourself and submit a GenBank-ready file. Note that we do not have a publicly available eukaryotic annotation pipeline.

Can I submit RAST annotation?

We have a prototype that will convert flatfile formats created by outside programs for prokaryotes into a 5-column feature table. However, there may still be problems because GenBank-type files from other sources often contain qualifiers that are not recognized by GenBank so they cannot be converted. Conversely, features or qualifiers that are required by GenBank may be missing. In addition, there may be errors such as internal N's representing gaps, invalid translations or unacceptable protein names that need to be addressed.

To convert the flatfile (.gb) file from RAST to a .sqn file for GenBank submission, get the scripts from the scripts directory on the NCBI ftp site: https://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/scripts/

gbf2tbl.pl
rast2sqn.sh
rastbatch.sh
tblfix.pl

In addition, provide the following:

a template file (from https://submit.ncbi.nlm.nih.gov/genbank/template/submission/)
flatfile from RAST (*gb)
locus tag prefix (whatever is registered in BioProject for this genome)
protein_id prefix (an abbreviation of your lab name that you think will be unique)

usage:

./rast2sqn.sh template flatfile locus_tag_prefix protein_id_prefix

for example:

input:

flatfile = TEST.gb
template file = template.sbt
locus_tag prefix = AAA
protein_id_prefix = xx

commandline:

./rast2sqn.sh template.sbt TEST.gb AAA xx

output:

TEST.sqn
TEST.fsa
TEST.tbl
TEST.val = validation
errorsummary.val = summary of validation
TEST.dsc = discrepancy report
TEST.err = qualifiers that couldn't be converted
TEST.ecn = EC_numbers that are not found at ftp://ftp.expasy.org/databases/enzyme/enzyme.dat
TEST.fixedproducts = product names found by the discrepancy report Typo, Hypothetical protein, and American spelling tests that are automatically corrected

You will need to review the validation and discrepancy reports, as described in the section "3) Check the output of the validation and discrepancy report and fix problems" under the 'see details' hyperlink.

Make any necessary corrections to the starting .gb file and re-run the script. Alternatively, you can edit the .tbl file and then run table2asn as described to create a .sqn file for submission.

Submit the .sqn file, as described.
