Validation Error Explanations for Genomes

This page has explanations for individual errors that are commonly found during processing of prokaryotic and eukaryotic genomes, along with suggestions to fix them. Write to genomes@ncbi.nlm.nih.gov if you do not know how to correct the error in your submission.

Explanations of discrepancy report problems that are reported in the "discrep" file can be found at https://www.ncbi.nlm.nih.gov/genbank/asndisc#fatal and https://www.ncbi.nlm.nih.gov/genbank/asndisc.examples/

Remember that annotation is not required for genome submissions, and that you can request NCBI's Prokaryotic Genome Annotation Pipeline for your prokaryotic genome submissions. For more information about annotation, see the Prokaryotic Genome Annotation Guidelines or Eukaryotic Genome Annotation Guidelines.

Error List

SEQ_FEAT_BadCharInAuthorLastName
SEQ_FEAT_BadCharInAuthorName
SEQ_DESCR_BadCollectionCode
SEQ_DESCR_BadCollectionDate
SEQ_DESCR.BadCountryCode
SEQ_FEAT.BadProteinName: Unknown or hypothetical protein should not have EC number
SEQ_INST.BadProteinStart
SEQ_DESCR.BadVoucherID
SEQ_DESCR.BioSourceMissing
SEQ_FEAT.EcNumberProblem
SEQ_FEAT.FeatureBeginsOrEndsInGap
SEQ_FEAT.GenCodeInvalid
SEQ_FEAT.GenCodeMismatch
SEQ_FEAT.IllegalDbXref
SEQ_DESCR.InconsistentMolInfoTechnique
SEQ_INST.InternalNsInSeqRaw
SEQ_FEAT.InternalStop
SEQ_FEAT.InvalidInferenceValue: unrecognized database
SEQ_FEAT.InvalidQualifierValue: rRNA has no name
SEQ_DESCR.LatLonCountry
SEQ_DESCR.LatLonFormat
SEQ_DESCR.LatLonProblem
SEQ_DESCR.LatLonRange
SEQ_DESCR.LatLonValue
SEQ_FEAT.MisMatchAA
GENERIC.MissingPubInfo: No submission citation anywhere on this entire record
GENERIC.MissingPubInfo: Submission citation affiliation has no state
GENERIC.MissingPubInfo
GENERIC.MissingPubRequirement
SEQ_FEAT.MissingTrnaAA
SEQ_DESCR.NoOrgFound
SEQ_DESCR.NoPubFound
SEQ_DESCR.NoSourceDescriptor
SEQ_FEAT.NoStop
SEQ_FEAT.OnlyGeneXrefs
SEQ_FEAT.PartialProblem PartialLocation: Start does not include first/last residue of sequence
SEQ_FEAT.ShortIntron
SEQ_INST.ShortSeq
SEQ_FEAT.StartCodon
SEQ_INST.StopInProtein
SEQ_INST.TerminalNs
SEQ_FEAT.TransLen
SEQ_FEAT.UnknownFeatureQual: orig_protein_id
SEQ_DESCR.UnstructuredVoucher
SEQ_DESCR.UnwantedCompleteFlag
SEQ_DESCR.WrongVoucherType

SEQ_FEAT_BadCharInAuthorLastName

Explanation: An author name has illegal characters.

Suggestion: Check the last names (family names) in the sequence and publication references. Use only plain ASCII text for the names. The last name should NOT contain symbols, numbers, accents, umlauts, or other characters with diacritical marks, and should NOT end in punctuation. Note that names with internal punctuation such as "St. John" or "D'Abaco" will validate.

examples:

incorrect: Henry Jones., Carlos Méndez, Xu 1Weng

corrected: Henry Jones, Carlos Mendez, Xu Weng

The terminal period, the accented character, and the number in these family names each cause an error. Correct the error by removing the symbols, numbers, or punctuation, and by replacing characters that have diacritical marks with their plain ASCII equivalents.

SEQ_FEAT_BadCharInAuthorName

Explanation: An author name has illegal characters.

Suggestion: Check the first names (given names) in the sequence and publication references. Use only plain ASCII text for the names. The names should NOT contain symbols, numbers, accents, umlauts, or other characters with diacritical marks, and should NOT end in punctuation. Note that names with internal punctuation such as "St. John" or "D'Abaco" or "Doe-Smith" are okay.

examples:

incorrect: J#ane Doe, José Perez, 1Xu Weng

corrected: Jane Doe, Jose Perez, Xu Weng

The use of symbols and numbers causes an error. The error can be corrected by removing the symbols, characters with diacritical marks, numbers, or punctuation.
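The two author-name rules above can be checked before submission with a short script. The following Python sketch only approximates the rules as they are described on this page; it is not the validator's own logic, and the function name author_name_problems is hypothetical.

```python
import string

# Internal punctuation that this page says will validate
# (e.g. "St. John", "D'Abaco", "Doe-Smith"). Assumed, not exhaustive.
ALLOWED_INTERNAL = set(".'- ")


def author_name_problems(name: str) -> list[str]:
    """Return the reasons why a name part would likely be rejected."""
    problems = []
    if not name.isascii():
        problems.append("non-ASCII character")
    if any(ch.isdigit() for ch in name):
        problems.append("digit")
    if any(
        ch.isascii() and not ch.isalnum() and ch not in ALLOWED_INTERNAL
        for ch in name
    ):
        problems.append("symbol")
    if name and name[-1] in string.punctuation:
        problems.append("ends in punctuation")
    return problems
```

An empty list means that the name passes these approximate checks. For example, "St. John" and "D'Abaco" return an empty list, while "Jones." is reported as ending in punctuation.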

SEQ_DESCR.BadCollectionCode

Explanation: The culture collection is not in the list of registered institutes, or is in the wrong format, or there are multiple culture-collections in a single qualifier.

Suggestion: See the description for the proper format and list of allowed institutes, https://www.insdc.org/controlled-vocabulary-culturecollection-qualifier. Include only the culture-collection from which the sample was obtained. If the sample was deposited into multiple culture-collections, then present each culture-collection in a separate qualifier. If the culture collection is not in the list of allowed institutes, write to us with details of the culture collection. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

Note that culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. However, do not use specimen-voucher to describe host information for a microbial sequence submission.

SEQ_DESCR_BadCollectionDate

Explanation: The collection date is not in the required format.

Suggestion: Correct the collection-date source modifier so the date is in the correct format. For example, a collection-date should be formatted like this: DD-MMM-YYYY, where the month is the three-letter code in English. Alternatively, the ISO 8601 standard may be used; see descriptions and examples on the INSDC Feature Table page. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

Examples of correctly formatted collection-dates:

01-Jul-1999
Nov-2010
2008
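
The three formats listed above can be tested with a small Python sketch. It covers only the DD-MMM-YYYY, MMM-YYYY, and YYYY forms; the ISO 8601 alternative is not handled, and the function name is hypothetical.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Optional two-digit day, optional three-letter month, required four-digit year.
_DATE = re.compile(r"(?:(?:(\d{2})-)?([A-Z][a-z]{2})-)?(\d{4})")


def is_valid_collection_date(value: str) -> bool:
    """Check a collection-date against DD-MMM-YYYY, MMM-YYYY, or YYYY."""
    match = _DATE.fullmatch(value)
    if match is None:
        return False
    day, month, year = match.groups()
    if month is None:
        return True  # year only
    if month not in MONTHS:
        return False
    if day is None:
        return True  # month and year only
    try:
        # Rejects impossible dates such as 31-Feb-2001.
        date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        return False
    return True
```

The month names are matched literally rather than with a locale-dependent date parser, so the check behaves the same on systems that are not set to English.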

SEQ_DESCR_BadCountryCode

Explanation: The country code (up to the first colon) is not on the approved list of countries.

Suggestion: Correct the country source modifier with a country name on the approved country list and verify the country value is correctly formatted. If you want to include more specific location information, you must place the approved country name first, followed by a colon and then the additional information. The country has a specific format and must be formatted as follows:

<approved country name>: <region or specific area>

Examples:

Iceland
Canada: Vancouver
Atlantic Ocean: Charlie Gibbs Fracture Zone

If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
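
The "up to the first colon" rule can be illustrated in Python. In this sketch the set APPROVED_SAMPLE is a hypothetical stand-in that contains only the three names used in the examples above; a real check would need the full approved country list.

```python
# Hypothetical stand-in for the approved country list.
APPROVED_SAMPLE = {"Iceland", "Canada", "Atlantic Ocean"}


def split_country(value: str) -> tuple[str, str]:
    """Split a country value into (country name, additional detail)."""
    name, _, detail = value.partition(":")
    return name.strip(), detail.strip()


def country_is_approved(value: str, approved=APPROVED_SAMPLE) -> bool:
    """True if the text before the first colon is an approved name."""
    name, _ = split_country(value)
    return name in approved
```

With this sketch, "Canada: Vancouver" is accepted because "Canada" precedes the colon, whereas "Vancouver, Canada" is rejected because the whole string is treated as the country name.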

SEQ_FEAT.BadProteinName: Unknown or hypothetical protein should not have EC number

Explanation: The product name is "hypothetical protein" and there is an EC number.

Suggestion: If this really is a hypothetical protein, simply remove the EC number. If the EC number is correct, use that to provide a valid product name.

SEQ_DESCR.BadVoucherID

Explanation: The voucher is missing a specific identifier.

Suggestion: Correct the format of the culture-collection or specimen-voucher source modifiers. The culture-collection or specimen-voucher is missing the identifier. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

The culture-collection must be formatted like this: <institution-code>:[<collection-code>:]<culture id>. The institution code and culture ID are required; the collection-code is optional. The institution code must be valid. See the description for the proper format and list of all allowed institutions.

An example culture-collection is: CBS:1234

Culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. Do not use specimen-voucher to describe host information for a microbial sequence submission. The specimen-voucher is not required to be structured.
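
The structure of a culture-collection value can be checked with a short Python sketch. It tests only the colon-separated structure described above, not whether the institution code is on the list of allowed institutes, and the function name is hypothetical.

```python
def parse_culture_collection(value: str) -> dict:
    """Split <institution-code>:[<collection-code>:]<culture id> into parts.

    Raises ValueError if a required part is missing.
    """
    parts = [part.strip() for part in value.split(":")]
    if len(parts) == 2:
        institution, collection, culture_id = parts[0], None, parts[1]
    elif len(parts) == 3:
        institution, collection, culture_id = parts
    else:
        raise ValueError(f"expected 2 or 3 colon-separated parts: {value!r}")
    if not institution:
        raise ValueError("missing institution code")
    if collection is not None and not collection:
        raise ValueError("empty collection code")
    if not culture_id:
        raise ValueError("missing culture ID")
    return {
        "institution": institution,
        "collection": collection,
        "culture_id": culture_id,
    }
```

For the example above, "CBS:1234" parses to institution "CBS" and culture ID "1234" with no collection code, while "CBS" or "CBS:" raises an error because the identifier is missing.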

SEQ_DESCR_BioSourceMissing

Explanation: The biological source of this sequence has not been described correctly. A submission must have a source descriptor that covers the entire molecule. Please add the source information.

Suggestion: Provide an organism name for each sequence in your submission.

SEQ_FEAT.EcNumberProblem

Explanation: Apparent EC number in protein title. A product name includes a value that looks like an EC number, e.g. "L-pipecolate oxidase (1.5.3.7)".

Suggestion: Remove the EC number from the product name and place it in the EC_number qualifier. If it is something else, e.g. a TC number, then move it to a note.

SEQ_FEAT.FeatureBeginsOrEndsInGap

Explanation: A feature begins or ends in a gap.

Suggestion: Remove the feature or adjust its location to be partial and abut the gap, whichever is appropriate.

SEQ_FEAT.GenCodeInvalid and SEQ_FEAT.GenCodeMismatch

Explanation: The genetic code seems to be invalid or incorrect.

Suggestion: If the organism is a prokaryote, then include -j "[gcode=11]" in the command line to force the use of the prokaryotic genetic code. If the organism is not a prokaryote, then you can ignore this error and we will address it during processing.

SEQ_FEAT.IllegalDbXref

Explanation: The database in the db_xref has the wrong abbreviation or is not one of the allowed databases.

Suggestion: If the database that you are using is not one of the allowed databases, then change the db_xref to a note. (However, do not use GI as a db_xref because that is an internal technical database.)

SEQ_DESCR.InconsistentMolInfoTechnique

Explanation: A WGS accession appears to be present but the wgs technique is not set.

Suggestion: You can ignore this error and we will address it during processing. However, you can quiet the error yourself if you wish by including -j "[tech=wgs]" in the command line.

SEQ_INST.InternalNsInSeqRaw

Explanation: A sequence has a run of 100 or more Ns, which is most likely a gap, not a run of ambiguous bases.

Suggestion: Label the runs of Ns as assembly_gaps. If shorter runs also represent gaps, choose a smaller minimum length (e.g. 1 or 10) when converting runs of Ns to assembly_gap features with the appropriate linkage evidence. Do not simply remove internal Ns.
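
To see where such runs occur before deciding how to label them, the sequence can be scanned with a short Python sketch. The function name and the 1-based inclusive coordinates are choices made for this illustration.

```python
import re


def find_n_runs(seq: str, min_len: int = 100) -> list[tuple[int, int, int]]:
    """Return (start, end, length) for each run of at least min_len Ns.

    Coordinates are 1-based and inclusive.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [
        (m.start() + 1, m.end(), m.end() - m.start())
        for m in pattern.finditer(seq)
    ]
```

Calling find_n_runs(seq, min_len=10) lists the shorter runs as well, which helps when choosing a smaller minimum length for gap conversion.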

SEQ_FEAT.InternalStop and SEQ_INST.StopInProtein

Explanation: The InternalStop and StopInProtein errors are produced when there is an internal stop codon within the CDS.

Suggestion: The problem could be the genetic code, the location of the CDS, the reading frame of the CDS, or that the CDS cannot produce an error-free translation. Use the correct genetic code to get the correct translations. For example, include [gcode=11] for prokaryotic genome submissions. If the genetic code is correct, then adjust the CDS location, if possible. If the CDS is partial at its 5' end, then you might need to add a codon_start qualifier with a value of 2 or 3 to shift the reading frame one or two bases, respectively. If the CDS does not have an error-free translation, then add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.
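
The effect of the codon_start qualifier can be illustrated with a short Python sketch that looks for stop codons in each reading frame. It assumes that the input is the CDS nucleotide sequence in coding orientation with any introns already removed, and that the stop codons are TAA, TAG, and TGA (as in genetic code 11 and the standard code). The function names are hypothetical.

```python
# Stop codons in genetic code 11; the standard code uses the same three.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds: str, codon_start: int = 1) -> list[int]:
    """Return 1-based positions of stop codons that precede the final codon."""
    seq = cds.upper()
    codon_positions = list(range(codon_start - 1, len(seq) - 2, 3))
    # The last complete codon may be the legitimate terminal stop, so skip it.
    return [
        pos + 1
        for pos in codon_positions[:-1]
        if seq[pos:pos + 3] in STOP_CODONS
    ]


def frames_without_internal_stops(cds: str) -> list[int]:
    """Return the codon_start values (1, 2, 3) that give no internal stop."""
    return [frame for frame in (1, 2, 3) if not internal_stops(cds, frame)]
```

If a 5' partial CDS has internal stops with codon_start 1 but none with 2 or 3, the reading frame is a likely cause. If every frame has internal stops, the genetic code, the CDS location, or a /pseudo qualifier should be considered instead.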

SEQ_FEAT.InvalidQualifierValue: rRNA has no name

Explanation: rRNA features must have a product name.

Suggestion: Use the appropriate full product name for each rRNA feature, e.g. "16S ribosomal RNA"

SEQ_FEAT.InvalidInferenceValue: unrecognized database

Explanation: The database in the structured inference qualifier is not one of the expected ones.

Suggestion: See the instructions for evidence qualifiers and use one of listed acronyms. If the database that you are referring to is not on the list, then consider including the information as a /note, rather than an /inference.

SEQ_DESCR_LatLonCountry

Explanation: lat_lon and country disagree

Suggestion: The latitude-longitude (lat-lon) value provided does not map to the source country provided, so correct or remove the lat-lon values and/or country source modifiers. Provide lat-lon in decimal degrees with the compass direction (for example: 39.7 N 42.1 W) and check that the lat-lon coordinates map to the country you have provided. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

SEQ_DESCR_LatLonFormat

Explanation: The format of lat-lon should be dd.dd N|S ddd.dd E|W.

Suggestion: Correct the latitude-longitude (lat-lon) source modifier with lat-lon coordinates in decimal degree format with the compass directions. For example: 39.7 N 42.1 W. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

SEQ_DESCR_LatLonProblem

Explanation: There is a problem with the lat-lon modifier provided.

Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. Provide lat-lon in decimal degrees and include the compass direction (for example, 39.7 N 42.1 W). Longitude values range from 0 to 180E or 0 to 180W. Latitude values range from 0 to 90 N or 0 to 90 S. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

SEQ_DESCR_LatLonRange

Explanation: Latitude or longitude is out of range.

Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. Provide lat-lon in decimal degrees and include the compass direction (for example, 39.7 N 42.1 W). Longitude values range from 0 to 180E or 0 to 180W. Latitude values range from 0 to 90 N or 0 to 90 S. Numbers outside of these ranges will cause errors. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.
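
The format and range rules for lat-lon values can be combined in one Python sketch. The return labels "ok", "format", and "range" are hypothetical, and the sketch does not test whether the coordinates fall inside the stated country.

```python
import re

# Decimal degrees with compass directions, e.g. "39.7 N 42.1 W".
_LAT_LON = re.compile(r"(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])")


def check_lat_lon(value: str) -> str:
    """Classify a lat-lon string as 'ok', 'format', or 'range'."""
    match = _LAT_LON.fullmatch(value.strip())
    if match is None:
        return "format"
    latitude = float(match.group(1))
    longitude = float(match.group(3))
    # Latitude runs from 0 to 90 (N or S); longitude from 0 to 180 (E or W).
    if latitude > 90 or longitude > 180:
        return "range"
    return "ok"
```

Signed values such as "39.7, -42.1" are reported as a format problem because the compass directions are missing, and "95.0 N 42.1 W" is reported as a range problem.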

SEQ_DESCR_LatLonValue

Explanation: Latitude or longitude values appear to be in the wrong hemisphere or swapped.

Suggestion: Correct or remove the latitude-longitude (lat-lon) values in the source modifiers. The lat-lon value for the record does not agree with the source country provided. Based on the source country, the lat-lon value appears to have the incorrect hemisphere or is swapped. Check the coordinates and compass direction and provide the correct values. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

SEQ_FEAT.MisMatchAA

Explanation: The conceptual translation does not match the provided translation.

Suggestion: Make the CDS partial if it does not begin at the start codon (and extend it to the end of the sequence for incomplete prokaryotic sequences). Set the genetic code for prokaryotes ([gcode=11]) to get the correct translations.

GENERIC.MissingPubInfo: No submission citation anywhere on this entire record

Explanation: There is no submitter block.

Suggestion: Include the template when you create the .sqn submission file. You can create a template here: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ .

GENERIC.MissingPubInfo: Submission citation affiliation has no state

Explanation: The country is USA, but the state is not included in the affiliation in the submitter block.

Suggestion: Include the state in the template file (for .sqn submissions) or your submission portal profile (for fasta submissions).

GENERIC_MissingPubInfo

Explanation: The publication is missing essential information, such as title or authors.

Suggestion: Check the references. Provide author names, a title, and select the publication status (unpublished, in press, or published). If the publication is published or in press, provide additional information including publication year, journal, volume, and pages, where applicable.

GENERIC_MissingPubRequirement

Explanation: The REFERENCE that includes the submitter information is missing.

Suggestion: Make the template file and call it with the -t argument in the command line: -t template.sbt

SEQ_FEAT.MissingTrnaAA

Explanation: The amino acid that the tRNA carries is not included.

Suggestion: Include the amino acid as the product of the tRNA. If the amino acid of a tRNA is unknown, use tRNA-Xxx as the product. See prokaryotic examples and eukaryotic examples.

SEQ_DESCR.NoOrgFound

Explanation: No organism name is included.

Suggestion: Include the organism information when creating the .sqn file. When running table2asn (or tbl2asn), the organism information can be included in the definition lines of the fasta files or in the command line with -j.

SEQ_DESCR.NoPubFound

Explanation: There is no submitter block or other reference.

Suggestion: Include the template when you create the .sqn submission file. You can create a template here: https://submit.ncbi.nlm.nih.gov/genbank/template/submission/ .

SEQ_DESCR.NoSourceDescriptor

Explanation: There is no source information included.

Suggestion: Include the source by including the information in the fasta headers OR the -j argument in the command line. See the available source modifiers. For a genome you only need to include the organism and strain (for microbes) or organism and breed/ecotype/cultivar and isolate for plants and animals because the information in the BioSample will be added to the genome.
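
The bracketed modifier string for the -j argument can be assembled programmatically. In the Python sketch below, the organism and strain values are illustrative, the helper name is hypothetical, and the command list is deliberately incomplete: it shows only the -t and -j arguments mentioned on this page and omits the input and output arguments that a real run needs.

```python
def format_source_modifiers(modifiers: dict) -> str:
    """Join modifiers as "[name=value] [name=value] ..." for the -j argument."""
    return " ".join(f"[{name}={value}]" for name, value in modifiers.items())


# Illustrative values only.
mods = format_source_modifiers(
    {"organism": "Escherichia coli", "strain": "K-12", "gcode": 11}
)

# Partial argument list; input/output arguments are intentionally left out.
command = ["table2asn", "-t", "template.sbt", "-j", mods]
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with the spaces and brackets in the modifier string.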

SEQ_FEAT.NoStop

Explanation: The CDS is not marked as partial at its 3′ end and does not end with a stop codon.

Suggestion: Extend the CDS to the stop codon, or mark the 3′ end as partial (and extend the CDS to the end of the sequence for prokaryotic sequences), or add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.

SEQ_FEAT.OnlyGeneXrefs

Explanation: Features, such as CDS, refer to genes but there are no corresponding gene features.

Suggestion: Include gene features with a unique locus_tag on each gene.

SEQ_FEAT.PartialProblem PartialLocation: Start does not include first/last residue of sequence

Explanation: Since prokaryotes have very little splicing, their features need to be complete or to extend to the end of the sequence and be partial. In eukaryotes this error can be ignored if the partial is at an intron/exon boundary.

Suggestion: Extend the feature one or a few bases to the end of the sequence. If the feature is complete, remove the partial symbols. If this is only a fragment or a nonfunctional gene, change the feature′s location to be complete and add the /pseudo qualifier to the gene.

SEQ_FEAT.ShortIntron

Explanation: The CDS contains an intron shorter than 11bp, which is generally not biologically correct and is usually included to adjust for a frameshift in the sequence.

Suggestion: If the gene is frameshifted but not a pseudogene, then annotate a single gene feature across the entire span and include a pseudo qualifier to indicate that the gene is broken and cannot be translated as expected. In addition, you could include a brief note explaining the problem. If the gene is an actual pseudogene, then add the pseudogene qualifier and the appropriate TYPE to the single gene feature. Alternatively, you can include "-c s" in the table2asn command line, in which case the CDS will have a translation but it will also have the qualifier /artificial_location="low-quality sequence region" and the protein definition line will be prefaced with "LOW QUALITY PROTEIN:"
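
Short introns can be located from the exon coordinates of a CDS before validation. The Python sketch below assumes that the exons are given as 1-based inclusive (start, end) pairs on one strand; the function name is hypothetical.

```python
def short_introns(exons, min_len: int = 11):
    """Return (start, end, length) for introns shorter than min_len bp.

    exons: iterable of 1-based inclusive (start, end) pairs.
    """
    ordered = sorted(exons)
    found = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        length = next_start - prev_end - 1
        # Adjacent or overlapping intervals (length <= 0) are not introns.
        if 0 < length < min_len:
            found.append((prev_end + 1, next_start - 1, length))
    return found
```

For exons at 1..100, 103..200, and 500..600, only the 2 bp gap at 101..102 is reported, which is the kind of interval that usually marks a frameshift adjustment rather than a real intron.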

SEQ_INST.ShortSeq

Explanation: This warning is triggered by proteins that are shorter than ten amino acids.

Suggestion: This is probably fine and will not cause problems with your submission, but you should investigate and decide whether you think these really exist. If there are lots of them and they are just short ORF calls by the annotation tool, then we recommend that you remove them unless you think that they are real.

SEQ_FEAT.StartCodon and SEQ_INST.BadProteinStart

Explanation: The StartCodon and BadProteinStart errors are produced when the CDS is not marked as partial at its 5′ end and does not begin with a start codon.

Suggestion: Use the correct genetic code to get the correct translations. For example, include [gcode=11] for prokaryotic genome submissions. Other possible fixes include: extend the CDS to the start codon, or mark the 5′ end as partial (and extend the CDS to the end of the sequence for prokaryotic sequences), or add the /pseudo qualifier to the gene to indicate that the CDS cannot be translated.

SEQ_INST.TerminalNs

Explanation: There are Ns at the beginning or end of the sequence.

Suggestion: Remove Ns from the beginning and end of the sequence or indicate that the sequence is circular, if that is applicable.
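
Removing terminal Ns from a linear sequence can be sketched in Python as follows. The function name is hypothetical. The count of removed leading bases is returned because any feature coordinates on the sequence must be shifted by that amount.

```python
def trim_terminal_ns(seq: str) -> tuple[str, int]:
    """Strip leading and trailing Ns; internal Ns are left untouched.

    Returns (trimmed sequence, number of leading bases removed).
    """
    leading = len(seq) - len(seq.lstrip("Nn"))
    return seq.strip("Nn"), leading
```

For example, trimming "NNACGTNACGNNN" gives "ACGTNACG" with 2 leading bases removed, so a feature that started at position 3 now starts at position 1.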

SEQ_FEAT.TransLen

Explanation: The length of the protein does not match the provided protein length

Suggestion: Recreate the .sqn file and if the error persists, send your file to us with a description of how you created it and a request to help fix the error.

SEQ_FEAT.UnknownFeatureQual: orig_protein_id

Explanation: An older version of table2asn_GFF was used to convert a GFF file that had pseudo=true or pseudogene=true

Suggestion: Download the current version of table2asn_GFF and use it to make your submission.

SEQ_DESCR.UnstructuredVoucher

Explanation: The voucher needs to be structured as "<institution-code>:[<collection-code>:]<culture id>".

Suggestion: Correct the format of the culture-collection source modifier. The institution code and culture ID are required; the collection-code is optional. Follow the formatting instructions in the explanation. The culture collection must have a valid institution code followed by a colon and the culture ID. See the list of allowed institutes. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

For example CBS:1234

In this example, CBS is the institution code and 1234 is the culture ID. There must be a colon between the institution code and the culture ID.

SEQ_DESCR.UnwantedCompleteFlag

Explanation: The sequence is listed as complete, but there is missing information elsewhere in the record

Suggestion: You can ignore this error when you have submitted a complete chromosome or plasmid or organelle.

SEQ_FEAT.WrongQualOnImpFeat

Explanation: The feature has an illegal qualifier.

Suggestion: Find the legal qualifiers for each feature in the Feature Table.

SEQ_DESCR_WrongVoucherType

Explanation: The institution (or institution: collection) code normally uses a different bio material/culturecollection/specimen voucher type.

Suggestion: In the source modifiers, use the source modifier "culture-collection" instead of "specimen-voucher" or vice versa. For example, if you provided the source modifiers in a tab-delimited table, edit the table so the column header "culture-collection" is used in place of "specimen-voucher" and upload the revised table. If the error is from a genome that was created from a fasta submission, then the information comes from the BioSample, which will need to be updated. Send the correct source information to us at genomes@ncbi.nlm.nih.gov, and we will update the genome and BioSample. Be sure to include the SUBid of the genome submission and the accession of the BioSample (SAMNxxxxxxxx), if one has been assigned.

Note that culture-collection should be used for microbial sequences, while specimen-voucher should be used for plants and animals only. Do not use specimen-voucher to describe host information for a microbial sequence submission.
