Prokaryotic RefSeq Genomes Frequently Asked Questions (FAQ)

Related documentation
Why has NCBI discontinued some prokaryotic Gene records?
How do I find a Gene record for the same species as a discontinued record, or for a given non-redundant protein?
Will discontinued Gene records continue to be accessible?
Do discontinued Gene records include information about suppressed NP_ and YP_ accessions that have been replaced with a non-redundant WP_ accession?
Why has NCBI removed so many bacterial protein records with the accession prefix NP_ or YP_?
Why have the locus_tags changed on RefSeq bacterial genomes, compared to the submitted GenBank entry?
How do I find the best replacement for a discontinued locus_tag?
Why are some type strain bacterial genomes not designated as a RefSeq reference genome?
How do I find a replacement for a removed protein accession?
How do I find the nucleotide coding sequence (CDS) for a non-redundant protein record?
How do I find the list of genomes that include a CDS annotation that cross-references a given non-redundant protein accession?
What should I do if I think the name given to a non-redundant RefSeq protein in the DESCRIPTION line or the protein /product line is wrong?
Can I add a publication to a non-redundant protein record?
How can I review what genes are annotated nearby for a given non-redundant RefSeq protein?
How can I access species- or strain-specific protein datasets?

Why has NCBI discontinued some prokaryotic Gene records?

NCBI is re-annotating all RefSeq archaeal and bacterial genomes to improve consistency across these datasets. As part of this process, NCBI is providing Gene records only for the reference genomes of a given prokaryotic species. Pre-existing Gene records for archaeal and bacterial genomes that are not reference genomes have been, or will be, discontinued. Notes are being placed on affected Gene records to provide more details.

How do I find a Gene record for the same species as a discontinued record, or for a given non-redundant protein?

In some cases, a previous Gene record has been tracked as replaced by an orthologous record from a related strain. In these cases, the original record includes an information message that links to the replacement Gene entry. For entries that have not been tracked in this manner, it is still possible to find related Gene entries by using NCBI links from proteins to Gene as follows:

Navigate to the protein record of interest in NCBI's Protein database.
In the "Related information" section in the right column of the page, follow the link to "Identical Proteins" or to "Related Sequences".
In the "Find Related data" section of the right column, select the Gene database, then click the "Find Items" button.

If this approach does not return a Gene record, you can also try to find a related Gene record by running a blastp query against the "Reference Proteins (refseq_protein)" database. Follow the link in the "Related information" panel of the BLAST results to the Gene record.
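The protein-to-Gene navigation described above can also be approximated with NCBI's E-utilities. The following is a minimal sketch, not an official recipe: it only builds an ELink request URL from the Protein database to the Gene database. The function name and the placeholder accession are hypothetical, and the sketch assumes the service accepts an accession.version as the id (a numeric UID can be passed instead).

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def protein_to_gene_link_url(protein_id: str) -> str:
    """Build an ELink URL that asks for Gene records linked to one protein.

    protein_id is assumed to be an accession.version or a numeric UID.
    """
    params = {"dbfrom": "protein", "db": "gene", "id": protein_id}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# Placeholder accession, used only to show the shape of the URL.
print(protein_to_gene_link_url("WP_000000001.1"))
```

An empty result from such a request would correspond to the case described above, where no Gene record is linked and a blastp search is the fallback.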

Will discontinued Gene records continue to be accessible?

Yes, discontinued records are still available in the Gene resource. Discontinued records are not updated, with the exception of the graphical display in the "Genomic regions, transcripts, and products" section of the page, which shows the current annotation for the RefSeq accession.version and coordinates. If that RefSeq genome has also been suppressed, this display will not change. If the RefSeq genome continues to be public and undergoes future annotation updates, the annotation of that sequence range may change and will be automatically presented on the discontinued Gene entry. Additional updates are made at times to the informational messages that appear at the top of discontinued Gene records. We are working to add an improved message to the large set of bacterial Gene records that were suppressed in the first quarter of 2015. Additional information is available on the RefSeq bacterial re-annotation project page.

Do discontinued Gene records include information about suppressed NP_ and YP_ accessions that have been replaced with a non-redundant WP_ accession?

An information message will be added to the top of the Gene full report for the set of bacterial Gene entries that were removed in the first quarter of 2015, which corresponds to the RefSeq bacterial complete genome re-annotation project and the revised definition of scope for Gene. The provided message includes information on the new locus_tag and replacement non-redundant protein accession when available. This informational message will become available shortly after the FTP installation of comprehensive RefSeq release 70 (May 2015).

Why has NCBI removed so many bacterial protein records with the accession prefix NP_ or YP_?

NCBI has implemented a new data model for managing prokaryotic genomes to address concerns about data redundancy. This new management plan provides non-redundant RefSeq protein records, with the accession prefix 'WP_'. At the end of 2014 and into the first quarter of 2015 we re-annotated the RefSeq bacterial complete genomes; this resulted in the removal of nearly 7 million NP_ and YP_ accessions as these genomes were updated to directly cross-reference the new non-redundant protein records (WP_ accessions). All RefSeq prokaryotic genomes that are annotated with a CDS which translates to the identical protein sequence are now being annotated with a non-redundant protein accession. An exception is made for a subset of RefSeq prokaryotic 'reference genomes', which continue to be annotated with 'NP_' or 'YP_' accessions that in turn cross-reference a non-redundant protein accession. All other prokaryotic genomes will be annotated with non-redundant WP_ accessions.
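When processing a list of accessions, the prefix convention described above can be checked with a few lines of code. This is an illustrative sketch only; the function name and the category labels ("non-redundant", "genome-specific", "other") are assumptions made for the example and are not NCBI terminology.

```python
def classify_refseq_protein(accession: str) -> str:
    """Classify a RefSeq protein accession by its three-character prefix."""
    prefix = accession.strip().upper()[:3]
    if prefix == "WP_":
        # Non-redundant protein record shared by all genomes encoding it.
        return "non-redundant"
    if prefix in ("NP_", "YP_"):
        # Accession tied to the annotation of one particular genome.
        return "genome-specific"
    return "other"


# Placeholder accessions, used only to demonstrate the prefixes.
for acc in ("WP_000000001.1", "YP_000000001.1", "NP_000000001.1"):
    print(acc, classify_refseq_protein(acc))
```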

Why have the locus_tags changed on RefSeq bacterial genomes, compared to the submitted GenBank entry?

Locus_tags have changed on most prokaryotic RefSeq genomes as a result of re-annotation by NCBI using the Prokaryotic Genome Annotation Pipeline (PGAP). Before the re-annotation, the annotation available on most RefSeq bacterial genomes was identical to that available in the submitted GenBank genome, and thus it was appropriate to retain locus_tags from the submitted genome annotation. However, with the re-annotation project, in some cases CDS coordinates have changed, unsupported CDSs have been removed, or new supported CDSs have been added; therefore, it was necessary to provide new locus_tags in order to comprehensively report this data type for RefSeq bacterial genomes.

How do I find the best replacement for a discontinued locus_tag?

In many cases, the original locus_tag is still annotated on the current RefSeq bacterial genome record as an /old_locus_tag qualifier along with the new locus tag for the gene feature.

One example is the annotation of the 30S ribosomal protein S9 gene on the Pseudomonas fluorescens A506 genome (NC_017911.1). The FEATURES table shows both the discontinued locus_tag, PflA506_0831, and the current locus tag, PFLA506_RS04145.
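For a whole genome record, the pairing of /old_locus_tag and /locus_tag qualifiers can be collected from the GenBank flat file with a small parser. The sketch below is a simplified illustration, not a full GenBank parser: it assumes each qualifier sits on a single line and that feature keys start after exactly five spaces. The sample uses the two locus_tags from the example above, but its coordinates are placeholders.

```python
import re

# A feature key (gene, CDS, ...) starts after exactly five spaces.
FEATURE_START = re.compile(r"^ {5}\S")
# Qualifier lines of interest, e.g. /locus_tag="..." or /old_locus_tag="...".
QUALIFIER = re.compile(r'^\s+/(locus_tag|old_locus_tag)="([^"]+)"')


def old_to_new_locus_tags(genbank_text: str) -> dict:
    """Map each /old_locus_tag value to the /locus_tag of the same feature."""
    features = []
    for line in genbank_text.splitlines():
        if FEATURE_START.match(line):
            features.append({"new": None, "old": []})
            continue
        match = QUALIFIER.match(line)
        if match and features:
            key, value = match.groups()
            if key == "locus_tag":
                features[-1]["new"] = value
            else:
                features[-1]["old"].append(value)
    mapping = {}
    for feature in features:
        if feature["new"]:
            for old in feature["old"]:
                mapping[old] = feature["new"]
    return mapping


# Hand-written snippet; the coordinates are placeholders, not real values.
SAMPLE = "\n".join([
    "     gene            1..393",
    '                     /locus_tag="PFLA506_RS04145"',
    '                     /old_locus_tag="PflA506_0831"',
])
print(old_to_new_locus_tags(SAMPLE))
```

Genes that were newly added by the re-annotation carry no /old_locus_tag and therefore do not appear in the mapping.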

A supplemental report (release70.bacterial-reannotation-report) is provided by FTP with RefSeq release 70; it includes information to support mapping of protein accessions as well as locus_tags. This report is initially provided in the FTP RefSeq release-catalog directory area and will be moved to the release-catalog/archive/ directory when the July 2015 RefSeq release 71 is installed.

Why are some type strain bacterial genomes not designated as a RefSeq reference genome?

Type strain information is considered when reference genomes are selected, but the genomic sequence data must still be of sufficient quality to be a reference RefSeq genome (RefSeq genome annotation criteria). If you have questions about type strain genomes that you feel are of high quality but are not tracked as a RefSeq reference genome, please write to info@ncbi.nlm.nih.gov with supporting details about quality so we can review the situation.

How do I find a replacement for a removed protein accession?

Removed protein accessions can still be accessed in NCBI's Protein resource when querying by protein accession or gi. A custom message has been provided with links to the replacement non-redundant protein for a subset of suppressed NP_ and YP_ protein accessions. These suppressed accessions have been replaced by, and are identical to, a non-redundant protein record; the same nucleotide accession.version and coordinates that used to cross-reference an NP_ or YP_ accession now cross-reference a non-redundant WP_ accession. We are working to expand this navigation support to annotation updates that resulted in a small change to the CDS feature coordinates, such that the original NP_/YP_ accession is very similar to, but not identical to, the replacement non-redundant WP_ accession.

Image of informative message added to suppressed bacterial YP/NP accessions.

How do I find the nucleotide coding sequence (CDS) for a non-redundant protein record?

Retrieve the record in the Protein database and then click the link at the top of the page to the Identical Protein table (or use the Display Settings menu to navigate to this view). In the table, find the organism that you want and use the link in the CDS Region in Nucleotide column to go to the region of the genomic sequence that corresponds to the coding region (CDS) for the protein in the organism of interest.
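Once the nucleotide accession, coordinates, and strand of the CDS are known from the Identical Protein table, the coding sequence can also be requested programmatically. The sketch below builds an EFetch URL for that region; it assumes the E-utilities seq_start, seq_stop, and strand parameters (strand 1 for plus, 2 for minus) and uses placeholder coordinates. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def cds_fasta_url(nucleotide_accession: str, start: int, stop: int, strand: str) -> str:
    """Build an EFetch URL for the FASTA sequence of one CDS region.

    start and stop are 1-based coordinates; strand is "+" or "-".
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    params = {
        "db": "nuccore",
        "id": nucleotide_accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        # EFetch encodes the plus strand as 1 and the minus strand as 2.
        "strand": 1 if strand == "+" else 2,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder coordinates on the genome accession mentioned in this FAQ.
print(cds_fasta_url("NC_017911.1", 100, 492, "-"))
```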

How do I find the list of genomes that include a CDS annotation that cross-references a given non-redundant protein accession?

From a given non-redundant protein accession, change the Display Settings to the Identical Protein Report. A link is provided to this display at the top of the record view, where the link to FASTA format is also found. The Identical Protein Report page shows the nucleotide sequence and CDS annotation location that the non-redundant protein is annotated on. The report also includes the organism information for the nucleotide record, and lists additional protein sequences that are identical in sequence to the RefSeq non-redundant protein.

The tabular report can be downloaded by using the Send to link on the upper right side of the page (Send to -> File).
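A downloaded report can then be filtered with a short script. The following sketch assumes the file is tab-delimited and has a header row with columns named "Nucleotide Accession", "Start", "Stop", "Strand", "Protein", and "Organism"; these column names are an assumption about the file layout and should be checked against the actual download. The function name is hypothetical.

```python
import csv
import io


def genomes_for_protein(report_text: str, protein_accession: str) -> list:
    """List the nucleotide records whose CDS references the given protein.

    Returns tuples of (nucleotide accession, start, stop, strand, organism).
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    hits = []
    for row in reader:
        # Keep only rows for the requested protein, skipping the identical
        # proteins that are listed under other accessions.
        if row["Protein"] == protein_accession:
            hits.append((
                row["Nucleotide Accession"],
                int(row["Start"]),
                int(row["Stop"]),
                row["Strand"],
                row["Organism"],
            ))
    return hits
```

For a file saved from the report page, the text can be read with open(path, encoding="utf-8").read() and passed to the function together with the WP_ accession of interest.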

What should I do if I think the name given to a non-redundant RefSeq protein in the DESCRIPTION line or the protein /product line is wrong?

Please write to the help desk (info@ncbi.nlm.nih.gov), describe the problem, and provide the evidence you have for a new protein name.

Can I add a publication to a non-redundant protein record?

At this time, publications cannot be added to non-redundant protein records because each record represents a pure sequence object that is found, in many cases, on genomes from multiple strains or even species.

How can I review what genes are annotated nearby for a given non-redundant RefSeq protein?

From a non-redundant protein record, navigate to the Identical Proteins report page (using the link at the top of the record, or the Display Settings menu). Select the organism of interest from the table, then follow the link in the "CDS Region in Nucleotide" column. You can review the annotation for an expanded region in the GenBank format by changing the Region shown, or view the annotation in a graphical context by following the link to "Graphics". Once you are in the graphics view, you can zoom out to view a graphical display of the neighboring gene annotations.
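When the expanded region is chosen in a script rather than in the browser, the window around the CDS has to be kept within the bounds of the sequence. The sketch below shows one way to compute such a window; the function name and the flank size are illustrative assumptions, and the sketch does not handle wrap-around at the origin of circular replicons.

```python
def flanking_window(start: int, stop: int, flank: int, seq_length: int) -> tuple:
    """Expand a 1-based CDS interval by `flank` bases on each side.

    The result is clamped to the range 1..seq_length.
    """
    if start > stop:
        raise ValueError("start must not be greater than stop")
    window_start = max(1, start - flank)
    window_stop = min(seq_length, stop + flank)
    return window_start, window_stop


# Placeholder coordinates: a CDS near the start of a 5,000-base sequence.
print(flanking_window(500, 900, 1000, 5000))
```

The resulting coordinates can be entered as the "Region shown" range on the nucleotide record.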

How can I access species- or strain-specific protein datasets?

Species- or strain-specific protein datasets for individual RefSeq genomes can be obtained online, by FTP, and through NCBI's programming utilities. To access data online, navigate to the annotated genome record(s) in NCBI's Nucleotide database, use the right-column option to "Find related data" in the Protein database, then download the protein records using the upper-right "Send to" wizard.

To access proteins for specific species or strains by FTP, navigate to NCBI's Datasets resource for that species or strain, then use the link to the RefSeq FTP site, where you can download sequence and annotation data in a variety of formats.

To access data using NCBI programming utilities one must provide the genomic accession(s) and use the eLink function to access the linked protein data (see documentation http://www.ncbi.nlm.nih.gov/books/NBK25501/).
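As a rough illustration of the eLink approach, the sketch below builds a request from genomic accessions to the Protein database and extracts the linked protein identifiers from the XML response. It is a sketch under stated assumptions: the function names are hypothetical, repeating the id parameter is assumed to yield one LinkSet per genome, and the parser assumes the usual eLinkResult layout (LinkSet, IdList/Id, LinkSetDb/Link/Id).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genome_to_protein_link_url(accessions: list) -> str:
    """Build an ELink URL from genomic records to their linked proteins.

    Each accession is sent as a separate id parameter so that the
    response keeps the links of each genome in a separate LinkSet.
    """
    params = [("dbfrom", "nuccore"), ("db", "protein")]
    params += [("id", accession) for accession in accessions]
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def linked_protein_ids(elink_xml: str) -> dict:
    """Map each source id in an ELink XML response to its linked ids."""
    root = ET.fromstring(elink_xml)
    links = {}
    for linkset in root.iter("LinkSet"):
        source = linkset.findtext("IdList/Id")
        targets = [node.text for node in linkset.findall("LinkSetDb/Link/Id")]
        # Remove duplicates that arise when several link types overlap.
        links[source] = list(dict.fromkeys(targets))
    return links
```

The identifiers returned by the second function could then be passed to EFetch against the Protein database to download the protein records themselves.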
