For information on Eukaryotic Genome Annotation and Assembly, go here.
For information on the prokaryotic submission check tool go here.
The standalone package is available on FTP.
For specific instructions, check the README file.
For specific procedures, check the NCBI Annotation Procedures.
Overview
The Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) is currently under development.
The pipeline is intended for use during the annotation of genomes in preparation for submission to
GenBank, and
several external groups have used the NCBI Annotation Pipeline to prepare their submissions.
The pipeline is capable of annotating complete genomes as well as WGS genomes consisting of
multiple contigs (at least 200 bases per contig).
The pipeline has been used in the RefSeq project to improve the annotation of complete microbial genomes (Daraselia et al., 2003).
If you are interested in using PGAAP, please contact us at
NCBI Genomes
For detailed instructions, view the README file.
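The 200-base minimum contig length mentioned above can be checked before sending sequences in. The following is a minimal sketch, not part of the pipeline itself: the function names and the in-memory list of FASTA lines are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

MIN_CONTIG_LENGTH = 200  # minimum bases per contig, as stated in the overview


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            current = (line[1:].split() or ["unnamed"])[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def short_contigs(lines: Iterable[str], minimum: int = MIN_CONTIG_LENGTH) -> List[str]:
    """List contigs that fall below the minimum length."""
    return [name for name, n in contig_lengths(lines).items() if n < minimum]
```

To check a file on disk, pass an open file handle, for example `short_contigs(open("contigs.fsa"))`, where the file name is hypothetical.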
Pipeline
The PGAAP combines HMM-based gene prediction methods with a sequence similarity-based approach that compares the predicted gene products to the non-redundant protein database, Entrez Protein Clusters, the Conserved Domain Database, and the COGs (Clusters of Orthologous Groups).
Submitters requesting the use of the annotation pipeline for their genomic sequences submit them to NCBI in FASTA format.
Gene predictions are made using a combination of GeneMark and Glimmer (Borodovsky and McIninch, 1993;
Lukashin and Borodovsky, 1998; Delcher et al., 1999). A short step then resolves
conflicts between predicted start sites. Ribosomal RNAs are predicted by sequence similarity searching
using BLAST against an RNA sequence database and/or using Infernal and Rfam models. Transfer RNAs are predicted using tRNAscan-SE (Lowe and Eddy, 1997). To detect missing
genes, a complete six-frame translation of the nucleotide sequence is generated and the proteins predicted above
are masked. All predictions are then searched using BLAST against all proteins from complete microbial
genomes. Annotation is based on comparison to protein clusters and on the BLAST results. Conserved Domain Database and Cluster of
Orthologous Group information is then added to the annotation. Frameshift detection and cleanup follow, and the final output is sent back to the submitters,
who can analyze the results in preparation for submission to GenBank.
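The six-frame translation step used to look for missing genes can be illustrated with a short sketch. This is not the pipeline's own code. It assumes the standard amino acid assignments (which bacterial translation table 11 shares), and it assumes masking is done by overwriting already annotated nucleotide intervals with N; the source does not say how the pipeline masks predicted proteins. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterable, Tuple

# Codon table built in TCAG order (standard amino acid assignments).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def mask(seq: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Replace 0-based, half-open intervals with N (assumed masking scheme)."""
    masked = list(seq)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(masked))
        if start < end:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


def translate(seq: str, frame: int) -> str:
    """Translate one frame; codons containing N or other symbols become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def six_frame(seq: str) -> Dict[str, str]:
    """Translate all three forward and three reverse-complement frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames: Dict[str, str] = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(seq, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames
```

Calling `six_frame(mask(sequence, known_genes))` leaves only the unannotated regions as readable protein sequence, which could then be searched by similarity as described above.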
End Products
The end product of the annotation pipeline can be used to submit to GenBank.
For each genomic contig, the annotation results include:
- DNA FASTA - *.fsa files
- Feature table in Sequin format - *.tbl files
- ASN.1 produced from pairs of table and FASTA sequence files - *.sqn files
- GenBank format produced from the ASN.1 - *.gbf files
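When many contigs are returned, it can help to confirm that each one has all four result files. The sketch below assumes that the files for a contig share a base name and differ only in extension; that naming scheme and the function name are illustrative assumptions, not something the pipeline documentation states.

```python
from collections import defaultdict
from pathlib import PurePath
from typing import Dict, Iterable, List

# The four per-contig result types listed above.
EXPECTED_SUFFIXES = {".fsa", ".tbl", ".sqn", ".gbf"}


def missing_outputs(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Map each contig base name to the result extensions it lacks."""
    found = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        if path.suffix in EXPECTED_SUFFIXES:
            found[path.stem].add(path.suffix)
    return {
        stem: sorted(EXPECTED_SUFFIXES - suffixes)
        for stem, suffixes in found.items()
        if suffixes != EXPECTED_SUFFIXES
    }
```

An empty dictionary means every contig seen has a complete set of files.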
For Bacterial Genome Submission Guidelines, see
this page.
Supplementary data are available upon request for further manual evaluation and analysis of the annotation results:
- BLAST results of predicted proteins against the NCBI non-redundant protein and protein clusters databases
- Domain assignments for each protein, produced by running RPS-BLAST against the CDD database
- COG assignments produced by running Cognitor against the COG database
README and Submission
It is essential that the submission is in the proper format before we can proceed. This README file
shows the correct steps and file formats.
Anyone wishing to submit sequences to the annotation pipeline must contact us first at:
NCBI Genomes
References
1. GeneMark
Borodovsky M and McIninch J.
1993. GeneMark: Parallel Gene Recognition for both DNA Strands.
Comput. Chem. 17: 123-133.
2. GeneMark.hmm
Lukashin A. and Borodovsky M.
1998. GeneMark.hmm: new solutions for gene finding.
Nucleic Acids Res. 26: No. 4, pp. 1107-1115.
PMID: 9461475
3. GeneMarkS
Besemer, J., Lomsadze, A., and Borodovsky, M.
2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.
Nucleic Acids Res. 29: 2607-2618.
PMID: 11410670
4. Glimmer
Delcher A L, Harmon D, Kasif S, White O and Salzberg S L.
1999. Improved microbial gene identification with GLIMMER.
Nucleic Acids Res. 27: 4636-4641.
PMID: 10556321
5. Shewanella oneidensis
Daraselia N, Dernovoy D, Tian Y, Borodovsky M, Tatusov R, Tatusova T.
2003. Reannotation of Shewanella oneidensis genome.
OMICS. 7(2): 171-175.
PMID: 14506846
6. Rfam
Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S.R.
2003. Rfam: an RNA family database.
Nucleic Acids Res. 31: 439-441.
PMID: 15608160
7. tRNAscan-SE
Lowe, T.M. & Eddy, S.R.
1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.
Nucl. Acids Res. 25: 955-964.
PMID: 9023104
8. Infernal
Eddy, S.R.
2002. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure.
BMC Bioinformatics. 3: 18.
PMID: 12095421
9. Prokaryotic Submission Tool
Pruitt, K, et al.
2008. NCBI Reference Sequences: current status, policy and new initiatives.
Nucleic Acids Research, 2008, Epub.
PMID: 18927115
10. Protein Clusters
Klimke, W, et al.
2008. The National Center for Biotechnology Information's Protein Clusters Database.
Nucleic Acids Research, 2008, Epub.
PMID: 18940865
Revised
Feb 6, 2008