Overview

This database consists of all proteins from complete genomes that are used in the protein cluster database (ProtClustDB). Clusters of similar proteins are precalculated at the genus level, and one representative is chosen from each cluster to reduce the dataset. Eliminating redundant proteins shortens search times while providing a broader taxonomic view.

About this database

The vast increase in genomic sequences has led to a flood of data into the protein databases as well. Many strain-specific genomes are now being sequenced (for example, Streptococcal genomes). The result can be an overwhelming amount of data to look through when executing BLAST similarity searches. The concise protein database was constructed both to ease the processing of this data and to present a broader taxonomic view.

All proteins from complete genomes are compared by BLAST (all against all). Protein clusters are constructed from the results such that all top BLAST hits are combined into a single cluster: each protein in a given cluster must have all other proteins in the cluster as top BLAST hits.
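The clustering rule above can be illustrated with a minimal sketch. It assumes the all-against-all results have already been reduced to a mapping from each protein to the set of its top BLAST hits; the function name `build_clusters`, the greedy strategy, and the toy protein labels are illustrative assumptions, not the method actually used to build ProtClustDB.

```python
def build_clusters(top_hits):
    """Group proteins so every member has every other member as a top hit.

    top_hits: dict mapping a protein ID to the set of protein IDs that are
    its top BLAST hits (hypothetical input format, assumed for this sketch).
    Returns a list of clusters, each a list of protein IDs.
    """
    clusters = []
    assigned = set()
    for protein in sorted(top_hits):
        if protein in assigned:
            continue
        cluster = [protein]
        for candidate in sorted(top_hits[protein]):
            if candidate == protein or candidate in assigned:
                continue
            # The candidate joins only if the top-hit relation holds in both
            # directions with every protein already in the cluster.
            if all(
                candidate in top_hits.get(member, set())
                and member in top_hits.get(candidate, set())
                for member in cluster
            ):
                cluster.append(candidate)
        # A cluster needs at least two proteins.
        if len(cluster) >= 2:
            clusters.append(cluster)
            assigned.update(cluster)
    return clusters


# Toy data: A, B and C are mutual top hits; D points at A but is not
# reciprocated, so it stays unclustered.
example_hits = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),
}
example_clusters = build_clusters(example_hits)
```

The greedy pass is a simplification chosen for brevity; it is order-dependent and does not guarantee the largest possible clusters.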

Clusters may span a large taxonomic branch (kingdom) or may reside at a specific node (family, genus, species, etc.). Clusters may consist of many proteins or of only two. From this entire set of clusters, only the genus-specific clusters are used for this database. From the proteins in each genus-level cluster, one is randomly selected as the representative for the Concise Microbial Protein BLAST database and is the one searched in BLAST queries. The other proteins in the cluster are automatically linked to this representative and will also appear in the search results, although without a BLAST score or E-value because they are not themselves searched. All proteins that do not belong to a genus-level cluster are also added to the database for completeness.
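As a rough illustration of the selection step described above, the following sketch picks one random representative per genus-level cluster, links the remaining members to it, and adds every unclustered protein. The function name `build_concise_db`, its return shape, and the protein labels are assumptions made for this example only.

```python
import random


def build_concise_db(genus_clusters, all_proteins, rng=None):
    """Return (searchable, linked) for a concise database.

    genus_clusters: iterable of sets of protein IDs (genus-level clusters).
    all_proteins: iterable of every protein ID from the complete genomes.
    rng: optional random.Random instance, so the choice can be reproduced.
    """
    if rng is None:
        rng = random.Random()
    linked = {}  # representative -> other members of its cluster
    clustered = set()
    for members in genus_clusters:
        ordered = sorted(members)
        representative = rng.choice(ordered)  # randomly selected
        linked[representative] = [m for m in ordered if m != representative]
        clustered.update(ordered)
    # Proteins outside any genus-level cluster are kept for completeness.
    unclustered = [p for p in all_proteins if p not in clustered]
    searchable = sorted(linked) + unclustered
    return searchable, linked


searchable, linked = build_concise_db(
    genus_clusters=[{"A", "B", "C"}],
    all_proteins=["A", "B", "C", "D"],
    rng=random.Random(0),
)
```

Only the entries in `searchable` would be compared against a query; the proteins in `linked` are reported alongside their representative without a score of their own.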

The result will be faster processing times and reduced load on the database. The broader taxonomic view will help eliminate some of the redundancy that arises when many proteins of closely related organisms appear in BLAST results.

Query Page

Queries can be either protein or nucleotide sequences, which are searched with the blastp and blastx programs, respectively. Accessions, GIs, or sequences in FASTA format can be entered in the query box.
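The pairing of query type and program can be sketched as a small helper. The function `choose_program`, the 0.9 cutoff, and the residue-counting heuristic are assumptions for illustration; the page itself does not describe how the query type is detected, and the sketch handles only raw or FASTA-formatted sequences, not accessions or GIs.

```python
def choose_program(query, nucleotide_cutoff=0.9):
    """Guess whether a sequence is nucleotide or protein and pick a program.

    Returns "blastx" for nucleotide queries and "blastp" for protein queries.
    The cutoff is an arbitrary value chosen for this sketch.
    """
    # Drop FASTA header lines, which begin with ">".
    lines = [ln for ln in query.splitlines() if not ln.startswith(">")]
    residues = [c for c in "".join(lines).upper() if c.isalpha()]
    if not residues:
        raise ValueError("query contains no sequence residues")
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    fraction = nucleotide_like / len(residues)
    return "blastx" if fraction >= nucleotide_cutoff else "blastp"


dna_program = choose_program(">query1\nATGGCGTACGTTAGC")
protein_program = choose_program("MKTAYIAKQRQISFVKSHFSRQ")
```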

Default parameters are set below the query box. The expect threshold is set low, which helps reduce the number of BLAST hits returned. Information on each parameter is available by clicking on its name.

Results Page

The results page is not the one typically returned for BLAST results, although a link is provided to view the results in standard format.

The query is shown along with its length and the number of hits, given both for total proteins and for the proteins represented by the genus-level clusters.

Results are returned in a collapsed table format. Genus-level clusters are marked with a plus (+) sign at each level, which can be expanded. The table is sortable by organism name and by BLAST score. Because only one protein from each genus-level cluster is searched when a query is submitted, the other proteins in a cluster have neither a BLAST score nor an E-value.
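One way to picture the collapsed table is as a list of rows, each optionally carrying the linked cluster members that appear when the row is expanded. The `Row` class, its field names, and the accession strings below are hypothetical placeholders; they only model the behaviour described above, in which linked members have no score or E-value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Row:
    organism: str
    accession: str
    score: Optional[float] = None    # None for linked cluster members
    evalue: Optional[float] = None   # None for linked cluster members
    members: List["Row"] = field(default_factory=list)


def sort_by_score(rows):
    """Highest score first; rows without a score go last."""
    return sorted(rows, key=lambda r: (r.score is None, -(r.score or 0.0)))


def sort_by_organism(rows):
    return sorted(rows, key=lambda r: r.organism.lower())


def expand(row):
    """Return the representative followed by its linked members."""
    return [row] + list(row.members)


rows = [
    Row("Streptococcus", "ACC_1", 120.0, 1e-30,
        members=[Row("Streptococcus", "ACC_2")]),
    Row("Bacillus", "ACC_3", 250.0, 1e-60),
    Row("Escherichia", "ACC_4"),
]
by_score = [r.accession for r in sort_by_score(rows)]
by_organism = [r.organism for r in sort_by_organism(rows)]
expanded = [r.accession for r in expand(rows[0])]
```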

The table shows the organism name, protein name, accession, length, locus_tag, BLink, BL2seq, score, and E-value. Links to the taxonomy (organism name), protein (protein name, accession), and gene (locus_tag) Entrez databases are available for each protein. The BLink link shows precomputed BLAST results for that specific protein, while the BL2seq link runs a BLAST comparison of the query against that specific protein in the table.

BLAST Help

Help and information on BLAST are available from the main BLAST page.

Microbial genomes can be searched here.

Last modified Sept 6, 2006