COG Database Grows to 3300 Protein Clusters and 44 Complete Genomes
Based on research conducted by NCBI's comparative genomics group,
the database of Clusters of Orthologous Groups of proteins (COGs) represents
a phylogenetic classification of proteins encoded in complete genomes.
The COGs are derived from an all-against-all sequence comparison
of the encoded proteins. Each COG consists of individual proteins or
groups of paralogs from at least three lineages and is therefore considered
to correspond to an ancient conserved domain. The database is designed
to support research on genome evolution as well as functional annotation
of genomes.
At its inception in 1997, the database included 720 clusters from 7
genomes. It now includes more than 3300 COGs from 44 genomes of bacteria,
archaea, and the yeast Saccharomyces cerevisiae, representing 30 major
phylogenetic lineages. Proteins from two eukaryotic genomes,
C. elegans and D. melanogaster, have also been assigned to individual
COGs. The COG home page lists the organisms included, the number of proteins
encoded by each genome, and the portion of those that are included in
COGs.
Three general kinds of information can be obtained using the COG
database. For functional studies, the COGs have been classified into
18 broad functional categories, including one for uncharacterized COGs.
Phylogenetic patterns show the presence or absence of proteins from
a given organism in a specific COG and, when used systematically, can
identify whether a particular metabolic pathway exists in an organism.
Multiple alignments of COG members can be used to identify conserved
sequence residues and analyze evolutionary relationships between member
proteins.
Individual COG reports contain information on the number of proteins
comprising the cluster, their inferred function, a function code from
a list of 18 general categories, the phylogenetic pattern for the COG,
the unique COG number, and a link to proteins from C. elegans and D. melanogaster
assigned to the COG. If available, the pathway or functional system
is also indicated as a functional sub-category. Clicking on the floppy
disk icon will generate a FASTA-formatted file of protein sequences
for all COG members.
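As a rough illustration of working with such a download, the sketch below reads FASTA-formatted text into a dictionary of sequences. It assumes only the general FASTA convention (a header line starting with ">" followed by sequence lines). The function name read_fasta and the sample records are hypothetical and do not reflect the actual header layout of COG files.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping.

    Assumes standard FASTA: a '>' header line, then one or more sequence lines.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Hypothetical two-protein example, not real COG data.
sample = [">protein_1 demo", "MKT", "AAL", ">protein_2", "MV"]
sequences = read_fasta(sample)
```

In practice the same function can be given an open file handle, since a file iterates line by line.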
The COG report also generates a table giving the gene names corresponding
to cluster members from each organism. Each gene name is linked to a
display of the BLAST output for its encoded protein, which includes
both graphical and textual sequence alignments between the COG member
and other protein database sequences. A Genomic Context link shows the
organization of the genomes of the organisms represented in a COG, centered
on the genes coding for the orthologous proteins that comprise the cluster.
Finally, a dendrogram, constructed from multiple sequence alignments,
displays sequence similarity relationships between the COG members.
A Phylogenetic Patterns search tool finds COGs that are shared
by any set of organisms. Organisms may be included or excluded from
the group using an input table. For closely related organisms belonging
to a single clade, pre-computed tables show shared and unique COGs.
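The include/exclude logic of such a search can be sketched in a few lines. This is only an illustration of the idea, not the tool's implementation. It assumes each COG's phylogenetic pattern is available as a set of organism codes, and the names find_cogs, COG_A, and org1 through org4 are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_cogs(
    patterns: Dict[str, Set[str]],
    include: Iterable[str],
    exclude: Iterable[str] = (),
) -> List[str]:
    """Return COG ids whose pattern has every 'include' organism and no 'exclude' organism."""
    required = set(include)
    forbidden = set(exclude)
    return sorted(
        cog
        for cog, organisms in patterns.items()
        if required <= organisms and not (forbidden & organisms)
    )


# Toy patterns: COG id -> organisms represented in that COG (hypothetical data).
patterns = {
    "COG_A": {"org1", "org2", "org3"},
    "COG_B": {"org1", "org3"},
    "COG_C": {"org2", "org3", "org4"},
}

hits = find_cogs(patterns, include=["org1", "org3"], exclude=["org2"])
# hits == ["COG_B"]
```

Organisms listed in neither set are left unconstrained, so a COG may or may not contain them.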
The COGnitor program is a companion tool that assigns new proteins to
pre-existing COGs. COGnitor takes a protein sequence as input for sequence
comparison, and suggests inclusion in a COG if there are best
hits to proteins from at least three lineages. The output shows
the COG to which the query protein is predicted to belong, a color-coded
BLAST graphic delineating the regions of similarity, and the sequence
alignments.
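The three-lineage rule can be illustrated with a small sketch. It is not COGnitor's actual code. It assumes the sequence comparison has already been run and reduced to one best hit per lineage, and it adds the assumption that the supporting best hits must fall in the same COG. The function name suggest_cog and all identifiers in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def suggest_cog(best_hits: Dict[str, Optional[str]], min_lineages: int = 3) -> Optional[str]:
    """Suggest a COG for a query protein.

    best_hits maps a lineage name to the COG id of the query's best hit in that
    lineage, or None if that best hit belongs to no COG (assumed input format).
    """
    support: Dict[str, Set[str]] = defaultdict(set)
    for lineage, cog in best_hits.items():
        if cog is not None:
            support[cog].add(lineage)
    if not support:
        return None
    # Pick the COG backed by the most lineages; ties resolve alphabetically.
    best_cog = max(sorted(support), key=lambda cog: len(support[cog]))
    return best_cog if len(support[best_cog]) >= min_lineages else None


# Hypothetical best hits for one query protein.
query_hits = {"lineage1": "COG_X", "lineage2": "COG_X", "lineage3": "COG_X", "lineage4": "COG_Y"}
assignment = suggest_cog(query_hits)
# assignment == "COG_X"
```

With hits from only two lineages in the same COG, the function returns None, meaning no COG is suggested.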
Other useful resources include:
- List of COGs, which displays all COGs in the database.
- Distribution histograms that show how many COGs contain proteins from a specific number of clades or species.
- Phylogenetic patterns table, which organizes the patterns into sets based on the presence or absence of organisms belonging to Archaea, Eukarya or Bacteria.
- Co-occurrences table, which shows the number of COGs shared by a particular pair of species or unique to one member.
- Functional categories page, summarizing the functions that have been defined, the number of COGs assigned to each category, the number of proteins or domains assigned to each category, and the number of pathways and functional systems associated with each category.
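The counts behind the distribution histograms and the co-occurrences table are simple to derive once each COG's member organisms are known. The sketch below shows one way to compute them, assuming patterns are held as sets of organism codes. The function names and the toy data are hypothetical and are not taken from the COG site.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def species_histogram(patterns: Dict[str, Set[str]]) -> Dict[int, int]:
    """Map 'number of species in a COG' to 'number of COGs with that many species'."""
    return dict(Counter(len(organisms) for organisms in patterns.values()))


def co_occurrence(patterns: Dict[str, Set[str]], a: str, b: str) -> Tuple[int, int, int]:
    """Count COGs shared by species a and b, found only in a, and found only in b."""
    shared = only_a = only_b = 0
    for organisms in patterns.values():
        in_a, in_b = a in organisms, b in organisms
        if in_a and in_b:
            shared += 1
        elif in_a:
            only_a += 1
        elif in_b:
            only_b += 1
    return shared, only_a, only_b


# Toy patterns: COG id -> organisms represented in that COG (hypothetical data).
patterns = {
    "COG_A": {"org1", "org2", "org3"},
    "COG_B": {"org1", "org3"},
    "COG_C": {"org2", "org3", "org4"},
}

histogram = species_histogram(patterns)         # {3: 2, 2: 1}
pair = co_occurrence(patterns, "org1", "org2")  # (1, 1, 1)
```

Here "unique to one member" is read as unique within the pair, so a COG counted under only_a may still contain other species.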
The COGs are also integrated with the Genome division of Entrez. From
the COG pages, proteins are linked to the Genome view and Neighbor view.
From Entrez Genome, proteins are linked to their respective COGs, and
COG data is included in several display options. For example, in the
map display of circular genomes, the radial lines corresponding to genes
are color-coded according to the functional categories used in the COG
system.
The COG service is located at www.ncbi.nlm.nih.gov/COG/.
The data is also available by FTP at ftp://ncbi.nlm.nih.gov/pub/COG.
VP
