NCBI News









KOGs and COGs Are Now Included In CDD

As part of the latest release (v1.63) of the Conserved Domain Database (CDD), the alignment sets of the KOG database [1] (clusters of euKaryotic Orthologous Groups) have been merged into CDD. The KOG database is essentially a eukaryotic version of the COG database (Clusters of Orthologous Groups), which was integrated into CDD in late 2002 (v1.60). KOGs and COGs cluster eukaryotic and prokaryotic proteins, respectively, into groups of sequences that are mutual best hits in sequence similarity searches between different species. The KOG database includes proteins from H. sapiens, D. melanogaster, C. elegans, A. thaliana, S. cerevisiae, S. pombe, and E. cuniculi.

With RPS-BLAST searches available for KOGs and COGs in CDD, users can now classify query sequences by similarity to these predetermined sets alongside the alignments from Pfam, SMART, and the curated NCBI Conserved Domains. Because CDD data are also incorporated into Entrez as the Domains database, KOGs and COGs can be found using standard Entrez queries by fields such as title, organism, or text words.

With KOGs and COGs now included, the displays of pre-computed RPS-BLAST results have been updated to reflect the different clustering schemes underlying the datasets within CDD. CDD now contains datasets that cluster proteins based on overall sequence similarity (COGs and KOGs) along with those that cluster based on the presence of defined functional domains (Pfam, SMART, curated CDs). Multiple domain proteins will therefore often have two sets of hits in CDD: hits from COGs and KOGs to large portions of the sequence, and hits from Pfam, SMART, and/or curated CDs for each functional domain. To show both sets of hits in a simple display, each CDD record is now classified as either a "single" or "multiple" domain record, and the best hits from each set are shown when the Domains link is clicked for a record in Entrez Protein. Moreover, the Conserved Domain Architecture Retrieval Tool (CDART) uses only single domain records to group protein sequences by domain architecture.
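
The mutual-best-hit criterion that underlies KOG and COG clustering can be illustrated with a short Python sketch. This is a simplified pairwise version written for this article, not the actual KOG/COG construction procedure: the function name `mutual_best_hits`, the `(species, protein)` identifiers, and the scores are all hypothetical placeholders.

```python
def mutual_best_hits(hits):
    """Return pairs of proteins that are each other's best cross-species hit.

    hits: iterable of (query, subject, score), where query and subject are
    (species, protein) tuples. All identifiers here are illustrative only.
    """
    best = {}  # (query, subject_species) -> (score, subject)
    for query, subject, score in hits:
        if query[0] == subject[0]:
            continue  # only hits between different species count
        key = (query, subject[0])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    pairs = set()
    for (query, _species), (_score, subject) in best.items():
        # Look up the subject's best hit back in the query's species.
        back = best.get((subject, query[0]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy input with made-up similarity scores.
A, D = ("sp1", "a1"), ("sp1", "a2")
B, C = ("sp2", "b1"), ("sp2", "b2")
toy_hits = [
    (A, B, 200), (A, C, 90),
    (B, A, 198),
    (C, A, 85), (C, D, 120),
    (D, B, 60), (D, C, 50),
]
result = mutual_best_hits(toy_hits)
```

In this toy input only `a1` and `b1` select each other as best hits, so they form the single mutual pair. `b2` prefers `a2`, but `a2` prefers `b1`, so that pair is not mutual.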


Figure 1. Graphical overview of Conserved Domain Search results for human SRC protein, RefSeq accession NP_005408, showing hits to KOG0197 and a Pfam-based conserved domain for tyrosine kinases, as well as hits to SH2 and SH3 domains.

In the example shown above for NP_005408, the human SRC protein, hits are shown to both the multiple domain record KOG0197 (tyrosine kinases) and the single domain records pfam00018 (SH3), pfam00017 (SH2), and cd00192 (TyrKc, tyrosine kinase catalytic domain).
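
The split between "multiple" and "single" domain records in this example can be mimicked with a small Python sketch. It assumes, purely for illustration, that the record type can be inferred from the accession prefix (KOG/COG versus pfam/smart/cd). In CDD the classification is a property of each record, and the `smart` prefix and the function names below are assumptions rather than anything stated in this article.

```python
# Assumed prefix-to-type mapping; CDD itself stores the type per record.
MULTI_PREFIXES = ("KOG", "COG")
SINGLE_PREFIXES = ("pfam", "smart", "cd")


def classify_hit(accession):
    """Guess whether an accession is a 'multiple' or 'single' domain record."""
    if accession.startswith(MULTI_PREFIXES):
        return "multiple"
    if accession.startswith(SINGLE_PREFIXES):
        return "single"
    return "unknown"


def group_hits(accessions):
    """Group hit accessions by inferred record type, keeping input order."""
    groups = {"multiple": [], "single": [], "unknown": []}
    for accession in accessions:
        groups[classify_hit(accession)].append(accession)
    return groups


# Hit accessions reported for NP_005408 in the example above.
src_hits = ["KOG0197", "pfam00018", "pfam00017", "cd00192"]
grouped = group_hits(src_hits)
```

For the SRC hits, `grouped["multiple"]` holds KOG0197 and `grouped["single"]` holds the SH3, SH2, and TyrKc records, matching the two sets of hits described for the display.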

[1] Tatusov RL, Fedorova ND, Jackson JJ, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BioMed Central Bioinformatics. 2003 Sep 11 [Epub ahead of print]. PMID: 12969510

—ES

 




NCBI News | Fall/Winter 2002 NCBI News: Spring 2003