NOTE: Links within this document are mostly internal, except for links to references or the PSI-BLAST web page. Thus, clicking on a "List of COGs" link will not send you to that list; rather, it sends you to the relevant section of the help document.
Introduction to COGs Using COGs Selecting COGs COG names Protein names Terminology/Glossary
Information Tools
Phylogenetic patternsInput page Output page Output page (user-provided query) Output page (COG database-provided query)
| What are COGs? |
| How are COGs created? |
| Where can I get more information? |
| What kind of information can be obtained using the COG database? |
| How do I find a particular protein in the COG database? |
| How can a particular set of COGs be selected? |
| Are there ways to combine criteria to select a subset of COGs? |
| What should I know about COG names? |
| What do the various abbreviations in COG names stand for? |
| What should I know about protein names? |
| What is the significance of an underscore and a number appended to a protein name? |
| How were genes named with respect to the species of origin? |
| What terminology will I need to know to use these pages effectively? |
What are COGs?
"COG" stands for Cluster of Orthologous Groups of proteins. The proteins that comprise each COG are assumed to have evolved from an ancestral protein, and are therefore either orthologs or paralogs. Orthologs are proteins from different species that evolved by vertical descent (speciation), and typically retain the same function as the original. Paralogs are proteins from within a given species that are derived from gene duplication, and may evolve new functions that are related to the original. See references for more information.
How are COGs created?
COGs were identified using an all-against-all sequence comparison of the proteins encoded in completely sequenced genomes. For a protein from a given genome, this comparison reveals the one protein from each of the other genomes to which it is most similar (hence the need to use complete genomes [1] to define COGs). Each of these proteins is considered in turn. If a reciprocal best-hit relationship between these proteins (or a subset of them) is revealed, then those that are reciprocal best hits form a COG [2]. Thus, a member of a COG is more similar to other members of the COG than to any other protein from the compared genomes, even if the absolute similarity is low. The use of the best-hit rule, without the constraint of an arbitrarily chosen statistical cut-off, therefore accommodates both slow- and fast-evolving proteins. However, one constraint that was imposed is that a COG must include proteins from at least three phylogenetically distant genomes.
[1] Applies only to forming COGs, not to obtaining information about new proteins.
[2] For simplicity, several steps were omitted here. See references for details.
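The reciprocal best-hit idea can be illustrated with a minimal Python sketch. It is not the procedure used to build the database (several steps are omitted, as noted above); it only finds trios of proteins from three different genomes in which each member is the best hit of the other two. The "genome:protein" naming scheme, the score dictionary, and the function names are assumptions made for this illustration.

```python
from itertools import combinations

def genome_of(protein):
    """Return the genome prefix of a name such as "A:x" (hypothetical naming)."""
    return protein.split(":", 1)[0]

def best_hits(scores):
    """Map (query protein, other genome) -> most similar protein in that genome."""
    best = {}
    for (query, subject), score in scores.items():
        target = genome_of(subject)
        if target == genome_of(query):
            continue  # only hits in other genomes are considered in this sketch
        key = (query, target)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}

def reciprocal_triangles(scores):
    """List trios from three genomes whose members are all mutual best hits."""
    hits = best_hits(scores)
    proteins = sorted({p for pair in scores for p in pair})
    triangles = []
    for trio in combinations(proteins, 3):
        if len({genome_of(p) for p in trio}) < 3:
            continue  # the three proteins must come from three genomes
        if all(hits.get((x, genome_of(y))) == y
               for x in trio for y in trio if x != y):
            triangles.append(trio)
    return triangles

# Invented similarity scores: A:x, B:x and C:x are mutual best hits,
# while A:y only has a weak, non-reciprocal hit to B:x.
scores = {
    ("A:x", "B:x"): 90, ("B:x", "A:x"): 90,
    ("A:x", "C:x"): 80, ("C:x", "A:x"): 80,
    ("B:x", "C:x"): 85, ("C:x", "B:x"): 85,
    ("A:y", "B:x"): 40, ("B:x", "A:y"): 40,
}
print(reciprocal_triangles(scores))
```

Because no score threshold is applied, a weakly similar trio is still reported as long as the best-hit relationships are reciprocal, which mirrors the point made above about slow- and fast-evolving proteins.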
Where can I get more information?
The following references provide additional and more detailed information. You may also send questions to info@ncbi.nlm.nih.gov.
Tatusov et al. (1997). A genomic perspective on protein families. Science 278: 631-637.
Koonin et al. (1998). Beyond complete genomes: from sequence to structure and function. Curr. Opin. Struct. Biol. 8: 355-363.
Galperin et al. (1999). Comparing microbial genomes: How the gene set determines the lifestyle. In Organization of the Prokaryotic Genome, R.L. Charlebois, Ed. (American Society of Microbiology, Washington, DC) pp. 91-108.
Tatusov et al. (2000). The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28: 33-6.
What kind of information can be obtained using the COG database?
Briefly, there are three general kinds of information:
1. Annotation of proteins. Known functions (and two- or three-dimensional structures) of one COG member can often be directly attributed to the other members of the COG. Caution must be used here, however, since some COGs contain paralogs whose function may not precisely correspond to that of the known protein.
2. Phylogenetic patterns. These show the presence or absence of proteins from a given organism in a specific COG. Used systematically, such patterns can be used to identify whether a particular metabolic pathway exists in an organism.
3. Multiple alignments. Each COG page includes a link to a multiple alignment of COG members, which can be used to identify conserved sequence residues and analyze evolutionary relationships between member proteins.
How do I find a particular protein in the COG database?
There are two ways to do this. The first makes use of the protein/gene name search feature found on the COG home page and the "List of COGs" page. However, this method is not foolproof, since some genes may be known by alternative names. A more robust method is to paste the appropriate sequence into COGnitor.
How can a particular set of COGs be selected?
Subsets of COGs can be selected based on function code, phylogenetic pattern, or number of represented organisms. Function code refers to the major cellular process(es) to which a COG is relevant, while phylogenetic pattern refers to the organisms that are included in the COG. Number of represented organisms refers to the number of clades or species that contribute at least one protein to the COG. Selection of COGs based on the number of clades or species can be done from the COG distribution page.
Are there ways to combine criteria to select a subset of COGs?
The following criteria can be combined. The last two ("Species" and "Group species") are not available on all pages (one will take the place of the other). Note that these selections must be defined simultaneously to narrow down the list of COGs.
Function
Selects COGs that belong to a particular functional category. Multiple categories can be selected, which will return a boolean "or" result (for example, entering "CG" will return COGs that are either in the "C" category or the "G" category, not necessarily only those that belong to both). Input must be specified using uppercase letters.
Text
Selects COGs based on the entered text string. For example, any COG whose name contains "membrane" can be selected. Be aware that this will not guarantee that all membrane COGs will be selected, only those that contain this word in the name. Input is case-insensitive, and will search only the COG number and COG name fields of the list.
Species
Selects COGs that have a phylogenetic pattern that fits the species selected. Species that must be present in the COG are entered as lowercase letters (in arbitrary order), while those that must be absent are listed after a dash (-). Any species for which no preference is indicated can either be present or absent in the COGs. You can also select COGs that contain one or more of the species indicated in a group. To indicate that a set of species be considered as a group, enter the appropriate letters in parentheses (before the dash, if any). Multiple groups are allowed. As an example, entering "(amtk)gpolinx-y" will select those COGs that contain at least one archaeal member and all the small parasites, but not yeast. These include:
amt--qvc-br-ujgpolinx
amtk-qvcebrhujgpolinx
--t--qvcebrhujgpolinx
-mtk-q-c-br-ujgpolinx
amt--qvcebrhujgpolinx
-m-k-qvcebrhujgpolinx
Group species
Considers only COGs that have at least one member from each of the group(s) specified. Input consists of lowercase letters indicating the species to be grouped. Multiple groups are allowed, and are separated by non-alphabetic characters, such as spaces or plus signs (+).
In viewing the output after grouping, note that only the first letter listed in each group will appear, but will nonetheless represent the entire group. For example, the output for the group "amtk" will only show the "a" and not the others. This tool is not available on all pages - it may be substituted by the "Species" tool (described above).
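The "Species" input syntax described above (required letters, parenthesized groups, and excluded letters after a dash) can be sketched as a small Python matcher. This is only an illustration of the rules as stated in this help text, not the code behind the web form; the function names are hypothetical.

```python
import re

def parse_species_query(query):
    """Split a query such as "(amtk)gpolinx-y" into its three parts."""
    required_part, _, absent_part = query.partition("-")
    # Parenthesized groups: at least one member of each must be present.
    groups = [set(g) for g in re.findall(r"\(([a-z]+)\)", required_part)]
    # Letters outside parentheses: every one must be present.
    required = set(re.sub(r"\([a-z]+\)", "", required_part))
    # Letters after the dash: none may be present.
    return required, groups, set(absent_part)

def matches(pattern, query):
    """Return True if a phylogenetic pattern fits the species query."""
    present = set(pattern) - {"-"}
    required, groups, absent = parse_species_query(query)
    return (required <= present
            and all(group & present for group in groups)
            and not (absent & present))

query = "(amtk)gpolinx-y"
for pattern in ["amt--qvc-br-ujgpolinx", "--t--qvcebrhujgpolinx",
                "amtkyqvcebrhujgpolinx", "-----qvcebrhujgpolinx"]:
    print(pattern, matches(pattern, query))
```

The first two patterns fit the query. The third is rejected because it contains yeast ("y"), and the fourth because it has no member of the "amtk" group.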
What should I know about COG names?
Many COGs include at least one protein whose function is known, and the other members are presumed to have an identical function. These are named accordingly. However, some COGs consist of homologous proteins that perform different functions, or perform some function that is not precisely known. In the case of the former, the COG may have a "compound name" (separated by a slash) to indicate the known functions of some of the members. In the case of the latter, there exist three subclasses of names. "Predicted" signifies that the proteins contain a recognizable primary sequence or structural motif of known function, such as an ATP-binding motif. "Putative" signifies that a protein has a function that might be inferred by other means, such as by its presence in an operon of known function. "Uncharacterized" signifies that a function for the member proteins cannot currently be inferred. Note that COG names are not necessarily unique.
What do the various abbreviations in COG names stand for?
Highlighted entries are linked to a PubMed primary reference or review that describes the domain, family, or superfamily.
| AAA, AAA+ | ATPases associated with a variety of cellular activities | |
| ABC | ATP-binding cassette (an ATP-dependent family of transporters) | |
| ACR | Ancient conserved region (common to two or more main phylogenetic branches - archaea, bacteria, eukarya) | |
| ACT | Aspartokinase, Chorismate mutase, TyrA (predicted ligand-binding domain) | |
| AP | Alkaline phosphatase (metalloenzyme superfamily) | |
| ArCR | Archaeal conserved region | |
| BCR | Bacterial conserved region | |
| BMP | Basic membrane protein | |
| BRCT | BRCA1 C-terminus (domain common to cell cycle control proteins) | |
| CBS | Cystathionine beta-synthase (prototype for a family of repeats) | |
| DAHP | 3-deoxy-D-arabino-heptulosonate 7-phosphate | |
| DHH | Asp-his-his (conserved motif of a phosphoesterase superfamily) | |
| DMSO | Dimethylsulfoxide | |
| EAL | Glu-ala-lys (conserved motif of unknown function) | |
| EMAP | Endothelial-monocyte-activating polypeptide II (putative tRNA binding domain) | |
| FAD | Flavin adenine dinucleotide | |
| FHA | Forkhead associated (putative nuclear signalling domain) | |
| FKBP | FK506 binding protein | |
| FMN | Flavin mononucleotide | |
| GAD | GatB-AaRS-for D (asp) (domain possibly involved in tRNA recognition) | |
| GAF | cGMP-specific and -stimulated phosphodiesterases/adenylate cyclases (Anabaena)/FhlA (E. coli) (putative signalling domain) | |
| GGDEF | Gly-gly-asp-glu-phe (conserved motif of unknown function) | |
| HAD | Haloacid dehalogenase (superfamily of hydrolases) | |
| HAMP | Histidine kinases, adenylyl cyclases, methyl-accepting proteins, phosphatases | |
| HD | His-asp (predicted catalytic residues of the HD superfamily hydrolases) | |
| HD-GYP | His-asp, gly-tyr-pro (characteristic sequence signatures of a probable signal transduction domain) | |
| HHH | Helix-Hairpin-Helix (a non-sequence-specific DNA binding structure) | |
| HIT | Histidine triad motif (H-X-H-X-H) | |
| HSP | Heat-shock protein | |
| HTH | Helix-turn-helix (a DNA binding structure) | |
| KDO | 2-Keto-3-deoxyoctulosonic acid | |
| KH | hnRNP K homology (RNA binding domain) | |
| LPS | Lipopolysaccharide | |
| MCP | Methyl-accepting chemotaxis protein | |
| MFS | Major facilitator superfamily (transporters) | |
| NaMN:DMB | Nicotinic acid mononucleotide:5,6-dimethylbenzimidazole | |
| NRAMP | Natural resistance-associated macrophage protein | |
| ORF | Open reading frame | |
| PAC | PAS C-terminal motif (found C-terminal to a PAS domain) | |
| PAS | Per-Arnt-Sim (period/aryl hydrocarbon receptor nuclear translocator/single-minded) (putative signalling domain) | |
| PDZ | PSD-95, Dlg, ZO-1 (protein-protein interaction domain) | |
| PHP | Polymerase and histidinol phosphatase superfamily (predicted role in phosphoester bond hydrolysis) | |
| PIN | PilT N-terminus (putative signalling domain) | |
| PKD | Polycystic kidney disease immunoglobulin-like repeats (putative ligand binding domain) | |
| PLP | Pyridoxal 5'-phosphate | |
| PP | Pyrophosphatase (the PP-loop is predicted to bind the phosphate moiety of ATP) | |
| PTS | Phosphotransferase system (sugar transport and phosphorylation system) | |
| PUA | Pseudouridine synthetase, archaeosine transglycosylase (RNA binding domain) | |
| RND | Resistance/nodulation/cell division | |
| RRM | RNA recognition motif | |
| RTX | Repeats-in-toxin | |
| SAM | S-adenosylmethionine | |
| SET | Su(var)3-9, Enhancer-of-zeste, Trithorax | |
| SH3 | Src homology 3 (signalling domain) | |
| SNF | Sodium:neurotransmitter symporter (superfamily) | |
| SOS | (DNA damage response system) | |
| STAS | Sulfate transporters and antisigma-factor antagonists (predicted NTP-binding domain) | |
| THUMP | Thiouridine synthases, methylases and pseudouridine synthases (predicted RNA-binding domain) | |
| TIM | (Triosephosphate isomerase; TIM-barrel is a common 3D protein fold) | |
| TMAO | Trimethylamine N-oxide | |
| Toprim | Topoisomerase-primase (a domain with a predicted compact beta/alpha fold) | |
| TPR | Tetratricopeptide repeat (mediates protein-protein interactions) | |
| URI | UvRC and intron-encoded endonucleases (predicted catalytic domain of nucleases) |
What should I know about protein names?
Proteins in the COG database are known by their gene names, not the full protein names (e.g. "recA" instead of "recombinase"). For most species, the gene name used is based on a consecutive numbering of the open reading frames encoded by the genome. Thus, it is possible that a protein of interest may actually be known by an alternative gene name (e.g. recA from H. influenzae is "HI0600").
What is the significance of an underscore and a number appended to a protein name?
These are proteins that are fragments of a whole, divided into sections that correspond to major domains, and consecutively numbered. Doing so allows the separate domains to be assigned to different COGs. A protein may also be divided if it contains a very long region of low complexity (which will cause spurious hits).
How were genes named with respect to the species of origin?
In most cases, the gene name incorporates the initials of the species, with the following exceptions:
| E. coli | Uses a common name of the form xxxX or "ec" followed by the gi number |
| B. subtilis | Uses "BS_" followed by a common name of the form xxxX |
| H. pylori J99 | Uses "jhp" followed by consecutive numbering (the "j" indicates that the sequenced strain is J99) |
| M. tuberculosis | Uses "Rv" followed by consecutive numbering, with a "c" appended if encoded on the complementary strand ("Rv" refers to the sequenced strain) |
| S. cerevisiae | Uses "Y" followed by a chromosome and arm indicator and consecutive numbering, with a "w" or "c" appended to indicate the coding strand |
What terminology will I need to know to use these pages effectively?
Clade
A distinct phylogenetic lineage. A "clade" is composed of closely related species. Currently, the COG database consists of the following clades:
Mycoplasma genitalium and Mycoplasma pneumoniae (letter P)
Pyrococcus horikoshii and Pyrococcus abyssi (letter K)
Helicobacter pylori 26695 and Helicobacter pylori J99 (letter U)
Chlamydia trachomatis and Chlamydia pneumoniae (letter I)
Ortholog
Proteins from different species that evolved by vertical descent (speciation). These typically retain the same function as the original.
Paralog
Proteins encoded within a given species that arose from one or more gene duplication events. Paralogs may evolve new functions that are related to the original.
Phylogenetic Classification of Proteins Encoded in Complete Genomes NOTE: This page is in the process of being updated. Home page
Help document contents | Look HERE. COG general topics Information
Tools
Information
What does the "Code/Name/Proteins" table show?
The "Code" is a single-letter shorthand used to represent the organism given in Name. This letter will be important for using and interpreting phylogenetic patterns. The "Name" is hyperlinked to the Entrez Genomes page for that organism. "Proteins" gives the number of proteins encoded by the complete genome of the organism, and the portion of those that are included in COGs.
What does the "List of COGs" show?
This link will display the list of all COGs currently in the database. The list includes the number of proteins in each COG (preceded by the number of additional proteins in the COG), a phylogenetic pattern indicating the organisms contributing to each COG, a protein identifier, and the function code, unique identifying number, and name of each COG. See the List of COGs section of help for more information.
What does the "Distribution" show?
These histograms show how many COGs contain proteins from a certain number of clades or species. Clicking on the frequency above each bar will show the subset of COGs that contain proteins from the specified number of clades or species.
What does "Phylogenetic patterns" show?
A phylogenetic pattern is a series of lowercase letters and/or dashes that is a shorthand representation of the presence or absence of proteins from a particular organism in the COG of interest. The link sends you to a table that shows all the phylogenetic patterns currently represented in the COG database. The patterns are organized into sets based on the presence or absence of organisms belonging to Archaea, Eukarya, or Bacteria (where "AEB" indicates all branches are present, "A" is only archaea, etc.). Within each set, the different patterns are listed, along with the number of COGs represented by each pattern. Clicking on a particular pattern will list those COGs. The red numbers at the top of each set indicate the number of unique patterns in the set and the total number of COGs in the set, respectively. See the phylogenetic patterns section of help for more information on patterns, and the Selecting COGs section for information on the "Group species" tool.
What does "Co-occurrences" show?
This table reveals the number of COGs with respect to a particular pair of species (indicated along the top and side using the same one-letter species code as is used for patterns). Except for the cells along the diagonal, which show simply the number of COGs containing the relevant organism, each cell consists of three numbers. The central number represents the number of COGs containing both organisms. The top number represents the number of COGs that contain the organism in that row, but do not contain the organism in that column. The bottom number represents the number of COGs that do not contain the row species, but do contain the column species. Clicking on a particular number will list the relevant COGs. You can also obtain numbers that consider only COGs that fit specified criteria using the Select tool.
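The three numbers in a co-occurrence cell can be derived directly from phylogenetic patterns. The following Python sketch assumes that each COG is represented only by its pattern string; the function name is hypothetical, and the six patterns are the examples listed in the Selecting COGs section.

```python
def co_occurrence(patterns, row, col):
    """Return (top, central, bottom) for the cell of species `row` and `col`.

    top     = COGs containing the row species but not the column species
    central = COGs containing both species
    bottom  = COGs containing the column species but not the row species
    """
    top = central = bottom = 0
    for pattern in patterns:
        has_row, has_col = row in pattern, col in pattern
        if has_row and has_col:
            central += 1
        elif has_row:
            top += 1
        elif has_col:
            bottom += 1
    return top, central, bottom

patterns = [
    "amt--qvc-br-ujgpolinx",
    "amtk-qvcebrhujgpolinx",
    "--t--qvcebrhujgpolinx",
    "-mtk-q-c-br-ujgpolinx",
    "amt--qvcebrhujgpolinx",
    "-m-k-qvcebrhujgpolinx",
]
print(co_occurrence(patterns, "a", "k"))  # (2, 1, 2)
```

COGs that contain neither species (the third pattern here) do not contribute to any of the three numbers.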
What are "Functional categories" of COGs?
Each COG consists of proteins that likely share a common function or domain, which in turn has a role in a given cellular process (or processes). The letters (function codes) in the table below represent these major cellular processes. All COGs have been assigned to a category (note that certain letters are reserved for COGs whose cellular process is either unknown or not well understood). In addition to the information given below, the "Functional categories" page indicates the number of COGs assigned to each group, the number of proteins or domains (since some proteins were divided) assigned to each group, and the number of pathways and functional systems, if any, that are part of each group.
Information storage and processing
| J | Translation, ribosomal structure and biogenesis |
| K | Transcription |
| L | DNA replication, recombination, and repair |
Cellular processes
| D | Cell division and chromosome partitioning |
| M | Cell envelope biogenesis, outer membrane |
| N | Cell motility and secretion |
| O | Posttranslational modification, protein turnover, chaperones |
| P | Inorganic ion transport and metabolism |
| T | Signal transduction mechanisms |
Metabolism
| C | Energy production and conversion |
| E | Amino acid transport and metabolism |
| F | Nucleotide transport and metabolism |
| G | Carbohydrate transport and metabolism |
| H | Coenzyme transport and metabolism |
| I | Lipid metabolism |
Poorly characterized proteins
| R | General function prediction only |
| S | Function unknown |
The letters in this table represent each functional category. Clicking on a particular letter will list the subset of COGs that belong to that category.
What does "Pathways and functional systems" show?
This page displays metabolic pathways and major functional systems, each linked to the subset of COGs that contribute to it. The systems listed here are specific and well-characterized. Each COG that is part of a particular pathway will also be assigned to at least one of the more general Functional categories. However, not all COGs are assigned to a pathway or functional system.
Tools
What is "COGnitor" and where does this link send me?
COGnitor is a program that is used to assign new proteins to COGs. COGnitor takes a protein sequence as input, and compares it to the protein database underlying the COGs to identify the COG, if any, to which a query protein belongs. Inclusion in a COG is suggested if there are best hits to proteins from at least three lineages. The link takes you to the COGnitor input page.
What does the "Phylogenetic pattern search" tool do?
This tool provides a means for finding COGs that contain or exclude a selected organism. To find all COGs that contain or exclude a particular organism, simply indicate the desired choice for each listed species and submit the query. To make a selection for the entire column, click the appropriate choice at the top. The choices are:
| dc | The COG may or may not contain this organism |
| Yes | The COG must contain this organism |
| No | The COG must not contain this organism |
The list that results will be the subset of COGs that fits the pattern indicated.
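The three choices above amount to a per-species filter on phylogenetic patterns. A minimal Python sketch of that filter follows; it assumes the choices arrive as a mapping from species letter to "Yes", "No" or "dc", which is an illustrative representation rather than the actual form handling.

```python
def fits_choices(pattern, choices):
    """Return True if a phylogenetic pattern satisfies every Yes/No choice.

    `choices` maps a one-letter species code to "Yes", "No" or "dc";
    species that are not listed are treated like "dc".
    """
    for letter, choice in choices.items():
        present = letter in pattern
        if choice == "Yes" and not present:
            return False
        if choice == "No" and present:
            return False
    return True

choices = {"a": "Yes", "y": "No", "e": "dc"}
print(fits_choices("amt--qvc-br-ujgpolinx", choices))
print(fits_choices("--t--qvcebrhujgpolinx", choices))
```

The first pattern fits because it contains "a" and lacks "y"; the second does not, because "a" is absent.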
What does the "Protein/Gene name" tool search for?
This tool will perform a case-insensitive search for the COG that contains the query gene, and display that COG's page of information (unless the protein has been divided into domains, in which case a list of the relevant COGs will be displayed). If a particular protein is not currently assigned to a COG, or if the query term used is an alternative to the one used in the database, then a "Not found" message is displayed. NOTE: although COG members are grouped based on protein sequence, for brevity the names used are gene names. Thus, a search for "DNA gyrase" will fail, but a search for "gyrA" will not. Also, B. subtilis is a special case - to distinguish these gene names from those of E. coli, a "BS_" has been prepended to each name. Thus, to find the B. subtilis gyrA, enter "BS_gyrA."
What does the "Text search" tool search for?
This tool will perform a case-insensitive search for text found in the COG number, COG name, and function code fields on the "List of COGs" page.
Each line of this page shows the number of proteins in the COG, a series of lowercase letters and dashes, a protein identifier, uppercase letters, a COG number, and a COG name.
What do the first two columns represent?
The first column shows the number of additional proteins from eukaryotes assigned to the COG. Only numbers from 1 to 9 are shown. A dash indicates that there are no additional proteins, while an asterisk indicates that there are 10 or more. Clicking on the link (number or asterisk) shows the list of such proteins. The second column represents the total number of proteins in the COG, excluding those indicated by the first column. Clicking on the number will show a multiple sequence alignment of these proteins.
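The display rule for the first column can be written as a short Python function. This is only a restatement of the rule above; the function name is hypothetical.

```python
def additional_proteins_marker(count):
    """Return the first-column symbol for a number of additional proteins."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if count == 0:
        return "-"   # no additional proteins
    if count >= 10:
        return "*"   # 10 or more additional proteins
    return str(count)  # 1 to 9 are shown as-is

print([additional_proteins_marker(n) for n in (0, 3, 9, 10, 42)])
```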
What is the series of lowercase letters and dashes?
This series is the phylogenetic pattern, which reveals the organisms that have contributed proteins to the COG. Each organism is represented by a single letter that occupies a unique and fixed position in the pattern. A dash at a given position indicates the absence of that organism in the COG. Clicking the pattern will reveal all the COGs with that same pattern. See the phylogenetic patterns section of help for more information.
What is a protein identifier? Why don't all COGs have one?
The protein identifier is a commonly used and meaningful name derived from a current or future COG member. In some cases, it helps to identify the subunit name for COGs that represent one component of a multisubunit complex. It may also help to distinguish between different COGs with identical names. A number of COGs lack such identifiers. This indicates merely that no member has been given a meaningful name.
What do the uppercase letter(s) preceding the COG number represent?
These represent the functional category to which the COG belongs. Clicking on a letter selects the subset of COGs belonging to that category.
These numbers are unique COG identifiers (COG names are not necessarily unique). Clicking on the COG number will send you to that COG's page.
What does clicking on "Select" do?
Clicking on "Select" will send you to an area where you can narrow down the list based on specified criteria. These criteria, described in the Selecting COGs section, can be combined.
| What sequence formats are accepted? |
| What does the "BeTs to clades" button do? |
| What does "Skip low-complexity filtering" do? |
The contents of the output page will vary depending on whether a user-provided or COG database-provided protein was used as the query. These are treated separately below.
| What can I expect to see in the output? |
| What do the various numbers mean in the COGnitor output? |
| What do the various colors in the COGnitor BLAST graphic mean? |
| Where do the links at the top of the output page send me? |
| What do the red equal signs and blue underscores mean? |
What sequence formats are accepted?
You may use the FASTA format, or a flat file format (including numbers and spaces, which are ignored). Upper or lowercase is acceptable. All non-alphabetic characters are ignored, with two exceptions. A dash (-) can be used to indicate an unknown amino acid (similar to using an X). A semicolon (;) can be used to comment out the rest of the line (this is useful if you want to do a search with only part of the sequence, or create a small gap).
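The input rules above can be sketched as a small Python normalizer. This is an illustration, not the COGnitor parser: skipping FASTA description lines that begin with ">", converting to uppercase, and writing a dash as "X" are assumptions made for the sketch.

```python
def clean_sequence(text):
    """Reduce pasted input to a plain string of residue letters."""
    residues = []
    for line in text.splitlines():
        if line.startswith(">"):
            continue  # FASTA description line (assumed to be skipped)
        line = line.split(";", 1)[0]  # a semicolon comments out the rest
        for ch in line:
            if ch == "-":
                residues.append("X")  # dash marks an unknown amino acid
            elif ch.isalpha():
                residues.append(ch.upper())
            # digits, spaces and other characters are ignored
    return "".join(residues)

example = ">query\n  1 mkt-l 10\nAAG ; ignored WWW\nvv"
print(clean_sequence(example))  # MKTXLAAGVV
```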
What does the "BeTs to clades" button do?
This button allows you to change the stringency of the search, to insist that any COG to which the query protein is assigned must be composed of at least the indicated number of clades. The default is three, which is the number used to define a minimal COG.
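One plausible reading of this threshold, combined with the statement elsewhere in this document that inclusion in a COG is suggested when there are best hits to proteins from at least three lineages, can be sketched in Python as follows. The exact rule used by COGnitor is not given here, so the function, its input format, and the COG identifiers are all hypothetical.

```python
from collections import Counter

def assign_cog(best_hit_cogs, min_clades=3):
    """Suggest a COG if enough clades have their best hit in the same COG.

    `best_hit_cogs` maps a clade code to the COG of the query's best hit
    in that clade, or to None if that best hit is not in any COG.
    """
    counts = Counter(cog for cog in best_hit_cogs.values() if cog is not None)
    if not counts:
        return None
    cog, supporting_clades = counts.most_common(1)[0]
    return cog if supporting_clades >= min_clades else None

hits = {"e": "COG0001", "h": "COG0001", "c": "COG0001", "u": "COG0002", "y": None}
print(assign_cog(hits))                # default threshold of three clades
print(assign_cog(hits, min_clades=4))  # stricter threshold
```

With the default of three, the query is assigned to the COG supported by clades "e", "h" and "c"; raising the threshold to four leaves it unassigned.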
What does "Skip low-complexity filtering" do?
By default, regions of low compositional complexity in the query sequence (such as runs of a single amino acid) are masked (by substituting an "X" at the appropriate positions) prior to comparison to the COG database. This increases the likelihood that the results are biologically relevant. However, filtering poses a problem for some short sequences that have low complexity regions. Such sequences may show no hits to the database at all, even if the remaining (non-masked) sequence would have yielded biologically relevant data. In such cases, it may be desirable to skip filtering. Note, however, that any alignments and predictions made using this option must be examined carefully.
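To make the masking step concrete, the following Python sketch replaces runs of a single amino acid with "X". It covers only the simplest case mentioned above; the filter actually used by COGnitor is not specified in this document, and the run-length threshold of five is an arbitrary assumption.

```python
import re

def mask_runs(sequence, min_run=5):
    """Replace every run of `min_run` or more identical residues with X."""
    # (.) captures one residue; \1{n,} requires at least n more copies of it.
    run = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return run.sub(lambda m: "X" * len(m.group(0)), sequence)

print(mask_runs("MKQQQQQQLA"))  # MKXXXXXXLA
print(mask_runs("MKQQLA"))      # MKQQLA (run too short to mask)
```

A short query that consists mostly of such runs would be almost entirely masked, which is the situation in which skipping the filter may be worth trying.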
What can I expect to see in the output?
The output of a COGnitor search will display information about the COG to which the query protein is predicted to belong, a color-coded BLAST graphic depicting the regions of similarity between the query protein and the subject, and the corresponding sequence alignments. The COG information will include the function class letter and name of the COG, and a unique COG number hyperlinked to that COG's page. Certain components of the output page will depend on whether an anonymous (user-provided) protein was used as the query or whether the query was a protein already part of the COG database. In the latter case, links to other pages or programs are provided.
What do the various numbers mean in the COGnitor output?
The number next to the graphical alignment is the raw alignment score. Clicking on this number sends you to the sequence alignment between the query sequence (named in the top left corner of the page) and the subject sequence. The number in parentheses after the protein name is the number of amino acids in the subject protein.
What do the various colors in the COGnitor BLAST graphic mean?
Each color represents the organism from which the protein comes, as listed below. Note that some colors correspond to a group of organisms.
Archaeoglobus fulgidus
Methanococcus jannaschii
Methanobacterium thermoautotrophicum
Pyrococcus horikoshii
Saccharomyces cerevisiae
Aquifex aeolicus
Thermotoga maritima
Synechocystis sp. PCC6803
Escherichia coli
Haemophilus influenzae
Rickettsia prowazekii
Helicobacter pylori 26695, Helicobacter pylori J99
Bacillus subtilis
Mycobacterium tuberculosis
Mycoplasma genitalium, Mycoplasma pneumoniae
Borrelia burgdorferi, Treponema pallidum
Chlamydia trachomatis, Chlamydia pneumoniae
This message appears if the query protein is not predicted to belong to any of the currently-defined COGs, or if the protein is not predicted to belong to a COG composed of the minimum number of clades indicated.
What do the red equal signs and black dashes mean?
A red equal sign (=) means that the indicated protein is in the COG to which the query protein is predicted to belong. A black dash (-) means that the protein is in a COG different from the predicted COG. If neither symbol appears, then the protein is not currently assigned to a COG. Where appropriate, the COG to which each protein belongs is indicated.
What do the "greater than" signs indicate?
These symbols are used to indicate the protein from each clade to which the query protein is most related. Thus, only the most-similar protein from any given clade (that is, the one from each clade that appears highest on the list in the BLAST graphic) will be labeled in this manner.
What does it mean when the query protein seems divided into parts according to the BLAST graphic?
In some cases, one set of subject proteins (that may belong to a given COG, as indicated by red equal signs) may align with only part of the query protein, while a different set of proteins (indicated by black dashes) may align with a different part. This will occur if the query protein is a multifunctional protein, or contains more than one defined domain that is represented by different COGs. An example using the well-known multifunctional enzyme DNA polymerase I is given below.
3062 =>polA_2 (640)
1871 = HIN0273_2 (648)
1330 - polA_1 (288)
1034 =>slr0707_2 (683)
891 - HIN0273_1 (282)
671 =>HP1470_2 (606)
347 - slr0707_1 (303)
278 ->MP459 (291)
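Lines like those in the example above can be broken into their parts with a regular expression. The Python sketch below assumes the plain-text layout shown here (score, optional "=" or "-" symbol, optional ">" sign, protein name, length in parentheses); the real output page is HTML, so the function and field names are illustrative only.

```python
import re

LINE = re.compile(r"^\s*(\d+)\s+([=-]?)(>?)\s*(\S+)\s+\((\d+)\)\s*$")

def parse_hit(line):
    """Parse one hit line into a dictionary, or return None if it does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    score, symbol, best, name, length = m.groups()
    return {
        "score": int(score),          # raw alignment score
        "same_cog": symbol == "=",    # in the predicted COG
        "other_cog": symbol == "-",   # in a different COG
        "best_in_clade": best == ">", # most similar protein from its clade
        "name": name,
        "length": int(length),        # amino acids in the subject protein
    }

for text in ["3062 =>polA_2 (640)", "1330 - polA_1 (288)", "278 ->MP459 (291)"]:
    print(parse_hit(text))
```

In this example, the "_2" fragments carry the equal sign and the "_1" fragments carry the dash, which reflects the two parts of the query aligning with members of different COGs.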
Why would I get a "No hits found" message in the output?
Usually this message occurs if a short, low-complexity sequence is used as the query. Try skipping low-complexity filtering.
Where do the links at the top of the output page send me?
| Genbank | The GenPept entry for the query protein named in the upper left corner. |
| Genome | The Entrez GENOMES sequence summary page for that protein, which shows the position of the protein in the genome. This page also provides links to perform other analyses. |
| PSI-BLAST | Begins a PSI-BLAST search using the sequence of the query protein. Default parameters are used, except that the number of descriptions and alignments are each set to 100 (see PSI-BLAST web page for default settings). |
Note regarding divided proteins: These links will use the undivided version as the query.
What do the red equal signs and blue underscores mean?
A red equal sign (=) means that the indicated protein is in the COG currently active. A blue underscore (_) means that the protein is in a COG different from the active COG. To see the best hits for any protein listed in the COGnitor output - while keeping the current COG active - click on the protein name. For proteins that are not part of the active COG, you can instead click on the blue underscore to reveal its best hits. This will, at the same time, change the active COG to the COG the clicked protein is in. Visually, this will change certain blue underscores to red equal signs.
| What are phylogenetic patterns? |
| How do I read a phylogenetic pattern? |
| How do I perform a phylogenetic pattern search? |
What are phylogenetic patterns?
A phylogenetic pattern is a series of lowercase letters and/or dashes that is a shorthand representation of the presence or absence of proteins from a particular organism in the COG of interest. Each letter in a pattern represents a particular organism, given in the table below, along with the pattern position assigned to that organism.
| Letter | Organism | Position |
| a | Archaeoglobus fulgidus | 1 |
| b | Bacillus subtilis | 10 |
| c | Synechocystis sp. PCC6803 | 8 |
| e | Escherichia coli | 9 |
| g | Mycoplasma genitalium | 15 |
| h | Haemophilus influenzae | 12 |
| i | Chlamydia trachomatis | 19 |
| j | Helicobacter pylori J99 | 14 |
| k | Pyrococcus horikoshii | 4 |
| l | Treponema pallidum | 18 |
| m | Methanococcus jannaschii | 2 |
| n | Chlamydia pneumoniae | 20 |
| o | Borrelia burgdorferi | 17 |
| p | Mycoplasma pneumoniae | 16 |
| q | Aquifex aeolicus | 6 |
| r | Mycobacterium tuberculosis | 11 |
| t | Methanobacterium thermoautotrophicum | 3 |
| u | Helicobacter pylori 26695 | 13 |
| v | Thermotoga maritima | 7 |
| x | Rickettsia prowazekii | 21 |
| y | Saccharomyces cerevisiae | 5 |
How do I read a phylogenetic pattern?
Each position of the pattern is specific for a particular organism included in the COG set. For example, position 1 is always reserved for Archaeoglobus fulgidus and holds either an "a" (meaning a protein from that organism is present in the COG) or a dash (meaning no protein from that organism is present). The table below indicates the organism assigned to each position, along with the shorthand letter used.
| Position | Organism | Letter |
| 1 | Archaeoglobus fulgidus | a |
| 2 | Methanococcus jannaschii | m |
| 3 | Methanobacterium thermoautotrophicum | t |
| 4 | Pyrococcus horikoshii | k |
| 5 | Saccharomyces cerevisiae | y |
| 6 | Aquifex aeolicus | q |
| 7 | Thermotoga maritima | v |
| 8 | Synechocystis sp. PCC6803 | c |
| 9 | Escherichia coli | e |
| 10 | Bacillus subtilis | b |
| 11 | Mycobacterium tuberculosis | r |
| 12 | Haemophilus influenzae | h |
| 13 | Helicobacter pylori 26695 | u |
| 14 | Helicobacter pylori J99 | j |
| 15 | Mycoplasma genitalium | g |
| 16 | Mycoplasma pneumoniae | p |
| 17 | Borrelia burgdorferi | o |
| 18 | Treponema pallidum | l |
| 19 | Chlamydia trachomatis | i |
| 20 | Chlamydia pneumoniae | n |
| 21 | Rickettsia prowazekii | x |
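As a rough illustration of the position rule above, the following Python sketch decodes a 21-character pattern into the list of organisms represented in a COG. The letter order and organism names come from the table; the function name `decode_pattern` and the sample pattern are assumptions made for this example.

```python
# Letters in pattern order (positions 1 to 21), taken from the table above.
PATTERN_LETTERS = "amtkyqvcebrhujgpolinx"

ORGANISMS = {
    "a": "Archaeoglobus fulgidus",
    "m": "Methanococcus jannaschii",
    "t": "Methanobacterium thermoautotrophicum",
    "k": "Pyrococcus horikoshii",
    "y": "Saccharomyces cerevisiae",
    "q": "Aquifex aeolicus",
    "v": "Thermotoga maritima",
    "c": "Synechocystis sp. PCC6803",
    "e": "Escherichia coli",
    "b": "Bacillus subtilis",
    "r": "Mycobacterium tuberculosis",
    "h": "Haemophilus influenzae",
    "u": "Helicobacter pylori 26695",
    "j": "Helicobacter pylori J99",
    "g": "Mycoplasma genitalium",
    "p": "Mycoplasma pneumoniae",
    "o": "Borrelia burgdorferi",
    "l": "Treponema pallidum",
    "i": "Chlamydia trachomatis",
    "n": "Chlamydia pneumoniae",
    "x": "Rickettsia prowazekii",
}


def decode_pattern(pattern):
    """Return the organisms marked as present in a phylogenetic pattern."""
    if len(pattern) != len(PATTERN_LETTERS):
        raise ValueError("a pattern must have exactly 21 positions")
    present = []
    for position, (expected, actual) in enumerate(
        zip(PATTERN_LETTERS, pattern), start=1
    ):
        if actual == expected:
            present.append(ORGANISMS[expected])
        elif actual != "-":
            # Each position accepts only its own letter or a dash.
            raise ValueError(f"unexpected {actual!r} at position {position}")
    return present


# Hypothetical pattern: only the first four positions are filled.
archaea_only = decode_pattern("amtk" + "-" * 17)
```

Because every organism has a fixed position, a letter appearing anywhere other than its own slot is treated as an error rather than silently accepted.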
How do I perform a phylogenetic pattern search?
There are three ways to do this. One way is to click on a "pre-made" phylogenetic pattern available from the List of COGs page to reveal other COGs with that same pattern. You could also select the desired species by hand in that page's Select feature. Alternatively, you can use the Phylogenetic pattern search tool available from the home page.
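Conceptually, selecting species by hand amounts to building a pattern and then looking for COGs that carry it. The Python sketch below shows one way this could work; it assumes an exact-match search, and the function names and the COG identifiers (`COG_X`, `COG_Y`, `COG_Z`) are hypothetical placeholders rather than entries from the database. The actual search tool may behave differently.

```python
# Letters in pattern order (positions 1 to 21).
PATTERN_LETTERS = "amtkyqvcebrhujgpolinx"


def build_pattern(selected_letters):
    """Build a pattern with a letter for each selected species and a dash elsewhere."""
    return "".join(
        letter if letter in selected_letters else "-" for letter in PATTERN_LETTERS
    )


def cogs_with_pattern(cog_patterns, pattern):
    """Return the COGs whose pattern matches exactly (assumed search behavior)."""
    return sorted(cog for cog, p in cog_patterns.items() if p == pattern)


# Hypothetical data: COG identifiers and patterns are invented for illustration.
cog_patterns = {
    "COG_X": build_pattern("amtk"),
    "COG_Y": PATTERN_LETTERS,
    "COG_Z": build_pattern("amtk"),
}

query = build_pattern({"a", "m", "t", "k"})
matches = cogs_with_pattern(cog_patterns, query)
```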
What information is given in the header table?
The information and links given at the top of the page are identical to those described for the List of COGs page, and include the number of proteins in the COG, the function code, the phylogenetic pattern for the COG, the unique COG number (linked to an information page for that COG), and a descriptive COG name. Where appropriate, the pathway or functional system in which members of the COG have a role is indicated.
What does clicking on the floppy disk icon do?
This will create a file on your disk that contains the FASTA-formatted protein sequences of all COG members.
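Once saved, such a file can be read with a few lines of code. The Python sketch below is a minimal FASTA reader, not a tool provided by the COG database; the function name `read_fasta`, the sample protein names, and the sample sequences are placeholders invented for the example, and the exact header layout of the downloaded file is not assumed beyond the leading ">".

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {name: sequence}.

    The name is the first whitespace-delimited token after '>'.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Placeholder content standing in for a downloaded file.
sample = """\
>protA example description
MKVLAT
GHHQ
>protB
MSTNP
""".splitlines()

sequences = read_fasta(sample)
```

To read an actual downloaded file, the same function can be given an open file object, for example `read_fasta(open(path))`, where `path` is wherever the file was saved.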
What does "COGs containing other domains of divided proteins" indicate?
Some COGs contain proteins that have been divided into domains. In such cases, the COG(s) that contain other domains are listed.
What happens when I follow the link for a particular protein/gene name?
Each gene is linked to a display of the COGnitor output for its encoded protein, which includes a BLAST graphic and sequence alignments between query and subjects.
What does the tree graphic show?
This is a similarity tree constructed based on the multiple sequence alignment. It is not a true phylogenetic tree, though under the assumption of a constant rate of evolution, its topology may coincide with that of the genuine phylogenetic tree. Each protein is represented by a diamond whose color corresponds to the same species color used for BLAST graphics. NOTE: some COGs (those with more than 77 members) do not show this graphic, but it can be accessed by clicking the "Tree" link in the header.
What will I find on the COG information page?
NOTE: This section refers to COG Information pages that are under construction.
Systematic classification: This indicates the classification system used for the proteins in the COG. Enzymes are classified according to the Enzyme Commission (EC) number, and transport proteins are classified by a Transport Classification (TC) number. "None" indicates that no system is in effect for these proteins. Each number is linked to the relevant entry. Ambiguous numbers are used when the COG contains proteins from more than one class. Such numbers are not meant to indicate that all the subclasses apply. An indication of the classification system used without a numeric entry means that this system is likely to apply to the COG after some characterization of the members. Note that we have not indicated classes (such as "uncharacterized") that are likely to need reassignment in the future.
Gene names: For many of the species, the gene names used are derived from a systematic numbering based on position in the genome. The names listed here are the common gene names for COG members or close homologs from other species, and are linked to a PubMed search using the gene name as a search term. Other relevant terms may be added to refine the search. Note that even after refinement, some irrelevant citations may be listed. Also, the result of the PubMed search should not be considered exhaustive.
Basis for COG name: These fall into four classes:
Experimental indicates that one or more COG members have been experimentally characterized to have the activity indicated by the COG name. A representative sample of such COG members is linked to the relevant references.
Similarity indicates that some or all of the COG members are similar (based on PSI-BLAST) to characterized proteins from other COGs or other species not yet in COGs. These are linked to the relevant references.
Motif indicates that the COG members contain a sequence or structural motif with the activity or characteristics indicated by the COG name.
Operon structure indicates that the COG members are part of an operon with known function. Often, the function of the COG members can be inferred from this information.
Domains: Gives the approximate location, size and name of the domains contained within the proteins of the COG. If the structure of a COG member, homolog, or distant homolog is known, a link to the Molecular Modeling Database (MMDB) is given, along with its reference (linked to the gene name). In cases where the structure has not yet appeared in MMDB, only the reference is given.
Modified or new protein sequences: Lists those proteins in the COG that are not available in GenPept or have been modified with respect to the GenPept version. In some cases, we have determined that an open reading frame may have been missed in the original conceptual translation of a genome. Such new proteins were added to the COG database. In other cases, where it seemed likely that a protein was the result of a frameshift, the necessary modification was made. This field is only shown when applicable.
COG notes: Provides some information about the structure of the COG (such as the presence of "sub-COGs"), or general comments about the member proteins.
Protein notes: Provides comments about specific proteins in the COG, such as exceptions to the general trend of the COG members.
Predictions: Provides additional information about the predictions given in the COG name. This field is only shown when applicable.
Background: Provides some background information on the COG or the member proteins. This field is only shown when applicable.
References: Shows relevant references for the information page, if not linked elsewhere. This field is only shown when applicable.