Sequence Alignments


Introduction
Seq-align
Score: Score Of An Alignment Or Segment
Dense-diag: Segments For "diags" Seq-align
Dense-seg: Segments for "global" or "partial" Seq-align
Std-seg: Aligning Any Bioseq Type With Any Other
ASN.1 Specification: seqalign.asn
C Structures and Functions: objalign.h


 Introduction

A sequence alignment is a mapping of the coordinates of one Bioseq onto the coordinates of one or more other Bioseqs. Such a mapping may be associated with a score and/or a method for doing the alignment. An alignment can be generated algorithmically by software or manually by a scientist. The Seq-align object is designed to capture the final result of the process, not the process itself.

A Seq-align is one of the forms of Seq-annot and is as acceptable a sequence annotation as a feature table. Seq-aligns would normally be "packaged" in a Seq-annot for exchange with other tools or databases so the alignments can be identified and given a title.

The most common sequence alignment is from one sequence to another, with a one to one relationship between the aligned residues of one sequence and the residues of the other (with allowance for gaps). Two segment types, Dense-seg and Dense-diag, are specifically for this type of alignment. The Std-seg, on the other hand, is very generic and does not assume that the length of one aligned region is necessarily the same as the other. This permits expansion and contraction of one Bioseq relative to another, which is necessary in the case of a physical map Bioseq aligned to a genetic map Bioseq, or a sequence Bioseq aligned with any map Bioseq.

All the forms of Seq-align are composed of segments. Each segment is an aligned region which contains only sequence or only a gap for any sequence in the alignment. Below is a three dimensional alignment with six segments:

 

   Seq-ids
   id=100         AAGGCCTTTTAGAGATGATGATGATGATGA
   id=200         AAGGCCTaTTAG.......GATGATGATGA
   id=300         ....CCTTTTAGAGATGATGAT....ATGA
                  | 1 |   2  |   3   |4| 5  | 6|  Segments

 

Taking only two of the sequences in a two way alignment, only three segments are needed to define the alignment:

 

   Seq-ids
   id=100         AAGGCCTTTTAGAGATGATGATGATGATGA
   id=200         AAGGCCTaTTAG.......GATGATGATGA
                  |     1    |   2   |     3   |  Segments
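
In both figures a segment boundary falls wherever some sequence enters or leaves a gap. The following is a minimal sketch of that rule in C, assuming the alignment is available as equal-length text rows with '.' marking a gap. The function name count_segments is illustrative only and is not part of the NCBI toolkit.

```c
#include <stddef.h>
#include <string.h>

/* Count the segments in a gapped alignment given as `dim` text rows of
 * equal length, where '.' marks a gap.  A new segment starts at every
 * column whose gap pattern differs from that of the previous column.
 * Hypothetical helper for illustration; not an NCBI toolkit function. */
int count_segments(const char *const rows[], int dim)
{
    size_t ncols, col;
    int row, nseg;

    if (dim <= 0 || rows[0][0] == '\0')
        return 0;
    ncols = strlen(rows[0]);
    nseg = 1;                               /* the first column opens segment 1 */
    for (col = 1; col < ncols; col++) {
        for (row = 0; row < dim; row++) {
            int gap_prev = (rows[row][col - 1] == '.');
            int gap_here = (rows[row][col] == '.');
            if (gap_prev != gap_here) {     /* a row enters or leaves a gap */
                nseg++;
                break;
            }
        }
    }
    return nseg;
}
```

Passing all three rows of the first figure gives six segments; passing only the first two rows gives three, as in the second figure.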

 

Seq-align

A Seq-align is a collection of segments representing one complete alignment. The whole Seq-align may have a Score representing some measure of quality or attributing the method used to build the Seq-align. In addition, each segment may have a score for that segment alone.

type: global

A global alignment is the alignment of Bioseqs over their complete length. It expresses the relationship between the intact Bioseqs. As such it is typically used in studies of homology between closely related proteins or genomes where there is reason to believe they share a common origin over their complete lengths.

The segments making up a global alignment are assumed to be connected in order from first to last to make up the alignment, and the full lengths of all sequences are assumed to be accounted for in the alignment.

type: partial

A partial alignment only defines a relationship between sequences for the lengths actually included in the alignment. No claim is made that the relationship pertains to the full lengths of any of the sequences.

Like a global alignment, the segments making up a partial alignment are assumed to be connected in order from first to last to make up the alignment. Unlike a global alignment, it is not assumed the alignment will necessarily account for the full lengths of any or all sequences.

A partial or global alignment may use either the "denseg" choice of segment (for aligned Bioseqs with one to one residue mappings, such as protein or nucleic acid sequences) or the "std" choice for any Bioseqs including maps. In both cases there is an ordered relationship between one segment and the next to make the complete alignment.

type: diags

A Seq-align of type "diags" means that each segment is independent of the next and no claims are made about the reasonableness of connecting one segment to another. This is the kind of relationship shown by a "dot matrix" display. A series of diagonal lines in a square matrix indicate unbroken regions of similarity between the sequences. However, diagonals may overlap multiple times, or regions of the matrix may have no diagonals at all. The "diags" type of alignment captures that kind of relationship, although it is not limited to two dimensions as a dot matrix is.

The "diags" type of Seq-align may use either the "dendiag" choice of segment (for aligned Bioseqs with one to one residue mappings, such as protein or nucleic acid sequences) or the "std" choice for any Bioseqs including maps. In both cases the SEQUENCE OF does not imply any ordered relationship between one segment and the next. Each segment is independent of any other.
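
To make the idea of independent diagonals concrete, the sketch below collects the unbroken runs of identical residues along every diagonal of a two-sequence dot matrix. Each hit it reports has the shape of a two dimensional Dense-diag: one start per sequence and a shared length. The exact-identity criterion, the DiagHit type, and the function names are assumptions of this sketch; real comparison programs generally use scoring schemes rather than strict identity.

```c
#include <stddef.h>
#include <string.h>

/* One unbroken diagonal: the same shape as a two dimensional Dense-diag.
 * Illustrative type only; not the toolkit's DenseDiag. */
typedef struct diag_hit {
    long start_a;   /* lowest offset of the run in sequence a */
    long start_b;   /* lowest offset of the run in sequence b */
    long len;       /* shared length of the run */
} DiagHit;

/* Store a run that ended just before cell (end_a, end_b). */
static void record_hit(DiagHit *hits, size_t max_hits, size_t *found,
                       long end_a, long end_b, long run)
{
    if (*found < max_hits) {
        hits[*found].start_a = end_a - run;
        hits[*found].start_b = end_b - run;
        hits[*found].len = run;
    }
    (*found)++;
}

/* Find maximal runs of identical residues of at least min_len along every
 * diagonal.  Stores up to max_hits hits; returns the total number found. */
size_t find_diagonals(const char *a, const char *b, long min_len,
                      DiagHit *hits, size_t max_hits)
{
    long la = (long)strlen(a), lb = (long)strlen(b);
    long d, i, j, run;
    size_t found = 0;

    if (min_len < 1)
        min_len = 1;
    for (d = 1 - la; d < lb; d++) {         /* d = j - i names one diagonal */
        i = d < 0 ? -d : 0;
        j = d < 0 ? 0 : d;
        run = 0;
        for (; i < la && j < lb; i++, j++) {
            if (a[i] == b[j]) {
                run++;
            } else {
                if (run >= min_len)
                    record_hit(hits, max_hits, &found, i, j, run);
                run = 0;
            }
        }
        if (run >= min_len)                 /* run reached the matrix edge */
            record_hit(hits, max_hits, &found, i, j, run);
    }
    return found;
}
```

Nothing in the result says whether one hit can sensibly be joined to the next, and hits may overlap; that is the relationship a "diags" Seq-align is meant to record.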

dim: Dimensionality Of The Alignment

Most scientists are familiar with pairwise, or two dimensional, sequence alignments. However, it is often useful to align sequences in more dimensions. The "dim" attribute of Seq-align indicates the number of sequences which are SIMULTANEOUSLY aligned. A three dimensional alignment is a true three way alignment (ABC), not three pairwise alignments (AB, AC, BC). Three pairwise alignments are three Seq-align objects, each with dimension equal to two.

Another common situation is when many sequences are aligned to one, as is the case of a merge of a number of components into a larger sequence, or the relationship of many mutant alleles to the wild type sequence. This is also a collection of two dimensional alignments, where one of the Bioseqs is common to all alignments. If the wild type Bioseq is A, and the mutants are B, C, D, then the Seq-annot would contain three two dimensional alignments, AB, AC, AD.

The "dim" attribute at the level of the Seq-align is OPTIONAL, while each segment carries its own "dim" (the specification gives it a DEFAULT of 2, so it is implied when omitted). It is convenient for a global or partial alignment to state the dimensionality once for the whole alignment, and the value also serves as an integrity check: every segment in such a Seq-align should have the same dimension. For "diags", however, the segments are independent of each other and may even have different dimensions. This would be true for algorithms that locate the best n-way diagonals, where n can range from 2 to the number of sequences. For a simple dot-matrix, all segments would be dimension two.

Score: Score Of An Alignment Or Segment

A Score contains an id (of type Object-id) which is meant to identify the method used to generate the score. It could be a string (e.g. "BLAST raw score", "BLAST p value") or an integer for use by a software system planning to process a number of defined values. The value of the Score is either an integer or a real number. Both Seq-align and segment types allow more than one Score so that a variety of measures for the same alignment can be accommodated.
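
As an illustration of how a set of Scores might be consulted, the sketch below walks a linked list of scores, looks one up by its string id, and reads its value whichever choice was stored. The SimpleScore type is a simplified stand-in invented for this sketch; it is not the toolkit's Score structure, and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for Score and its Object-id.  The names and the
 * layout are assumptions made for this sketch only. */
typedef struct simple_score {
    const char *id_str;          /* string form of the id, or NULL */
    int id_num;                  /* integer form, used when id_str is NULL */
    int is_real;                 /* 0: value.i is valid, 1: value.r is valid */
    union { long i; double r; } value;
    struct simple_score *next;   /* an alignment may carry several scores */
} SimpleScore;

/* Return the first score whose string id equals `name`, or NULL. */
const SimpleScore *find_score(const SimpleScore *head, const char *name)
{
    const SimpleScore *sp;

    for (sp = head; sp != NULL; sp = sp->next)
        if (sp->id_str != NULL && strcmp(sp->id_str, name) == 0)
            return sp;
    return NULL;
}

/* Read a score as a double regardless of which choice was stored. */
double score_as_double(const SimpleScore *sp)
{
    return sp->is_real ? sp->value.r : (double)sp->value.i;
}
```

Keeping the id alongside each value is what lets several measures of the same alignment coexist in one list.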

Dense-diag: Segments For "diags" Seq-align

A Seq-align of type "diags" represents a series of unconnected diagonals as a SEQUENCE OF Dense-diag. Since each Dense-diag is unrelated to the next, the SEQUENCE OF only suggests a presentation order. It does not imply anything about the reasonableness of joining one Dense-diag to the next. In fact, for a multi-sequence comparison, each Dense-diag may have a different dimension and/or include Bioseqs not included by another Dense-diag.

A single Dense-diag defines its dimension with "dim". There should be "dim" number of Seq-ids in "ids", indicating the Bioseqs involved in the segment, in order. There should be "dim" number of integers in "starts" (offsets into the Bioseqs, starting with 0, as in any Seq-loc), giving the first (lowest numbered) residue of each Bioseq involved in the segment, in the same order as "ids". The "len" gives the length of the aligned region, which is the same for every Bioseq in the segment. Thus the last residue involved in the segment for every Bioseq will be its "start" plus ("len" - 1).

In the case of nucleic acids, if any or all of the segments are on the complement strand of the original Bioseq, then there should be "dim" number of Na-strand in "strands" in the same order as "ids", indicating which segments are on the plus or minus strands. Whether or not a segment is on the minus strand does NOT affect the values chosen for "starts". Each is still the lowest numbered offset of a residue involved in the segment.

Clearly all Bioseq regions involved in a Dense-diag must have the same length, so this form does not allow stretching of one Bioseq compared to another, as may occur when comparing a genetic map Bioseq to a physical map or sequence Bioseq. In this case one would use Std-seg.
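
The arithmetic of a Dense-diag can be sketched as follows. SimpleDiag is a cut-down stand-in whose field names follow the ASN.1 specification; it and the two functions are assumptions of this sketch rather than toolkit definitions. The bounds check also assumes the caller knows the full length of each Bioseq.

```c
#include <stddef.h>

/* Simplified stand-in for a Dense-diag (ids, strands and scores omitted).
 * Not the toolkit's DenseDiag. */
typedef struct simple_diag {
    int dim;             /* number of Bioseqs in the segment */
    const long *starts;  /* dim offsets, 0-based, lowest residue of each */
    long len;            /* shared length of every aligned region */
} SimpleDiag;

/* Offset of the last (highest numbered) residue of row `row`:
 * start + (len - 1).  The strand does not enter into it. */
long diag_stop(const SimpleDiag *dd, int row)
{
    return dd->starts[row] + (dd->len - 1);
}

/* Return 1 if every aligned region fits inside its Bioseq, else 0.
 * `seq_lens` holds the full length of each Bioseq in "ids" order. */
int diag_fits(const SimpleDiag *dd, const long *seq_lens)
{
    int row;

    if (dd->dim < 2 || dd->len < 1)   /* an alignment needs two or more rows */
        return 0;
    for (row = 0; row < dd->dim; row++) {
        if (dd->starts[row] < 0 || diag_stop(dd, row) >= seq_lens[row])
            return 0;
    }
    return 1;
}
```

For the last segment of the two way alignment in the Introduction (starts 19 and 12, len 11), diag_stop gives 29 for id=100 and 22 for id=200.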

Dense-seg: Segments for "global" or "partial" Seq-align

A Dense-seg is a single entity which describes a complete global or partial alignment containing many segments. Like Dense-diag above, it is only appropriate when there is no stretching of the Bioseq coordinates relative to each other (as may happen when aligning a physical to a genetic map Bioseq). When such stretching occurs, one would use a SEQUENCE OF Std-seg, described below.

A Dense-seg must give the dimension of the alignment in "dim" and the number of segments in the alignment in "numseg". The "ids" slot must contain "dim" number of "Seq-ids" for the Bioseqs used in the alignment.

The "starts" slot contains the lowest numbered residue contained in each segment, in "ids" order. The "starts" slot should have "numseg" times "dim" integers: the start of each Bioseq in the first segment in "ids" order, followed by the start of each Bioseq in the second segment in "ids" order, and so on. A "start" of minus one indicates that the Bioseq is not present in the segment (i.e. a gap in that Bioseq).

The "lens" slot contains the length of each segment in segment order, so "lens" will contain "numseg" integers.
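
The layout of "starts" and "lens" can be sketched as a pair of accessors. The start for Bioseq number row in segment number seg sits at index seg * dim + row, and there is one "lens" entry per segment. SimpleDenseSeg and the function names are assumptions made for this sketch, not the toolkit's DenseSeg interface.

```c
#include <stddef.h>

/* Simplified stand-in for a Dense-seg (ids, strands and scores omitted).
 * Not the toolkit's DenseSeg. */
typedef struct simple_denseg {
    int dim;             /* rows (Bioseqs) in the alignment */
    int numseg;          /* number of segments */
    const long *starts;  /* numseg * dim offsets, segment by segment; -1 = gap */
    const long *lens;    /* numseg segment lengths */
} SimpleDenseSeg;

/* Start offset of Bioseq `row` in segment `seg`, or -1 for a gap. */
long denseg_start(const SimpleDenseSeg *ds, int seg, int row)
{
    return ds->starts[(size_t)seg * (size_t)ds->dim + (size_t)row];
}

/* Number of residues of Bioseq `row` covered by the alignment: the sum of
 * "lens" over the segments in which the row is not a gap. */
long denseg_covered(const SimpleDenseSeg *ds, int row)
{
    long total = 0;
    int seg;

    for (seg = 0; seg < ds->numseg; seg++)
        if (denseg_start(ds, seg, row) != -1)
            total += ds->lens[seg];
    return total;
}
```

For the three dimensional alignment in the Introduction, denseg_covered returns 30, 23 and 22 for the three Bioseqs, which matches the number of residue characters in each row of the figure.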

If any or all of the sequences are on the minus strand of the original Bioseq, then there should be "numseg" times "dim" Na-strand values in "strands" in the same order as "starts". Whether a sequence segment is on the plus or minus strand has NO effect on the value selected for "starts". It is ALWAYS the lowest numbered residue included in the segment.
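
Because "starts" always holds the lowest numbered residue, software that needs the residue under a particular alignment column has to take the strand into account itself. The sketch below shows one way to do so. It assumes that a minus strand segment is read from its highest offset downwards, which is this sketch's interpretation and is not spelled out above; the function name is hypothetical.

```c
/* Bioseq offset aligned to column `k` (0-based) of one segment.
 * `start` is always the lowest numbered residue of the segment; on the
 * minus strand the columns are taken to run from the highest offset down.
 * Returns -1 for a gap (start < 0) or an out-of-range column. */
long segment_residue(long start, long len, int minus_strand, long k)
{
    if (start < 0 || k < 0 || k >= len)
        return -1;
    return minus_strand ? start + (len - 1) - k : start + k;
}
```

For a segment with start 12 and length 7, column 0 maps to offset 12 on the plus strand and to offset 18 on the minus strand, while "starts" holds 12 in both cases.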

The "scores" is a SEQUENCE OF Score, one for each segment. So there should be "numseg" Scores, if "scores" is filled. A single Score for the whole alignment would appear in the "score" slot of the Seq-align.

The three dimensional alignment shown above is repeated below, followed by its ASN.1 encoding into a Seq-align using Dense-seg. The Seq-ids are given in the ASN.1 as type "local".

   Seq-ids
   id=100         AAGGCCTTTTAGAGATGATGATGATGATGA
   id=200         AAGGCCTaTTAG.......GATGATGATGA
   id=300         ....CCTTTTAGAGATGATGAT....ATGA
                  | 1 |   2  |   3   |4| 5  | 6|  Segments

 

Seq-align ::= {
   type global ,
   dim 3 ,
   segs denseg {
      dim 3 ,
      numseg 6 ,
      ids {
         local id 100 ,
         local id 200 ,
         local id 300 } ,
      starts { 0,0,-1, 4,4,0, 12,-1,8, 19,12,15, 22,15,-1, 26,19,18 } ,
      lens { 4, 8, 7, 3, 4, 4 } } }
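
One way to check an encoding like this is to rebuild a row of the figure from it. The sketch below does so for a single Bioseq, given its ungapped sequence. It handles the plus strand only, and the function name and argument list are assumptions of this sketch rather than toolkit functions.

```c
#include <stddef.h>
#include <string.h>

/* Rebuild one gapped row of an alignment from Dense-seg style arrays.
 * `seq` is the ungapped sequence of Bioseq `row`; `out` must have room for
 * the sum of `lens` plus a terminating NUL.  Plus strand only.
 * Returns the number of columns written, or -1 if a segment runs past the
 * end of `seq`.  Hypothetical helper; not an NCBI toolkit function. */
long denseg_render_row(int dim, int numseg, const long *starts,
                       const long *lens, int row, const char *seq, char *out)
{
    long seqlen = (long)strlen(seq);
    long col = 0;
    int seg;

    for (seg = 0; seg < numseg; seg++) {
        long start = starts[seg * dim + row];
        long len = lens[seg];

        if (start == -1) {
            memset(out + col, '.', (size_t)len);        /* gap in this row */
        } else {
            if (start < 0 || start + len > seqlen)
                return -1;
            memcpy(out + col, seq + start, (size_t)len);
        }
        col += len;
    }
    out[col] = '\0';
    return col;
}
```

Called with row 1 and the 23 residues of id=200, it reproduces the second line of the figure, including the seven column gap.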

Std-seg: Aligning Any Bioseq Type With Any Other

A SEQUENCE OF Std-seg can be used to describe any Seq-align type on any types of Bioseqs. A Std-seg is purely a collection of correlated Seq-locs. There is no requirement that the length of each Bioseq in a segment be the same as the other members of the segment or that the same Seq-loc type be used for each member of the segment. This allows stretching of one Bioseq relative to the other(s) and potentially very complex descriptions of relationships between sequences.

Each Std-seg must give its dimension, so it can be used for "diags". Optionally it can give the Seq-ids for the Bioseqs used in the segment (again a convenience for Seq-align of type "diags"). The "loc" slot gives the locations on the Bioseqs used in this segment. As usual, there is also a place for various Score(s) associated with the segment. The example given above is presented again, this time as a Seq-align using Std-segs. Note the use of Seq-loc type "empty" to indicate a gap. Alternatively one could simply change the "dim" for each segment to exclude the Bioseqs not present in the segment, although this would require more interpretation by software.

   Seq-ids
   id=100         AAGGCCTTTTAGAGATGATGATGATGATGA
   id=200         AAGGCCTaTTAG.......GATGATGATGA
   id=300         ....CCTTTTAGAGATGATGAT....ATGA
                  | 1 |   2  |   3   |4| 5  | 6|  Segments

 

Seq-align ::= {
   type global ,
   dim 3 ,
   segs std {
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 0 ,
                  to 3 } ,
            int {
                  id local id 200 ,
                  from 0 ,
                  to 3 } ,
            empty local id 300 } } ,
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 4 ,
                  to 11 } ,
            int {
                  id local id 200 ,
                  from 4 ,
                  to 11 } ,
            int {
                  id local id 300 ,
                  from 0 ,
                  to 7 } } } ,
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 12 ,
                  to 18 } ,
            empty local id 200 ,
            int {
                  id local id 300 ,
                  from 8 ,
                  to 14 } } } ,
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 19 ,
                  to 21 } ,
            int {
                  id local id 200 ,
                  from 12 ,
                  to 14 } ,
            int {
                  id local id 300 ,
                  from 15 ,
                  to 17 } } } ,
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 22 ,
                  to 25 } ,
            int {
                  id local id 200 ,
                  from 15 ,
                  to 18 } ,
            empty local id 300 } } ,
      {
         dim 3 ,
         loc {
            int {
                  id local id 100 ,
                  from 26 ,
                  to 29 } ,
            int {
                  id local id 200 ,
                  from 19 ,
                  to 22 } ,
            int {
                  id local id 300 ,
                  from 18 ,
                  to 21 } } } } }

Clearly the Std-seg method should only be used when its flexibility is required. Nonetheless, there is no ready substitute for Std-seg when flexibility is demanded.
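
The Dense-seg and Std-seg encodings of the example carry the same information, and the correspondence can be sketched in a few lines: each start and segment length becomes a closed interval from start to start + len - 1, and each start of -1 becomes an "empty" location. SimpleLoc and denseg_to_locs are illustrative names invented for this sketch; the toolkit represents these locations as Seq-locs.

```c
#include <stddef.h>

/* One member of a Std-seg-like segment: a closed interval [from, to], or
 * an "empty" location when is_empty is nonzero.  Illustrative type only. */
typedef struct simple_loc {
    int is_empty;
    long from;
    long to;
} SimpleLoc;

/* Fill locs[numseg * dim] from Dense-seg style starts and lens, keeping
 * the same segment-by-segment order.  Returns the number of locs written. */
size_t denseg_to_locs(int dim, int numseg, const long *starts,
                      const long *lens, SimpleLoc *locs)
{
    size_t n = 0;
    int seg, row;

    for (seg = 0; seg < numseg; seg++) {
        for (row = 0; row < dim; row++, n++) {
            long start = starts[n];

            locs[n].is_empty = (start == -1);
            locs[n].from = start;                       /* -1 when empty */
            locs[n].to = (start == -1) ? -1 : start + lens[seg] - 1;
        }
    }
    return n;
}
```

The reverse conversion is not always possible, since the members of a Std-seg need not have equal lengths; that is the flexibility the Dense-seg gives up in exchange for its compactness.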

ASN.1 Specification: seqalign.asn

--$Revision: 2.0 $
--**********************************************************************
--
--  NCBI Sequence Alignment elements
--  by James Ostell, 1990
--
--**********************************************************************

NCBI-Seqalign DEFINITIONS ::=
BEGIN

EXPORTS Seq-align;

IMPORTS Seq-id, Seq-loc , Na-strand FROM NCBI-Seqloc
        Object-id FROM NCBI-General;

--*** Sequence Alignment ********************************
--*

Seq-align ::= SEQUENCE {
    type ENUMERATED {
        not-set (0) ,
        global (1) ,
        diags (2) ,
        partial (3) ,           -- mapping pieces together
        other (255) } ,
    dim INTEGER OPTIONAL ,     -- dimensionality
    score SET OF Score OPTIONAL ,   -- for whole alignment
    segs CHOICE {                   -- alignment data
        dendiag SEQUENCE OF Dense-diag ,
        denseg Dense-seg ,
        std SEQUENCE OF Std-seg } }

Dense-diag ::= SEQUENCE {         -- for (multiway) diagonals
    dim INTEGER DEFAULT 2 ,    -- dimensionality
    ids SEQUENCE OF Seq-id ,   -- sequences in order
    starts SEQUENCE OF INTEGER ,  -- start OFFSETS in ids order
    len INTEGER ,                 -- len of aligned segments
    strands SEQUENCE OF Na-strand OPTIONAL ,
    scores SET OF Score OPTIONAL }

    -- Dense-seg: the densest packing for sequence alignments only.
    --            a start of -1 indicates a gap for that sequence of
    --            length lens.
    --
    -- id=100  AAGGCCTTTTAGAGATGATGATGATGATGA
    -- id=200  AAGGCCTTTTAG.......GATGATGATGA
    -- id=300  ....CCTTTTAGAGATGATGAT....ATGA
    --
    -- dim = 3, numseg = 6, ids = { 100, 200, 300 }
    -- starts = { 0,0,-1, 4,4,0, 12,-1,8, 19,12,15, 22,15,-1, 26,19,18 }
    -- lens = { 4, 8, 7, 3, 4, 4 }
    --

Dense-seg ::= SEQUENCE {          -- for (multiway) global or partial alignments
    dim INTEGER DEFAULT 2 ,       -- dimensionality
    numseg INTEGER ,              -- number of segments here
    ids SEQUENCE OF Seq-id ,      -- sequences in order
    starts SEQUENCE OF INTEGER ,  -- start OFFSETS in ids order within segs
    lens SEQUENCE OF INTEGER ,    -- lengths in ids order within segs
    strands SEQUENCE OF Na-strand OPTIONAL ,
    scores SEQUENCE OF Score OPTIONAL }  -- score for each seg

Std-seg ::= SEQUENCE {
    dim INTEGER DEFAULT 2 ,       -- dimensionality
    ids SEQUENCE OF Seq-id OPTIONAL ,
    loc SEQUENCE OF Seq-loc ,
    scores SET OF Score OPTIONAL }

Score ::= SEQUENCE {
    id Object-id OPTIONAL ,
    value CHOICE {
        real REAL ,
        int INTEGER  } }

END

C Structures and Functions: objalign.h

/*  objalign.h
* ===========================================================================
*
*                            PUBLIC DOMAIN NOTICE
*               National Center for Biotechnology Information
*
*  This software/database is a "United States Government Work" under the
*  terms of the United States Copyright Act.  It was written as part of
*  the author's official duties as a United States Government employee and
*  thus cannot be copyrighted.  This software/database is freely available
*  to the public for use. The National Library of Medicine and the U.S.
*  Government have not placed any restriction on its use or reproduction.
*
*  Although all reasonable efforts have been taken to ensure the accuracy
*  and reliability of the software and data, the NLM and the U.S.
*  Government do not and cannot warrant the performance or results that
*  may be obtained by using this software or data. The NLM and the U.S.
*  Government disclaim all warranties, express or implied, including
*  warranties of performance, merchantability or fitness for any particular
*  purpose.
*
*  Please cite the author in any work or product based on this material.
*
* ===========================================================================
*
* File Name:  objalign.h
*
* Author:  James Ostell
*
* Version Creation Date: 4/1/91
*
* $Revision: 2.0 $
*
* File Description:  Object manager interface for module NCBI-Seqalign
*
* Modifications:
* --------------------------------------------------------------------------
* Date    Name        Description of modification
* -------  ----------  -----------------------------------------------------
*
*
* ==========================================================================
*/

 

#ifndef _NCBI_Seqalign_
#define _NCBI_Seqalign_

#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifndef _NCBI_General_
#include <objgen.h>
#endif
#ifndef _NCBI_Seqloc_
#include <objloc.h>
#endif

#ifdef __cplusplus
extern "C" {
#endif

/*****************************************************************************
*
*   loader
*
*****************************************************************************/
extern Boolean SeqAlignAsnLoad PROTO((void));

/*****************************************************************************
*
*   internal structures for NCBI-Seqalign objects
*
*****************************************************************************/

/*****************************************************************************
*
*   Score
*     NOTE: read, write, and free always process GROUPS of scores
*
*****************************************************************************/
typedef struct score {
    ObjectIdPtr id;
    Uint1 choice;          /* 0=not set, 1=int, 2=real */
    DataVal value;
    struct score PNTR next;    /* for sets of scores */
} Score, PNTR ScorePtr;

ScorePtr ScoreNew PROTO((void));
Boolean ScoreSetAsnWrite PROTO((ScorePtr sp, AsnIoPtr aip, AsnTypePtr settype, AsnTypePtr elementtype));
ScorePtr ScoreSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr settype, AsnTypePtr elementtype));
ScorePtr ScoreSetFree PROTO((ScorePtr anp));

 

/*****************************************************************************
*
*   SeqAlign
*   type =  type of alignment
        not-set (0) ,
        global (1) ,
        diags (2) ,
        partial (3) ,           -- mapping pieces together
        other (255) } ,
    segtype = type of segs structure
        not-set 0
        dendiag 1
        denseg 2
        std 3
*
*
*****************************************************************************/
typedef struct seqalign {
    Uint1 type,
        segtype;
    Int2 dim;
    ScorePtr score;
    Pointer segs;
    struct seqalign PNTR next;
} SeqAlign, PNTR SeqAlignPtr;

SeqAlignPtr SeqAlignNew PROTO((void));
Boolean SeqAlignAsnWrite PROTO((SeqAlignPtr anp, AsnIoPtr aip, AsnTypePtr atp));
SeqAlignPtr SeqAlignAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
SeqAlignPtr SeqAlignFree PROTO((SeqAlignPtr anp));

/*****************************************************************************
*
*   SeqAlignSet
*
*****************************************************************************/
Boolean SeqAlignSetAsnWrite PROTO((SeqAlignPtr anp, AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));
SeqAlignPtr SeqAlignSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr set, AsnTypePtr element));

 

/*****************************************************************************
*
*   DenseDiag
*
*
*****************************************************************************/
typedef struct dendiag {
    Int2 dim;                   /* this is a convenience, not in asn1 */
    SeqIdPtr id;
    Int4Ptr starts;
    Int4 len;
    Uint1Ptr strands;
    ScorePtr scores;
    struct dendiag PNTR next;
} DenseDiag, PNTR DenseDiagPtr;

DenseDiagPtr DenseDiagNew PROTO((void));
Boolean DenseDiagAsnWrite PROTO((DenseDiagPtr ddp, AsnIoPtr aip, AsnTypePtr atp));
DenseDiagPtr DenseDiagAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
DenseDiagPtr DenseDiagFree PROTO((DenseDiagPtr ddp));

/*****************************************************************************
*
*   DenseSeg
*
*
*****************************************************************************/
typedef struct denseg {
    Int2 dim,
        numseg;
    SeqIdPtr ids;
    Int4Ptr starts;
    Int4Ptr lens;
    Uint1Ptr strands;
    ScorePtr scores;
} DenseSeg, PNTR DenseSegPtr;

DenseSegPtr DenseSegNew PROTO((void));
Boolean DenseSegAsnWrite PROTO((DenseSegPtr dsp, AsnIoPtr aip, AsnTypePtr atp));
DenseSegPtr DenseSegAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
DenseSegPtr DenseSegFree PROTO((DenseSegPtr dsp));

/*****************************************************************************
*
*   StdSeg
*
*
*****************************************************************************/
typedef struct stdseg {
    Int2 dim;
    SeqIdPtr ids;    /* SeqId s */
    SeqLocPtr loc;    /* SeqLoc s */
    ScorePtr scores;
    struct stdseg PNTR next;
} StdSeg, PNTR StdSegPtr;

StdSegPtr StdSegNew PROTO((void));
Boolean StdSegAsnWrite PROTO((StdSegPtr ssp, AsnIoPtr aip, AsnTypePtr atp));
StdSegPtr StdSegAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
StdSegPtr StdSegFree PROTO((StdSegPtr ssp));

#ifdef __cplusplus
}
#endif

#endif