Influenza Virus Resource presents data obtained from the NIAID Influenza Genome Sequencing Project as well as from GenBank, combined with tools for flu sequence analysis and annotation. In addition, it provides links to other resources that contain flu sequences, publications and general information about flu viruses.

Help Document
The NCBI Influenza Virus Sequence Database
Genome Set
Alignment
Clustering and phylogenetic analysis
Sequence annotation
FTP

The NCBI Influenza Virus Sequence Database
To search the database, first choose whether to search for protein sequences, their coding regions, or nucleotide sequences by selecting the radio button in front of the sequence type. Then select one name from each of the lists provided and/or fill in the boxes. The fields are Virus species (e.g. Influenzavirus A or Influenzavirus B), Host (e.g. Human or Avian), Subtype (e.g. H3N2 or H5), Segment (1 through 8) or protein name (e.g. PB1-F2 or M1), Country/Region (e.g. Australia or Asia), the year (or range of years) in which the viruses were isolated, and a range of sequence lengths.

In the advanced database search tool, multiple names can be selected simultaneously for Species, Host, Country/Region, Segment and Subtype. A list of subtypes separated by commas (e.g. H5N1,H3,N2) can be entered in the boxes after "Only these Subtypes" and/or "All Subtypes except". The number of sequences found by a query will be displayed after the "Update count" button is clicked.

A text string or a nucleotide/protein sequence fragment (e.g. New York, AGCGAAAGCAGGGGT or RSKV) can be entered in the "Search by a string" box to be included in the search.

You can limit your search results to "Full-length sequences only" by checking the appropriate box.

When the box in front of "Remove identical sequences" is checked, each group of identical sequences in a dataset is represented by the oldest sequence in the group. This can reduce the number of sequences by keeping only unique sequences in the dataset.
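The effect of "Remove identical sequences" can be illustrated with a minimal sketch. The record layout (dictionaries with acc, seq and year keys) and the use of the year to decide which sequence is "oldest" are assumptions made for this example; they are not a description of how the server implements the option.

```python
def remove_identical(records):
    """Keep one record per distinct sequence: the one with the earliest year.

    `records` is a list of dicts with hypothetical keys "acc", "seq", "year".
    """
    oldest = {}
    for rec in records:
        key = rec["seq"].upper()  # compare sequences case-insensitively
        if key not in oldest or rec["year"] < oldest[key]["year"]:
            oldest[key] = rec
    return list(oldest.values())
```

Each group of identical sequences ends up represented by a single record, so the returned list can be shorter than the input.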

By checking the box in front of "Sequences from the FLU project only", you can limit your search results to sequences from large-scale flu genome sequencing projects, which usually provide complete genomes, detailed source information and high-quality annotations. Currently, this includes sequences from the NIAID Influenza Genome Sequencing Project, the St. Jude Influenza Genome Project, the Centers for Disease Control and Prevention, the Air Force Institute for Operational Health, and the University of Hong Kong.

Sequences of recombinant or lab strains (those flagged as "LAB" in the country field) are not included in the search by default, and the box next to "Include Lab strains" should be checked if desired.

After clicking the "Add to Query Builder" button, the query you selected and the number of resulting sequences will be shown in "Query Builder". If "any" is selected in "Virus Species" and/or "Segment", a warning message in red will be shown and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps. A sample query page can be found here. Nucleotide or protein sequences can also be searched by adding the accession number in the box to the left of the "Find a sequence by Accession" button.

Multiple queries can be built by repeating the above steps. When a different "Virus Species" and/or "Segment" is selected in the new query, the same warning message in red described above will be shown and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps. When a different sequence type (i.e. Protein sequence, Coding region or Nucleotide sequence) is selected for the new query, a pop-up window will ask whether you indeed would like to start a new query with a new sequence type (which will clear the current "Query Builder"), or you want to continue with the current sequence type by going back to the current query builder. This is to prevent mixing different sequence types in the same "Query Builder" (e.g. protein sequences with nucleotide sequences). Queries in any combination from the "Query Builder" can be selected to get sequences from the database.

Sequences found by the selected queries will be shown in a separate window once you click the "Get sequences" button. The display can be sorted by up to three fields in sequence, by selecting one field in each of the "Ordered by the following fields" boxes. A sample resulting page can be found here.

Sequences of interest can be selected by checking the boxes to the left of accession numbers. The corresponding protein, coding region or nucleotide sequences of the selected sequences can be downloaded by selecting the appropriate name in the "Select FASTA sequences to download" drop-down menu. To help users identify the downloaded sequences, the following string is inserted between the GenBank sequence identifier and the sequence title in the FASTA definition lines: /host/segment number(name)/subtype/country/year/month/date/. A list of GenBank accession numbers for selected protein or nucleotide sequences can also be downloaded from the "Select accession list to download" menu.
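The inserted definition-line string can be split into its fields after download. The sketch below assumes the string appears exactly in the form /host/segment number(name)/subtype/country/year/month/date/ somewhere in the definition line; the function name, the field names and the sample definition line used in the comment are illustrative only.

```python
import re

# Pattern for /host/segment(name)/subtype/country/year/month/date/
DEFLINE_FIELDS = re.compile(
    r"/(?P<host>[^/]*)/(?P<segment>\d+)\((?P<name>[^)]*)\)/(?P<subtype>[^/]*)"
    r"/(?P<country>[^/]*)/(?P<year>[^/]*)/(?P<month>[^/]*)/(?P<day>[^/]*)/"
)

def parse_defline(defline):
    """Return the metadata fields as a dict of strings, or None if absent."""
    match = DEFLINE_FIELDS.search(defline)
    return match.groupdict() if match else None

# Example with a made-up definition line:
# parse_defline(">XX000000 /Human/4(HA)/H3N2/USA/2003/01/15/ Influenza A virus")
```

All values are returned as strings, since fields such as month or date may be empty for some records.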

Further analysis of the selected sequences can be performed by clicking the "Do multiple alignment" or "Build a tree" button, if these are allowed (i.e. the queries did not mix species and/or segments). A user's own sequences (of the same sequence type, in FASTA format) can be added to the selected sequences for analysis by clicking the "Add your own sequences" button. The file of added sequences cannot exceed 128 KB.

Genome Set
The Influenza Virus Genome Set Tool displays nucleotide sequences obtained from the NCBI Influenza Virus Sequence Database ordered by genome segments for each virus. All segments of the same virus are grouped together in the same background color, alternating in light blue and white. Genomes of the same virus isolate but sequenced in different labs are identified in the database, and are grouped separately based on the sequence submitters. This tool is a convenient way to check the completeness of genome segments for viruses of interest.

Database searches can be performed as described above, and nucleotide sequences can also be searched by entering a complete or partial virus name (e.g. Influenza A virus (A/New York/19/2003(H3N2)) or New York) in the box to the left of "Search by a string". By default, this tool only returns viruses with a complete set of full-length (or nearly full-length) segments. To get all viruses with any number of sequences, check the box after "Show all sequences". The results are shown in descending order of the number of segments each virus has.

Alignment
Multiple alignments of nucleotide or protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file can be obtained using the MUSCLE program. Start the alignment by selecting the Alignment button in the top horizontal bar. This will open a database query interface similar to the one described above. Please follow the instructions for database queries and be sure to select sequences from the same segment of the genome, preferably of similar sizes.

At most 1,000 sequences can be included in an alignment. For datasets larger than 1,000 sequences, it is recommended to download the sequences using the download tool of the database and run the multiple sequence alignment with a locally installed program (e.g. MUSCLE).

After sequences of interest are selected from the database and/or added from an input file, click the "Do multiple alignment" button to get the alignment. The consensus sequence is displayed at the top of the alignment; residues identical to the consensus are shown as dots and gaps are shown as dashes. In the coding region alignment, non-synonymous changes (in triplets) are highlighted in a different background color. The alignment can be navigated horizontally either by typing the desired start position in the text box after "Go to position" and clicking "Go", or by moving the scroll bar below the alignment.

Clicking a sequence in the alignment opens a small pop-up window. The GenBank record for the sequence can be opened by clicking the accession number in the pop-up window. The sequence can also be selected for BLAST 2 Sequences (click the "BLAST 2 seq." button after two different sequences have been selected from the alignment). Choosing the "Select for anchor" option in the pop-up window replaces the consensus sequence with the selected sequence. Clicking the anchor sequence opens a small window with options: the anchor can be reset to the consensus sequence, and the anchor/consensus sequence can be displayed for copying.

The multiple alignment can be downloaded in FASTA format by selecting "Download alignment". A printer-friendly version of the alignment can be obtained by clicking the "Print-friendly version" button. If desired, click the "Build a tree" button to build a tree from the aligned sequences.
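The dot-and-dash display relative to a consensus or anchor sequence can be reproduced on a downloaded alignment. This is a minimal sketch, assuming all aligned sequences have equal length and that the consensus is simply the most frequent character in each column (gaps included); the viewer's actual consensus rule is not documented here, and the function names are illustrative.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent character per column of equal-length aligned strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def dot_view(aligned, anchor=None):
    """Show matches to the reference as '.', gaps as '-', differences as-is."""
    ref = anchor if anchor is not None else consensus(aligned)
    return [
        "".join("-" if c == "-" else "." if c == r else c
                for c, r in zip(seq, ref))
        for seq in aligned
    ]

# dot_view(["ACGT", "ACGA", "A-GT"]) -> ["....", "...A", ".-.."]
```

Passing one of the aligned sequences as anchor mimics the "Select for anchor" option: differences are then reported relative to that sequence instead of the consensus.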

Clustering and phylogenetic analysis
Scope
The interactive tool DatasetExplorer is part of the NCBI Influenza Virus Resource and provides an easy way to perform preliminary analysis of nucleotide and protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file. Datasets are represented visually as phylogenetic/clustering trees. Users can select both the algorithm used to build the tree and the similarity criterion.

Overview of the Methodology
Start the tool by clicking the "Tree" button in the top horizontal bar. Sequences are acquired from the NCBI Influenza Virus Sequence Database or uploaded by a user as described above. After a dataset has been selected, the sequences are aligned using a multiple alignment algorithm in order to identify common regions in the sequences and establish correspondence between sequence columns (we perform multiple protein alignment, while alignment of the nucleotide sequences for the coding regions is induced by the protein alignment). Distances between sequences are calculated based on their dissimilarity in a selected region of the alignment, and the analysis is performed. We offer visualization based on phylogenetic and clustering tree methods: the classical neighbor-joining method and agglomerative hierarchical clustering methods.

Alignment of protein sequences is performed using the protein multiple alignment tool MUSCLE. We offer different distance measures for calculating pairwise distances between sequences. In particular, we use some of the distances implemented in the PHYLIP package, as well as the mPAM weight matrix.

Sequence Alignment
The tool performs multiple protein alignments using the MUSCLE program and creates the nucleotide alignment of the corresponding coding regions from the protein alignment by using the codon-amino acid correspondence.
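The idea of inducing a coding-region alignment from a protein alignment can be sketched as follows. The function below is illustrative only: it assumes the coding sequence starts in frame at the first aligned residue and that gaps in the protein alignment are written as "-"; it is not the tool's own code.

```python
def back_align(aligned_protein, cds):
    """Thread an ungapped coding sequence onto a gapped protein alignment row.

    Each residue consumes one codon; each protein gap becomes '---'.
    Codons left over at the end (e.g. a stop codon) are ignored.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues > len(codons):
        raise ValueError("coding sequence is too short for this protein row")
    it = iter(codons)
    return "".join("---" if aa == "-" else next(it) for aa in aligned_protein)

# back_align("M-K", "ATGAAATAA") -> "ATG---AAA"
```

Applying this to every row of the protein alignment yields nucleotide rows of equal length, with gaps always falling on codon boundaries.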

After sequences are obtained from the NCBI Influenza Virus Sequence Database and/or a user's input file, click the "Build a tree" button on the database query results page to start the process. This opens a window with a graphic view of the multiple sequence alignment.

Sequence Region Selection
The graphic view of the multiple alignment of the sequences selected in the previous step is displayed. The black and red colors in the graphics represent the presence and absence of amino acid residues at the corresponding positions. For each sequence, the positions of its first and last amino acid in the longest sequence of the selected set are shown. A histogram showing the total number of amino acid residues at each position is displayed at the top of the page. The program automatically selects the sequence region to be analyzed so that the majority of the sequences in the set will be included. Users can also define the sequence region themselves by first selecting all sequences in the set and then entering the start and end positions in the boxes provided. When the "Select sequences" button is clicked, the region is selected from all sequences that have complete coverage between the two positions, and sequences excluded from the selection are highlighted with a background color in the graphic view.

Phylogenetic/Clustering Tree
A clustering or phylogenetic tree can be built by selecting one of the clustering algorithms and a distance calculation method from the lists, and clicking the "Next step" button.

Sequences of interest can be highlighted in the tree, and they can be selected or deselected using the check boxes to the right of each sequence.

Distance methods approximating minimum evolution
Method | Description
Neighbor-Joining | At each step, the pair of nodes with the smallest value of $D_{ij} - b_i - b_j$ is chosen, where $D_{ij}$ is the distance between nodes $i$ and $j$, $n$ is the current number of nodes, and $b_i = \sum_{k=1}^{n} D_{ik} / (n-2)$. The distance between the new node $u$ and each remaining node $k$ is defined as $D_{uk} = (D_{ik} + D_{jk} - D_{ij}) / 2$. Branch lengths are defined as $v_{ui} = (D_{ij} + b_i - b_j) / 2$ and $v_{uj} = (D_{ij} + b_j - b_i) / 2$ (negative lengths are truncated to zero).
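A single neighbor-joining step can be written out directly from these definitions, taking b_i as the sum of distances from node i to all other nodes divided by n-2. The sketch below is a minimal illustration, not the tool's implementation; the dict-of-dicts distance layout and the function name are assumptions, and at least three nodes are required.

```python
def nj_step(D):
    """One neighbor-joining step on a symmetric dict-of-dicts distance table.

    Returns (i, j, v_ui, v_uj, new_dists), where new_dists maps each
    remaining node k to its distance from the new node u.
    """
    nodes = list(D)
    n = len(nodes)
    if n < 3:
        raise ValueError("need at least three nodes")
    # b_i: sum of distances from i to every other node, divided by n - 2
    b = {i: sum(D[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
    # choose the pair minimizing D_ij - b_i - b_j
    i, j = min(
        ((p, q) for a, p in enumerate(nodes) for q in nodes[a + 1:]),
        key=lambda pq: D[pq[0]][pq[1]] - b[pq[0]] - b[pq[1]],
    )
    # branch lengths, with negative values truncated to zero
    v_ui = max(0.0, (D[i][j] + b[i] - b[j]) / 2)
    v_uj = max(0.0, (D[i][j] + b[j] - b[i]) / 2)
    # distances from the new node u to the remaining nodes
    new = {k: (D[i][k] + D[j][k] - D[i][j]) / 2
           for k in nodes if k not in (i, j)}
    return i, j, v_ui, v_uj, new
```

Repeating the step on the reduced table (with u added and i, j removed) until two nodes remain builds the full tree.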

Agglomerative hierarchical clustering methods
Method | Alternative name | Distance between clusters defined as
Average Linkage | UPGMA | Average distance between pairs of objects, one in each cluster
Complete Linkage | Furthest Neighbor | Maximum distance between pairs of objects, one in each cluster
Single Linkage | Nearest Neighbor | Minimum distance between pairs of objects, one in each cluster
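The three linkage definitions differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, with hypothetical function and parameter names:

```python
from itertools import product
from statistics import mean

# Summary applied to all cross-cluster pairwise distances
LINKAGE = {"average": mean, "complete": max, "single": min}

def cluster_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under the chosen linkage.

    `dist` is any function returning the distance between two objects.
    """
    return LINKAGE[method](dist(x, y) for x, y in product(cluster_a, cluster_b))
```

In agglomerative clustering, the two clusters with the smallest such distance are merged at each step, so the choice of linkage determines the shape of the resulting tree.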

Protein and Nucleotide Distances
We offer different distance measures for calculating pairwise nucleotide and protein sequence distances: the Felsenstein F84 distance and the Hamming distance for nucleotide sequences; the Dayhoff PAM matrix, the JTT matrix model, the PMB model, and Kimura's approximation for protein sequences, as implemented in the PHYLIP package; and the mPAM weight matrix for protein sequences.
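The simplest of these measures counts the positions at which two aligned sequences differ. The following sketch shows that count; treating a gap as an ordinary character and the optional normalization by length are assumptions of this example, not documented behavior of the tool.

```python
def hamming(a, b, normalize=False):
    """Number (or fraction) of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must cover the same alignment region")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    if normalize:
        return diffs / len(a) if a else 0.0
    return diffs

# hamming("ACGT", "ACGA") -> 1
```

Model-based distances such as F84 or JTT additionally correct for multiple substitutions at the same site and are best computed with PHYLIP itself.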

Tree Modification
An adaptive approach is used to visualize the tree in an aggregated form adapted to the user's screen, allowing users to interactively refine or aggregate the visualization of different parts of the tree (see a paper for details). A branch of the tree can be selected by clicking its root node, and the resolution of the selected branch can be changed by moving along the scale bar. Sequences in the tree can be searched by the fields in the database, and the resulting sequences or groups will be highlighted in green.

Tree Export
The complete tree can be exported in Newick format by clicking the "Download full tree" button. The downloaded tree can be displayed by many tree-viewing programs.
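For readers unfamiliar with Newick, the format writes each internal node as a parenthesized, comma-separated list of its children, optionally followed by a name and a branch length, and ends the tree with a semicolon. The sketch below serializes a small hand-built tree; the (name, length, children) tuple layout is an assumption of this example and says nothing about the exact labels in the exported file.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple tree to Newick."""
    def fmt(n):
        name, length, children = n
        text = "(" + ",".join(fmt(c) for c in children) + ")" if children else ""
        text += name
        if length is not None:
            text += f":{length:g}"
        return text
    return fmt(node) + ";"

tree = ("", None, [("A", 2, []), ("B", 3, []),
                   ("", 1, [("C", 4, []), ("D", 1.5, [])])])
# to_newick(tree) -> "(A:2,B:3,(C:4,D:1.5):1);"
```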

Sequence annotation
The Influenza Virus Sequence Annotation Tool is a web application for user-provided Influenza A virus and Influenza B virus sequences. It can predict protein sequences encoded by a flu sequence and produce a feature table that can be used for sequence submission to GenBank, as well as a GenBank flat file.

The type/segment/subtype of an input influenza sequence is first determined by BLAST, and the sequence is then aligned against a corresponding sample protein set with a "Protein to nucleotide alignment tool" (ProSplign). The translated product from the best alignment to a sample protein sequence is used as the predicted protein encoded by the input sequence.

Type/segment/subtype identification
An input sequence is searched by BLAST against a specialized influenza sequence database to determine the virus type (A or B), the segment (1 through 8) and, for the hemagglutinin and neuraminidase segments of Influenza A virus, the subtype. The database contains one reference sequence for each virus segment and each subtype of hemagglutinin and neuraminidase (available here). The top hit in the BLAST result is used to determine the virus type/segment/subtype of the input sequence.

Sample protein sequences
Representatives of published protein and mature peptide sequences for each virus segment and different subtypes for the hemagglutinin and neuraminidase segments of Influenza A virus are maintained on the server side (available in the PROTEIN-A and PROTEIN-B directories located here). For the segments that encode proteins with large variations in amino acid sequences and mature peptide cleavage sites, more than one protein could be chosen to be included. For example, this collection currently has 16 different protein samples for hemagglutinin of Influenza A virus. Based on the segment and subtype determined by the BLAST result, a subset of sample protein sequences is selected and aligned against the input sequence.

Protein to nucleotide alignment
A special global protein-to-nucleotide alignment tool, ProSplign, was designed to accurately annotate spliced genes and mature peptides of influenza viruses. ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.

Interpreting alignment result and creating outputs
A successful protein-to-nucleotide alignment should pass the following criteria:
1) The input sequence should start with a correct start codon (or span the beginning of input sequence in case of partial 5' end)
2) The input sequence should end with one of the stop codons (or span the end of input sequence in case of partial 3' end)
3) The input sequence should have no frameshifts or internal stop codons
4) The number of exons must be correct (2 for the second protein of segments 7 and 8 of Influenza A virus and segment 8 of Influenza B virus, 1 for all other segments/proteins)

If an alignment passes all four criteria above, the tool adopts the translated protein from the alignment as the protein prediction. Positions of the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, the tool continues by aligning the next sample protein from the reference subset. If none of the sample proteins produces an acceptable alignment, the best aligned sample protein (the one with the highest alignment score) is used to generate an error report.
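The selection logic described above can be summarized in a few lines. The Alignment record and its field names below are hypothetical stand-ins for whatever ProSplign reports; the sketch only illustrates the order of the checks and the fallback to the best-scoring alignment, and it assumes at least one alignment is available.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    """Hypothetical summary of one protein-to-nucleotide alignment."""
    score: float
    starts_ok: bool      # criterion 1: proper start (or partial 5' end)
    ends_ok: bool        # criterion 2: proper stop (or partial 3' end)
    frameshifts: int     # criterion 3
    internal_stops: int  # criterion 3
    exons: int           # criterion 4

def passes(aln, expected_exons):
    return (aln.starts_ok and aln.ends_ok and aln.frameshifts == 0
            and aln.internal_stops == 0 and aln.exons == expected_exons)

def pick_alignment(alignments, expected_exons):
    """Return (alignment, ok): the first one passing all four criteria,
    else the highest-scoring one, to be used for the error report."""
    for aln in alignments:
        if passes(aln, expected_exons):
            return aln, True
    return max(alignments, key=lambda a: a.score), False
```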

The first output of a successful annotation is a feature table, which is a five-column, tab-delimited table of feature locations and qualifiers. The tool also creates ASN.1, XML and GenBank-formatted views of the same annotation, using the following NCBI-developed utilities: tbl2asn and asn2xml.

Drug resistance prediction
The most common signature mutations that might confer drug resistance on the virus can also be detected and reported by this tool. Such mutations include L26F (e.g. CY009837), V27A (e.g. DQ186974), A30T (e.g. EU263348), S31N (e.g. DQ107508) and G34E (e.g. L25818) in the M2 protein, H274Y (e.g. DQ250165) and N294S (e.g. EF222322) in the N1 subtype of neuraminidase, and R292K (e.g. AY643089) and E119G/D/A/V (e.g. EU429720) in the N2 subtype of neuraminidase.
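Checking a protein for such signature mutations amounts to looking up the residue at each named position. The sketch below assumes the protein string is ungapped and already numbered in the same scheme as the mutation names (position 1 is the first character), which may require a prior alignment to a reference; the function name and marker list layout are illustrative.

```python
import re

M2_MARKERS = ["L26F", "V27A", "A30T", "S31N", "G34E"]

def find_markers(protein, markers):
    """Return the markers whose mutant residue is present in `protein`.

    A marker such as "E119G/D/A/V" lists several alternative mutant residues.
    """
    hits = []
    for marker in markers:
        wild, pos, mutants = re.fullmatch(r"([A-Z])(\d+)([A-Z/]+)", marker).groups()
        index = int(pos) - 1  # marker positions are 1-based
        if index < len(protein) and protein[index] in mutants.split("/"):
            hits.append(f"{wild}{pos}{protein[index]}")
    return hits
```

The check reports only the presence of the mutant residue; it does not verify that the wild-type residue is the expected one elsewhere in the lineage.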

Instructions
To use the tool, add one or more nucleotide sequences in FASTA format to the sequence box. Sequences can also be imported from a file by clicking the "Browse" button. After the "Annotate FASTA" button is clicked, a feature table for each input sequence is shown in a separate window, with the tables separated by a line of equal signs. A message showing the predicted segment, and the subtype for hemagglutinin and neuraminidase segments, is also displayed. Warning messages are shown along with the feature table if the input sequence lacks a start or stop codon or contains ambiguous bases. If a frameshift is found in a coding region, or a mutation introduces a stop codon within the coding region, no feature table is produced; an error message is shown instead, indicating the nature (insertion, deletion or mutation), the length and the location of the error. Other output formats (GenBank flat file, ASN.1, XML, protein FASTA and alignment) can be selected and shown in the browser or saved to files.

This annotation tool uses published influenza protein sequences as training sets, so it may not work as expected for some new sequence variations. Please report such cases to us so we can improve the tool.

How to cite the annotation tool
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Research. 2007 Jul 1;35(Web Server issue):W280-4.

FTP
Data in the NCBI Influenza Virus Sequence Database are available through FTP. The FTP directory contains the following files and their corresponding compressed versions, which are updated every day:

genomeset.dat - Table with supplementary genomeset data
influenza_na.dat - Table with supplementary nucleotide data
influenza_aa.dat - Table with supplementary protein data
influenza.dat - Table with nucleotide, protein and coding regions IDs
influenza.fna - FASTA nucleotide
influenza.cds - FASTA coding regions
influenza.faa - FASTA protein

The genomeset.dat file contains information for sequences of viruses with a complete set of full-length (or nearly full-length) segments. Sequences of the same virus are grouped together and separated by an empty line from those of other viruses.
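Because genome sets are separated by empty lines, the file can be split into per-virus groups with a few lines of code. This is a minimal sketch, assuming an already downloaded and uncompressed genomeset.dat; the function name is illustrative.

```python
from itertools import groupby

def genome_sets(lines):
    """Yield one list of tab-split rows per virus (blocks separated by blank lines)."""
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.rstrip("\n").split("\t") for line in block]

# Usage, assuming the file is in the current directory:
# with open("genomeset.dat") as handle:
#     for rows in genome_sets(handle):
#         print(len(rows), "segments")
```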

The genomeset.dat, influenza_na.dat and influenza_aa.dat files are tab-delimited tables which have the following fields:
GenBank accession number, Host, Genome segment number, Subtype, Country, Year, Sequence length, Virus name, Age, Gender. The influenza_na.dat and influenza_aa.dat files have an additional field in the last column to indicate if a sequence is full-length.
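The field order above maps naturally onto named columns. The sketch below reads such a table into dictionaries; the short key names are chosen for this example, the sample row in the comment is made up, and the exact values used in each column (for instance how the full-length flag is encoded) are not assumed.

```python
import csv

# Key names are illustrative; the order follows the field list above.
FIELDS = ["accession", "host", "segment", "subtype", "country", "year",
          "length", "virus_name", "age", "gender", "full_length"]

def read_dat(handle):
    """Yield one dict per non-empty row of a tab-delimited .dat file.

    genomeset.dat rows have ten columns, so they get no "full_length" key.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip the blank separator lines of genomeset.dat
            yield dict(zip(FIELDS, row))

# Usage, assuming a downloaded file:
# with open("influenza_na.dat", newline="") as handle:
#     h3n2 = [r for r in read_dat(handle) if r["subtype"] == "H3N2"]
```

All values are kept as strings; convert year and length to integers only after checking for empty fields.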

The influenza.dat file is a tab-delimited table which has the following fields:
GenBank accession number for nucleotide, GenBank accession number for protein, Identifier for protein coding region

A directory named "updates" contains daily updates for all of the above listed files in subdirectories for each date.

A directory named "ANNOTATION" contains reference sequences used in the Influenza Virus Sequence Annotation Tool. The file blastDB.fasta has one representative sequence for each type/segment/subtype of influenza viruses A and B, and it is used to build a specialized BLAST database for determining the type/segment/subtype of input influenza virus sequences. The PROTEIN-A and PROTEIN-B subdirectories each contain sample protein and mature peptide sequences used to annotate user-provided sequences.
