How to use

The NCBI Influenza Virus Sequence Database

The NCBI Influenza Virus Sequence Database contains nucleotide sequences of all influenza viruses in GenBank, as well as protein sequences and their encoding regions derived from the nucleotide sequences. The database is updated usually within a day or two after new sequences become available in GenBank. Information for database fields (subtype, segment, host, country, year etc.) is extracted automatically from GenBank records, and examined by NCBI staff. Information not available in GenBank records is obtained from the literature or through direct contact with sequence submitters, whenever possible.

Get sequences by accession

Nucleotide or protein sequences can be retrieved by entering a comma- or space-separated list of GenBank accession numbers, or by uploading a text file containing such a list, in the "Get sequences by accession" section. The sequences can be added to the "Query builder" by clicking the "Add query" button, or shown directly by clicking the "Show results" button.
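
If you prepare the accession list with a script before pasting or uploading it, the list can be tidied first. The following is a minimal sketch, not the database's own parser: the function name parse_accessions is hypothetical, and the accessions other than ADA83577 are made-up placeholders.

```python
import re

def parse_accessions(text):
    """Split a comma- and/or whitespace-separated list of accessions.

    Empty tokens are dropped and duplicates are removed while the
    original order is preserved.
    """
    seen = set()
    result = []
    for token in re.split(r"[,\s]+", text.strip()):
        if token and token not in seen:
            seen.add(token)
            result.append(token)
    return result

# ADA83577 appears in this help page; the other two are placeholders.
raw = "ADA83577, XXX00001 XXX00002,\nADA83577"
print(parse_accessions(raw))
```

Writing the result one accession per line gives a text file in a form that is also comma- or space-separated in the sense described above.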

To search the database using other terms, first decide whether you would like to search for protein sequences, their coding regions, or nucleotide sequences, by checking the radio buttons to the left of the sequence type names.

Search for keyword

The "Search for keyword" section allows users to search for sequences by (1) a word or phrase in virus strain names (e.g. New York); (2) a pattern in nucleotide or protein sequences (e.g. AGCGAAAGCAGGGGT or RSKV); and (3) drug-resistance mutations in protein sequences (e.g. S31N or H274Y). A list of mutations annotated in the database can be found here.
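
To illustrate what a mutation term such as S31N encodes (reference residue S, position 31, variant residue N), here is a small sketch for checking downloaded protein sequences locally. It assumes the position is counted directly along the given sequence, starting at 1; the database's own annotation may use a different reference numbering, so treat the function has_substitution as a hypothetical helper rather than a reproduction of the database search.

```python
import re

def has_substitution(protein, mutation):
    """Return True if `protein` carries the variant residue of `mutation`.

    `mutation` has the form <reference residue><position><variant residue>,
    e.g. "S31N". Positions are 1-based along `protein` (an assumption).
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"not a substitution of the form S31N: {mutation!r}")
    position = int(match.group(2))
    variant = match.group(3)
    if position < 1 or position > len(protein):
        return False
    return protein[position - 1] == variant

# Toy sequence with N at position 31 (not a real M2 protein).
toy = "A" * 30 + "N" + "A" * 5
print(has_substitution(toy, "S31N"))
```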

Define search set

In the "Define search set" section, select one or more names from each of the lists provided (hold the Ctrl or Shift key to select several), and/or fill in the boxes. The fields are virus type (e.g. Influenza virus A, B or C), host (e.g. Human or Avian), country/region (e.g. Australia or Asia), the year (or range of years) in which the viruses were isolated, segment (1 through 8) or protein name (e.g. PB1-F2 or M1), subtype (e.g. H3N2 or H5), and a range of sequence lengths.

Restrict your search to full-length sequences

You can limit your search results to full-length sequences by checking the appropriate boxes. "Full-length only" applies to sequences that have complete coding regions including start and stop codons; these are labelled "c" (for complete) in the database query result. "Full-length plus" applies to all "Full-length only" sequences, plus those missing only the start and/or stop codons, which are labelled "nc" (for nearly complete) in the database query result. Partial sequences are labelled "p" in the database query result.

Choose collection and release dates

Month and day can be added in addition to year. Please note that not all sequences have month and day available. Therefore, sequences with only a year as collection date will not be included in a search if a month of the corresponding year is entered in the query. For example, a search for sequences from 2006/05 to 2008/11 will retrieve those with a month in the collection date for 2006 and 2008, but not those with only 2006 or 2008 as collection date (because they could be from 2006/04 or 2008/12). However, all sequences from 2007, with or without a month in the collection date, will be included in such a query. Check the boxes next to "Month" or "Day" under "Collection date must contain" if you want to retrieve only sequences with a month or day in the collection date.

Release date is the date when a sequence first appeared in GenBank.
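
The behavior in the 2006/05 to 2008/11 example can be reproduced with a simple rule: a record is included only if every date it could denote falls inside the query range. The sketch below works at month resolution and is one interpretation of the description above, not the database's actual implementation; the names month_span and in_range are hypothetical.

```python
def month_span(year, month=None):
    """Earliest and latest (year, month) that a collection date could denote."""
    if month is None:
        # Only the year is known: the sample could be from any month.
        return (year, 1), (year, 12)
    return (year, month), (year, month)

def in_range(record, start, end):
    """True if every month the record could denote lies within [start, end].

    `record` is (year,) or (year, month); `start` and `end` are (year, month).
    """
    earliest, latest = month_span(*record)
    return start <= earliest and latest <= end

query = ((2006, 5), (2008, 11))
for record in [(2006,), (2006, 6), (2007,), (2008, 11), (2008,)]:
    print(record, in_range(record, *query))
```

With this rule a year-only 2007 record is kept, because all twelve possible months fall inside the range, while year-only 2006 and 2008 records are dropped.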

Required segments

Check boxes next to the segment/protein names under "Required segments" to retrieve sequences defined in the "Segment/Protein" field when all of the selected segments of the same virus isolate exist in the database. Check the "Full-length only" box in this section if the required segments must be full-length.
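
As a sketch of the logic, the filter keeps a sequence of the requested segment only when its isolate also has every required segment in the dataset. The record layout (strain, segment, accession), the strain names, and the accessions below are hypothetical; isolates are matched here by strain name, which is an assumption.

```python
from collections import defaultdict

def with_required_segments(records, wanted, required):
    """Keep records of segment `wanted` whose isolate has all `required` segments.

    `records` is a list of (strain, segment, accession) tuples.
    """
    segments_by_strain = defaultdict(set)
    for strain, segment, _accession in records:
        segments_by_strain[strain].add(segment)
    needed = set(required)
    return [
        rec for rec in records
        if rec[1] == wanted and needed <= segments_by_strain[rec[0]]
    ]

records = [
    ("A/toy/1/2009", 4, "X1"),
    ("A/toy/1/2009", 6, "X2"),
    ("A/toy/2/2009", 4, "X3"),  # this isolate has no segment 6
]
print(with_required_segments(records, wanted=4, required=[4, 6]))
```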

Full Genome Sets

An interface has been developed with pre-set values for downloading full genome sets: Full Genome Sets Interface. This interface can be used to obtain flu sequences ordered by genome segments, and all of the format options for “Download results” will be grouped. Downloaded nucleotide sets will be in the order Genome #1 segment 1 through 8, Genome #2 segment 1 through 8, etc. The order for downloaded protein sets is PB2, PB1, PB1-F2, PA-X, PA, HA, NP, NA, M1, M2, NS1, NS2.
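
If you need to restore this ordering after processing a downloaded protein set locally, a sort keyed on genome and then on the protein order listed above is enough. This is a minimal sketch: the record layout (genome number, protein name, accession) and the accessions are hypothetical.

```python
# Protein order as stated in this help page.
PROTEIN_ORDER = ["PB2", "PB1", "PB1-F2", "PA-X", "PA", "HA",
                 "NP", "NA", "M1", "M2", "NS1", "NS2"]
RANK = {name: index for index, name in enumerate(PROTEIN_ORDER)}

def order_protein_set(records):
    """Sort (genome_number, protein_name, accession) records genome by genome."""
    return sorted(records, key=lambda rec: (rec[0], RANK[rec[1]]))

shuffled = [(2, "HA", "P4"), (1, "NS1", "P3"), (1, "PB2", "P1"), (1, "HA", "P2")]
for record in order_protein_set(shuffled):
    print(record)
```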

Include/exclude pandemic (H1N1) viruses

From a drop-down menu next to "Pandemic (H1N1) viruses" (also known as the swine flu outbreak), you can include, exclude or retrieve only these sequences in your search results. Newly released sequences can be retrieved from the database by defining the GenBank release date. For example, A(H1N1)pdm09 virus sequences released in GenBank between June 30 and July 6, 2010 can be retrieved using this database query.

Include/exclude sequences from the FLU project

From a drop-down menu next to "The FLU project", you can include, exclude or retrieve only these sequences in your search results. Sequences from the FLU project are those submitted to GenBank through a streamlined GenBank submission pipeline. These are mostly from large-scale flu genome sequencing projects, which usually provide complete genomes, detailed source information and high-quality annotations. Currently, the major contributors are the NIAID Influenza Genome Sequencing Project, the St. Jude Influenza Genome Project, the Centers for Disease Control and Prevention, Centers of Excellence for Influenza Research and Surveillance (CEIRS), and the University of Hong Kong.

Include/exclude lab strains

Sequences of reassortments or lab strains (those flagged as "LAB" in the country field) are excluded from the search by default. Use the drop-down menu next to "Lab strains" if you want to include or retrieve only those sequences.

Include/exclude vaccine strains

From a drop-down menu next to "Vaccine strains", you can include, exclude or retrieve only sequences of WHO recommended vaccine strains in your search results.

Include/exclude viruses of well-defined lineages

From a drop-down menu next to "Lineage defining strains", you can include, exclude or retrieve only sequences of prototype viruses of well-defined lineages/clades. Currently, this includes those for Influenza B viruses (Victoria and Yamagata), and the H5N1 and H9N2 subtypes of Influenza A viruses.

Collapse identical sequences

By checking the box next to "Collapse identical sequences", all groups of identical sequences in a dataset will be represented by the oldest sequence in the group. This will reduce the number of sequences in some cases by keeping only unique sequences in a dataset.
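
The effect of collapsing can be sketched as grouping records by exact sequence and keeping the one with the earliest date in each group. The source does not say which date defines "oldest", so the date field below (an ISO-formatted string, which sorts chronologically) is an assumption, as are the field names and accessions.

```python
def collapse_identical(records):
    """Keep one record per distinct sequence: the one with the earliest date.

    Each record is a dict with "accession", "date" (ISO string) and "sequence".
    """
    oldest = {}
    for rec in records:
        key = rec["sequence"].upper()
        if key not in oldest or rec["date"] < oldest[key]["date"]:
            oldest[key] = rec
    return sorted(oldest.values(), key=lambda rec: rec["date"])

records = [
    {"accession": "X1", "date": "2009-06-05", "sequence": "ATGGCA"},
    {"accession": "X2", "date": "2009-04-20", "sequence": "ATGGCA"},
    {"accession": "X3", "date": "2009-05-01", "sequence": "ATGGCT"},
]
print([rec["accession"] for rec in collapse_identical(records)])
```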

Build query/multiple queries

After clicking the "Add Query" button, the query you selected and the number of resulting sequences will be shown in "Query Builder". If "any" is selected in "Virus Species" and/or "Segment", a warning message will be shown, and the "Multiple sequence alignment" and "Tree building" functionalities will not be available in the subsequent steps when the resulting dataset contains sequences from different virus species and/or different segments. A sample query page can be found here.

Multiple queries can be built by repeating the above steps. When a different "Virus Species" and/or "Segment" is selected in the new query, the same warning message described above will be shown, and the "Multiple sequence alignment" and "Tree building" functionalities will not be available in the subsequent steps if the resulting dataset contains sequences from different virus species and/or different segments. When a different sequence type (i.e. Protein sequence, Coding region or Nucleotide sequence) is selected for the new query, a pop-up window will ask whether you want to start a new query with the new sequence type (which will clear the current "Query Builder"), or continue with the current sequence type by going back to the current query builder. This prevents mixing different sequence types in the same "Query Builder" (e.g. protein sequences with nucleotide sequences). Any combination of queries in the "Query Builder" can be selected to get sequences from the database.

Show results

Sequences found by the selected queries will be shown in a separate window once you click the "Show results" button. By default, the sequences are ordered by virus name. They can be reordered by up to three fields sequentially, by holding the Ctrl or Shift key while clicking on field headers. A sample results page can be found here.

Customize FASTA defline

The corresponding protein, coding region or nucleotide sequences of the selected sequences can be downloaded by selecting the appropriate name in the "Download results" drop-down menu. To meet the needs of different users, the definition line of the FASTA sequences in the downloaded files can be customized by clicking "Customize FASTA defline". The default defline is in the format ">{accession} {strain} {year}/{month}/{day} {segname}" (e.g. >ADA83577 A/Argentina/HNRG13/2009 2009/06/05 PB2), but you can add any field by clicking one of those listed, or remove any by deleting it from the Defline editing box. A space is inserted between fields by default, but it can be replaced with other characters by typing in the editing box. When the "Remember changes" box is checked, the defline format you defined will be remembered and used in all subsequent downloads, until it is reset or cookies are deleted in the browser. A list of GenBank accession numbers for selected protein or nucleotide sequences, and a table of the search result in XML, CSV or tab-delimited format, can also be downloaded from the "Download results" menu.
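
The defline template uses the same brace syntax as Python format strings, so the way fields are substituted can be illustrated as follows. This sketch only mimics the substitution locally; the function format_defline is hypothetical, and the field values are taken from the example defline above.

```python
DEFAULT_DEFLINE = ">{accession} {strain} {year}/{month}/{day} {segname}"

def format_defline(record, template=DEFAULT_DEFLINE):
    """Fill a defline template from a dict of field values."""
    return template.format(**record)

record = {
    "accession": "ADA83577",
    "strain": "A/Argentina/HNRG13/2009",
    "year": "2009",
    "month": "06",
    "day": "05",
    "segname": "PB2",
}

print(format_defline(record))
# A customized template with "|" as separator and fewer fields:
print(format_defline(record, ">{strain}|{accession}"))
```

Separators other than a space are often easier to split on afterwards, because strain names may themselves contain spaces.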

Add your own sequences

Further analysis of the selected sequences can be performed by clicking the "Do multiple alignment" or "Build a tree" button, if they are available (i.e. the dataset does not mix species and/or segments). Your own sequences (of the same sequence type, in FASTA format) can be added to the selected sequences for analysis by clicking the "Add your own sequences" button. The file of added sequences cannot exceed 128 KB.

Multiple sequence alignment

Build multiple alignment

Multiple alignments of nucleotide or protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file can be obtained using the MUSCLE program. Start the alignment here. This will open a database query interface similar to the one described above. Please follow the instructions for database queries and be sure to select sequences from the same segment of the genome, preferably of similar sizes.

At most 1,000 sequences can be included in an alignment. For datasets larger than 1,000 sequences, it is recommended to download the sequences using the download tool of the database, and run the multiple sequence alignment using a locally installed program (e.g. MUSCLE).

After sequences of interest are selected from the database and/or added from an input file, click the "Do multiple alignment" button to get the alignment. The consensus sequence is displayed at the top of the alignment; positions identical to the consensus are shown as dots and gaps are shown as dashes. In the coding region alignment, non-synonymous changes (in triplets) are highlighted in a different background color. The alignment can be navigated horizontally either by typing the position you would like the sequences to start from in the text box after "Go to position" and clicking "Go", or by moving the scroll bar at the bottom of the alignment. When a sequence in the alignment is clicked, a small window pops up.

The GenBank record for the sequence can be opened by clicking the accession number in the pop-up window. The sequence can also be selected to perform BLAST 2 Sequences (click the "BLAST 2 seq." button after two different sequences are selected from the alignment). By clicking the "Select for anchor" option in the pop-up window, the consensus sequence will be replaced by the selected sequence. When the anchor sequence is clicked, a small window with options pops up. The anchor sequence can be reset to the consensus sequence, and the anchor/consensus sequence can be displayed for copying.
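
The consensus/anchor display described above can be approximated as follows. This is a sketch under stated assumptions: the consensus is taken as the most frequent character in each column (ties go to the character seen first, and gaps are counted like any other character), which may differ from the rule the viewer actually uses. The function names are hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent character in each column of equal-length aligned sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )

def dot_view(aligned, reference):
    """Replace residues identical to `reference` with dots; keep gaps as dashes."""
    return [
        "".join("." if ch == ref and ch != "-" else ch
                for ch, ref in zip(seq, reference))
        for seq in aligned
    ]

aligned = ["ACGT-A", "ACGTTA", "ATGTTA"]
reference = consensus(aligned)
print(reference)
for row in dot_view(aligned, reference):
    print(row)
```

Passing one of the aligned sequences as reference instead of the consensus corresponds to the "Select for anchor" option.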

Download alignment

The multiple alignment file in FASTA format can be downloaded by selecting "Download alignment". A printer-friendly version of the alignment can be obtained by clicking the "Print-friendly version" button. If desired, click the "Build a tree" button to build a tree from the aligned sequences.

Phylogenetic trees

Methodology

DatasetExplorer is an interactive tool in the NCBI Influenza Virus Resource that provides an easy way to perform preliminary analysis of nucleotide and protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file. Datasets are visually represented using phylogenetic/clustering trees. Users can select the algorithm used for building a tree as well as the similarity criterion.

Start the tool by clicking here. Sequences are acquired from the NCBI Influenza Virus Sequence Database or uploaded by a user as described above. After a dataset has been selected, the sequences are aligned using a multiple alignment algorithm, in order to identify common regions in the sequences and establish correspondence between sequence columns (we perform multiple protein alignment, while alignment of the nucleotide sequences for the coding regions is induced by the protein alignment). Distances between sequences are calculated based on their dissimilarity in a selected region on the alignment, and analysis is performed. We offer visualization based on phylogenetic and clustering tree methods: the classical neighbor-joining method and agglomerative hierarchical clustering methods.

Alignment of protein sequences is performed using the protein multiple alignment tool MUSCLE. We offer different distance measures for calculating pairwise distances between sequences. In particular, we use some of the distances implemented in the PHYLIP package, as well as the mPAM weight matrix.

Sequence alignment

The tool performs multiple protein alignments using the MUSCLE program and creates nucleotide alignment of the corresponding coding regions from protein alignment by using codon-amino acid correspondence.
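
The codon-amino acid correspondence can be sketched as follows: each residue in the aligned protein is replaced by its codon, and each gap by three dashes. This is a minimal illustration, assuming the coding sequence contains exactly three nucleotides per residue with no trailing stop codon; the function name is hypothetical and the tool's own implementation may differ.

```python
def codon_alignment(aligned_protein, coding_seq):
    """Thread an unaligned coding sequence onto its aligned protein sequence."""
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(coding_seq) != 3 * residues:
        raise ValueError("coding sequence length must be 3 x number of residues")
    pieces = []
    position = 0
    for aa in aligned_protein:
        if aa == "-":
            pieces.append("---")          # a protein gap spans one codon
        else:
            pieces.append(coding_seq[position:position + 3])
            position += 3
    return "".join(pieces)

print(codon_alignment("M-K", "ATGAAA"))
```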

After sequences are obtained from the NCBI Influenza Virus Sequence Database and/or a user's input file, click the "Build a tree" button on the database query results page to start the process. This will open a window with a graphic view of the multiple sequence alignment.

Sequence region selection

The graphic view of the multiple alignments of sequences selected from the previous step is displayed. The black and red colors in the graphics represent the presence and absence of amino acid residues at the corresponding positions. The positions in the longest sequence of the selected set for the first and last amino acid of each sequence are shown. A histogram showing the total number of amino acid residues at each position is displayed at the top of the page. The program automatically selects the sequence region to be analyzed so that the majority of the sequences in the set will be included. The sequence region can also be defined by users by first selecting all sequences in the set, and then entering the start and end positions in the boxes provided. When clicking the "Select sequences" button, the region from sequences that have complete coverage between the two positions will be selected, and sequences excluded from the selection will be highlighted with a background color in the graphic view.
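
The selection rule for a user-defined region can be sketched as a coverage test: a sequence is kept only if its first residue is at or before the start position and its last residue is at or after the end position. The sketch assumes each sequence covers one contiguous stretch described by its first and last positions, as shown in the graphic view; the names and numbers are hypothetical.

```python
def select_covering(coverage, start, end):
    """Split sequences into those that fully cover [start, end] and the rest.

    `coverage` maps a sequence name to its (first, last) alignment positions.
    """
    selected = [
        name for name, (first, last) in coverage.items()
        if first <= start and last >= end
    ]
    excluded = [name for name in coverage if name not in selected]
    return selected, excluded

coverage = {"s1": (1, 560), "s2": (20, 540), "s3": (1, 300)}
selected, excluded = select_covering(coverage, start=17, end=345)
print("selected:", selected)
print("excluded:", excluded)
```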

Phylogenetic/clustering tree

A clustering or phylogenetic tree can be built by selecting one of the clustering algorithms and a distance calculating method from the list, and clicking the "Next step" button.

Sequences of interest can be highlighted in the tree, and they can be selected or deselected using the check boxes to the right of each sequence.

Distance methods approximating minimum evolution

Method	Description
Neighbor-Joining	At each step, the pair (i, j) with the smallest value of D_ij - b_i - b_j is chosen, where D_ij is the distance between nodes i and j, n is the current number of nodes, and b_i = (∑_k D_ik) / (n - 2). The distance between the new node u and each remaining node k is defined as D_uk = (D_ik + D_jk - D_ij) / 2. Branch lengths are defined as v_ui = (D_ij + b_i - b_j) / 2 and v_uj = (D_ij + b_j - b_i) / 2 (negative lengths are truncated to zero).
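
One joining step, following the formulas in the table, can be sketched as below. This is an illustrative implementation rather than the tool's code: the function names are hypothetical, the distance matrix is a small made-up example, and at least three nodes are required because of the division by n - 2.

```python
from itertools import combinations

def make_distances(names, matrix):
    """Store a symmetric matrix as {frozenset({a, b}): distance}."""
    return {
        frozenset((names[i], names[j])): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }

def nj_step(names, dist):
    """Choose the pair to join; return it with branch lengths and new distances."""
    n = len(names)

    def d(x, y):
        return dist[frozenset((x, y))]

    # b_i = (sum over k of D_ik) / (n - 2)
    b = {i: sum(d(i, k) for k in names if k != i) / (n - 2) for i in names}
    # Pair minimizing D_ij - b_i - b_j (first pair wins on ties).
    i, j = min(combinations(names, 2),
               key=lambda pair: d(*pair) - b[pair[0]] - b[pair[1]])
    # Branch lengths to the new node u, truncated at zero.
    v_ui = max(0.0, (d(i, j) + b[i] - b[j]) / 2)
    v_uj = max(0.0, (d(i, j) + b[j] - b[i]) / 2)
    # Distances from u to every remaining node k.
    d_uk = {k: (d(i, k) + d(j, k) - d(i, j)) / 2
            for k in names if k not in (i, j)}
    return (i, j), (v_ui, v_uj), d_uk

names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, branches, new_distances = nj_step(names, make_distances(names, matrix))
print(pair, branches, new_distances)
```

A full tree is obtained by replacing the joined pair with the new node u, using the returned distances, and repeating until only two nodes remain.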

Agglomerative hierarchical clustering methods

Method	Alternative name	Distance between clusters defined as:
Average Linkage	UPGMA	Average distance between pairs of objects, one in one cluster, one in the other
Complete Linkage	Furthest Neighbor	Maximum distance between pairs of objects, one in one cluster, one in the other
Single Linkage	Nearest Neighbor	Minimum distance between pairs of objects, one in one cluster, one in the other
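
The three definitions in the table differ only in how the pairwise distances between two clusters are summarized, as the following sketch shows. The function name and the toy data (points on a line, with absolute difference as distance) are hypothetical.

```python
def cluster_distance(cluster_a, cluster_b, distance, linkage):
    """Distance between two clusters under the given linkage rule."""
    pairwise = [distance(x, y) for x in cluster_a for y in cluster_b]
    if linkage == "average":      # UPGMA
        return sum(pairwise) / len(pairwise)
    if linkage == "complete":     # furthest neighbor
        return max(pairwise)
    if linkage == "single":       # nearest neighbor
        return min(pairwise)
    raise ValueError(f"unknown linkage: {linkage!r}")

def line_distance(x, y):
    return abs(x - y)

a, b = [0, 1], [4, 6]
for linkage in ("average", "complete", "single"):
    print(linkage, cluster_distance(a, b, line_distance, linkage))
```

At each agglomeration step the two clusters with the smallest such distance are merged.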

Protein and nucleotide distances

We offer different distance measures for calculating nucleotide and protein pairwise sequence distances: the Felsenstein F84 distance and the Hamming distance for nucleotide sequences; the Dayhoff PAM matrix, the JTT matrix model, the PMB model, and Kimura's approximation for protein sequences, as implemented in the PHYLIP package; and the mPAM weight matrix for protein sequences.
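
Of these measures, the Hamming distance is the simplest: it counts the aligned positions at which two sequences differ. The sketch below treats a gap like any other character, which is an assumption; the tool may handle gaps differently.

```python
def hamming(seq_a, seq_b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

print(hamming("ACGTTA", "ATGTTA"))
```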

Tree modification

An adaptive approach is used to visualize the tree in an aggregated form adapted to the user's screen, allowing users to interactively refine or aggregate the visualization of different parts of the tree (see a paper for details). A branch of the tree can be selected by clicking its root node, and the resolution of the selected branch can be changed by moving along the scale bar. The GenBank accession numbers of the sequences in the selected branch of a tree can be exported by clicking the "Download accessions" button under the scale bar. Sequences on the tree can be searched by the fields in the database, and the resulting sequences or groups will be highlighted in green.

Tree export

The complete tree can be exported in the Newick format by clicking the "Download full tree" button. The downloaded tree can be displayed by many tree-viewing programs.
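
For readers unfamiliar with the Newick format, the sketch below writes a small tree in that notation: each internal node is a parenthesized, comma-separated list of children, each child is followed by a colon and its branch length, and the tree ends with a semicolon. The nested-list tree representation and the function name are hypothetical, chosen only for this illustration.

```python
def to_newick(node):
    """Serialize a tree to Newick (without the trailing semicolon).

    A node is either a leaf name (str) or a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return node
    children = ",".join(
        f"{to_newick(child)}:{length}" for child, length in node
    )
    return f"({children})"

tree = [("a", 2), ("b", 3), ([("c", 1), ("d", 1)], 4)]
print(to_newick(tree) + ";")
```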
