NCBI Branchiostoma belcheri Annotation Release 100

The RefSeq genome records for Branchiostoma belcheri were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Branchiostoma belcheri Annotation Release 100

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Dec 23 2016
Date of submission of annotation to the public databases: Dec 28 2016
Software version: 7.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
Haploidv18h27	GCF_001625305.1	Dr. Anlong Xu's lab	04-21-2016	Reference	1 assembled chromosome; unplaced scaffolds

Gene and feature statistics

Counts and length of annotated features are provided below for each assembly.

Feature counts

Feature	Haploidv18h27
Genes and pseudogenes	26,832
protein-coding	23,855
non-coding	1,601
pseudogenes	1,376
genes with variants	4,648
mRNAs	34,662
fully-supported	27,753
with > 5% ab initio	3,439
partial	690
with filled gap(s)	4
known RefSeq (NM_)	0
model RefSeq (XM_)	34,662
Other RNAs	2,030
fully-supported	1,624
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1,624
CDSs	34,662
fully-supported	27,753
with > 5% ab initio	3,794
partial	689
with major correction(s)	1,086
known RefSeq (NP_)	0
model RefSeq (XP_)	34,662
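
As a quick consistency check on the table above, the top-level gene categories can be summed and compared with the reported "Genes and pseudogenes" total. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the numbers are copied from the Haploidv18h27 column.

```python
# Counts copied from the Feature counts table (Haploidv18h27 column).
feature_counts = {
    "protein-coding": 23_855,
    "non-coding": 1_601,
    "pseudogenes": 1_376,
}
reported_total = 26_832  # "Genes and pseudogenes" row


def category_sum(counts):
    """Return the sum of the per-category gene counts."""
    return sum(counts.values())


total = category_sum(feature_counts)
print(total, total == reported_total)
```

The same pattern could be applied to other rows, provided the sub-categories being summed are known to be mutually exclusive.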

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	25,456	11,024	5,974	71	309,005
All transcripts	36,692	2,628	2,037	71	73,641
mRNA	34,662	2,734	2,123	180	73,641
misc_RNA	301	2,300	1,843	122	10,977
tRNA	406	74	73	71	84
lncRNA	1,323	703	514	96	6,126
Single-exon transcripts	1,381	1,642	1,394	270	10,869
coding transcripts (NM_/XM_ )	1,381	1,642	1,394	270	10,869
CDSs	34,662	1,938	1,413	177	72,189
Exons	243,291	250	140	1	13,525
in coding transcripts (NM_/XM_ )	238,947	250	140	1	13,525
in non-coding transcripts (NR_/XR_ )	5,848	229	123	2	5,790
Introns	217,622	1,180	462	30	200,091
in coding transcripts (NM_/XM_ )	214,690	1,174	461	30	200,091
in non-coding transcripts (NR_/XR_ )	4,392	1,600	507	30	87,510

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.45	1	1	50
Number of exons per transcript	11.21	8	1	398

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 23,855 coding genes, 16,571 genes had a protein with an alignment covering 50% or more of the query and 3,665 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
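
The two coverage definitions above can be illustrated with a minimal sketch. It assumes a single alignment described by 1-based, inclusive start and end coordinates on each sequence; the function name and the example coordinates are hypothetical, and the pipeline's own calculation (for example, how multiple alignment segments are combined) may differ.

```python
def coverage_percent(aligned_start, aligned_end, sequence_length):
    """Percentage of a sequence included in one alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-410 of a 500-residue annotated protein
# (query) aligned to residues 1-400 of a 420-residue curated protein (target).
query_cov = coverage_percent(11, 410, 500)
target_cov = coverage_percent(1, 400, 420)
print(f"query coverage:  {query_cov:.1f}%")
print(f"target coverage: {target_cov:.1f}%")
```

In this made-up case the alignment would count toward the 50% query-coverage threshold but not the 95% one, even though it covers more than 95% of the target.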

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
Haploidv18h27	GCF_001625305.1	1.71%	27.05%
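
A masked percentage of this kind can be understood as masked bases divided by total bases. The sketch below assumes a soft-masked sequence, in which masked bases are written in lowercase; that representation, the function name and the toy sequence are assumptions for illustration, not a description of how the pipeline computes the figures in the table.

```python
def percent_masked(sequence):
    """Percentage of bases that are soft-masked (lowercase)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower())
    return 100.0 * masked / len(sequence)


# Toy 20-base sequence with 6 lowercase (masked) bases.
toy_sequence = "ACGTacgtacGTACGTACGT"
print(f"{percent_masked(toy_sequence):.2f}% masked")
```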

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	518	484 (93.44%)	147 (28.38%)	97.14%	97.67%
Same-species EST	24,838	20,967 (84.42%)	16,894 (68.02%)	97.95%	99.40%
Branchiostomidae GenBank	776	699 (90.08%)	267 (34.41%)	91.08%	91.95%
Branchiostomidae EST	334,502	89,561 (26.77%)	60,033 (17.95%)	91.10%	97.92%
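
The percentages in parentheses are consistent with each count being divided by the number of sequences retrieved from Entrez, for both the "aligned" and the "passed to Gnomon" columns. A small sketch using the first row of the table (the function name is hypothetical):

```python
def percent_of_retrieved(count, retrieved):
    """Express a count as a percentage of the retrieved sequences, to 2 decimals."""
    return round(100.0 * count / retrieved, 2)


# First row of the Transcript alignments table.
retrieved, aligned, passed_to_gnomon = 518, 484, 147

aligned_pct = percent_of_retrieved(aligned, retrieved)
passed_pct = percent_of_retrieved(passed_to_gnomon, retrieved)
print(aligned_pct, passed_pct)
```

Note that the "passed to Gnomon" percentage uses the retrieved count, not the aligned count, as its denominator.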

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	1,349,969,933	49%	16%	248,812
SAMN00000122	intestine (Branchiostoma belcheri, SAMN00000122)	393,710	45%	10%	16,701
SAMN00854329	The Transcripts from the gill tissue of Branchiostoma belcheri (Branchiostoma belcheri, SAMN00854329)	476,739	46%	20%	36,823
SAMN00854330	The Transcripts from the embryo of Branchiostoma belcheri (Branchiostoma belcheri, SAMN00854330)	1,097,418	79%	3%	31,414
SAMN00854331	The Transcripts from the liver tissue of Branchiostoma Belcheri (Branchiostoma belcheri, SAMN00854331)	451,959	61%	8%	24,550
SAMN02194674	embryo (Branchiostoma belcheri, SAMN02194674)	525,973,634	39%	12%	236,063
SAMN02342536	multiple adult lancelets (Branchiostoma belcheri, SAMN02342536)	529,303,357	59%	19%	241,356
SAMN02582423	digestive tract (Branchiostoma belcheri, SAMN02582423)	175,533,326	51%	18%	213,647
SAMN05188724	whole organism (Branchiostoma belcheri, male, SAMN05188724)	1,769,412	42%	14%	58,061
SAMN05188736	whole organism (Branchiostoma belcheri, pooled male and female, SAMN05188736)	51,081,486	37%	8%	94,799
SAMN05188738	whole organism (Branchiostoma belcheri, pooled male and female, SAMN05188738)	63,888,892	57%	14%	183,727

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR001033	SRX000197	SRP000126	SAMN00000122	170,607	48%	11%
SRR001032	SRX000198	SRP000126	SAMN00000122	223,103	43%	8%
SRR458512	SRX137009	SRP012113	SAMN00854329	476,739	46%	20%
SRR458514	SRX137010	SRP012113	SAMN00854330	594,167	79%	3%
SRR458515	SRX137010	SRP012113	SAMN00854330	503,251	79%	3%
SRR458513	SRX137015	SRP012113	SAMN00854331	451,959	61%	8%
SRR892741	SRX299418	SRP025148	SAMN02194674	41,330,678	69%	22%
SRR892753	SRX299418	SRP025148	SAMN02194674	48,702,570	72%	23%
SRR892758	SRX299418	SRP025148	SAMN02194674	74,707,328	27%	9%
SRR892759	SRX299418	SRP025148	SAMN02194674	71,810,316	34%	12%
SRR892760	SRX299418	SRP025148	SAMN02194674	73,751,022	30%	10%
SRR892761	SRX299418	SRP025148	SAMN02194674	74,558,060	23%	6%
SRR892762	SRX299418	SRP025148	SAMN02194674	72,776,834	32%	11%
SRR892771	SRX299418	SRP025148	SAMN02194674	68,336,826	49%	16%
SRR964432	SRX344152	SRP029462	SAMN02342536	476,739	46%	20%
SRR964433	SRX344152	SRP029462	SAMN02342536	451,959	61%	8%
SRR964434	SRX344153	SRP029462	SAMN02342536	594,167	79%	3%
SRR964435	SRX344153	SRP029462	SAMN02342536	503,251	79%	3%
SRR964436	SRX344154	SRP029462	SAMN02342536	665,046	69%	15%
SRR964438	SRX344154	SRP029462	SAMN02342536	638,561	67%	13%
SRR964444	SRX344155	SRP029462	SAMN02342536	73,751,022	61%	20%
SRR964474	SRX344156	SRP029462	SAMN02342536	41,330,678	69%	22%
SRR964578	SRX344156	SRP029462	SAMN02342536	68,336,826	49%	16%
SRR964579	SRX344156	SRP029462	SAMN02342536	48,702,570	72%	23%
SRR964580	SRX344156	SRP029462	SAMN02342536	74,707,328	54%	18%
SRR964581	SRX344156	SRP029462	SAMN02342536	71,810,316	69%	25%
SRR964582	SRX344156	SRP029462	SAMN02342536	74,558,060	45%	13%
SRR964583	SRX344156	SRP029462	SAMN02342536	72,776,834	64%	22%
SRR1107642	SRX425456	SRP035372	SAMN02582423	33,176,258	60%	26%
SRR1107643	SRX425456	SRP035372	SAMN02582423	26,969,068	61%	27%
SRR1107644	SRX425456	SRP035372	SAMN02582423	28,805,778	48%	11%
SRR1107645	SRX425456	SRP035372	SAMN02582423	42,785,418	47%	19%
SRR1107646	SRX425456	SRP035372	SAMN02582423	43,796,804	45%	10%
SRR3608004	SRX1809047	SRP075895	SAMN05188724	1,769,412	42%	14%
SRR3608181	SRX1809128	SRP075895	SAMN05188736	51,081,486	37%	8%
SRR3608540	SRX1809295	SRP075895	SAMN05188738	63,888,892	57%	14%
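
The per-sample read counts correspond to the sum of the reads over the runs that belong to each sample. The sketch below shows that aggregation for a few rows copied from the by-run table; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (run, sample, number of reads) for a subset of rows from the by-run table.
runs = [
    ("SRR001033", "SAMN00000122", 170_607),
    ("SRR001032", "SAMN00000122", 223_103),
    ("SRR458514", "SAMN00854330", 594_167),
    ("SRR458515", "SAMN00854330", 503_251),
]


def reads_per_sample(rows):
    """Sum the read counts of all runs that share a sample identifier."""
    totals = defaultdict(int)
    for _run, sample, reads in rows:
        totals[sample] += reads
    return dict(totals)


sample_totals = reads_per_sample(runs)
for sample, reads in sample_totals.items():
    print(sample, f"{reads:,}")
```

The resulting totals match the "Number of reads" values listed for SAMN00000122 and SAMN00854330 in the by-sample table.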

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Saccoglossus kowalevskii GenBank	217	192 (88.48%)	192 (88.48%)	67.53%	50.10%
Saccoglossus kowalevskii high-quality model RefSeq (XP_)	6,124	4,317 (70.49%)	4,317 (70.49%)	68.22%	64.52%
Saccoglossus kowalevskii known RefSeq (NP_)	474	436 (91.98%)	436 (91.98%)	67.68%	60.39%
Crassostrea gigas GenBank	702	369 (52.56%)	369 (52.56%)	71.52%	75.18%
Crassostrea gigas high-quality model RefSeq (XP_)	21,362	11,637 (54.48%)	11,637 (54.48%)	57.30%	39.08%
Crassostrea gigas known RefSeq (NP_)	141	103 (73.05%)	103 (73.05%)	68.19%	62.16%
Saccharomyces cerevisiae known RefSeq (NP_)	5,891	1,320 (22.41%)	1,320 (22.41%)	58.05%	47.11%
Caenorhabditis elegans known RefSeq (NP_)	28,125	8,027 (28.54%)	8,027 (28.54%)	59.78%	40.79%
Drosophila melanogaster known RefSeq (NP_)	30,469	14,171 (46.51%)	14,171 (46.51%)	61.27%	46.63%
Strongylocentrotus purpuratus GenBank	1,305	459 (35.17%)	459 (35.17%)	65.08%	56.03%
Strongylocentrotus purpuratus high-quality model RefSeq (XP_)	13,741	8,625 (62.77%)	8,625 (62.77%)	60.22%	45.78%
Strongylocentrotus purpuratus known RefSeq (NP_)	427	343 (80.33%)	343 (80.33%)	74.01%	69.58%
Tunicata GenBank	1,274	706 (55.42%)	706 (55.42%)	73.80%	73.24%
Ciona intestinalis GenBank	1,247	787 (63.11%)	787 (63.11%)	62.42%	43.92%
Ciona intestinalis high-quality model RefSeq (XP_)	10,477	6,026 (57.52%)	6,026 (57.52%)	58.37%	44.28%
Ciona intestinalis known RefSeq (NP_)	948	624 (65.82%)	624 (65.82%)	61.38%	43.39%
Branchiostomidae GenBank	1,227	1,213 (98.86%)	1,213 (98.86%)	78.15%	87.64%
Branchiostoma floridae model RefSeq (XP_)	28,623	27,215 (95.08%)	27,215 (95.08%)	70.15%	81.38%
Danio rerio GenBank	27,104	20,131 (74.27%)	20,131 (74.27%)	62.85%	53.10%
Danio rerio known RefSeq (NP_)	15,648	11,689 (74.70%)	11,689 (74.70%)	61.58%	51.31%
Homo sapiens known RefSeq (NP_)	45,035	30,787 (68.36%)	30,787 (68.36%)	60.68%	47.53%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
