NCBI Poecile atricapillus Annotation Release GCF_030490865.1-RS_2023_08

The genome sequence records for Poecile atricapillus RefSeq assembly GCF_030490865.1 (bPoeAtr1.hap1) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_030490865.1-RS_2023_08".

Date of Entrez queries for transcripts and proteins: Aug 11 2023
Date of submission of annotation to the public databases: Aug 14 2023
Software version: 10.1

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
bPoeAtr1.hap1	GCF_030490865.1	Vertebrate Genomes Project	07-18-2023	Reference	42 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	bPoeAtr1.hap1
Genes and pseudogenes	20,627
protein-coding	17,743
non-coding	2,518
Transcribed pseudogenes	1
Non-transcribed pseudogenes	264
genes with variants	8,284
Immunoglobulin/T-cell receptor gene segments	86
other	15
mRNAs	39,653
fully-supported	38,012
with > 5% ab initio	961
partial	213
with filled gap(s)	0
known RefSeq (NM_)	0
model RefSeq (XM_)	39,653
non-coding RNAs	5,003
fully-supported	4,378
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4,734
pseudo transcripts	1
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1
CDSs	39,752
fully-supported	38,012
with > 5% ab initio	1,091
partial	256
with major correction(s)	435
known RefSeq (NP_)	0
model RefSeq (XP_)	39,666
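
As a quick consistency check on the table above, the gene-level categories can be summed and compared with the "Genes and pseudogenes" total. This is a minimal sketch; it assumes that "genes with variants" describes a property of genes rather than a separate category, which is why it is left out of the sum.

```python
# Gene-level counts copied from the "Feature counts" table.
# Assumption: "genes with variants" (8,284) is an attribute of genes,
# not a category of its own, so it is excluded from the sum.
gene_categories = {
    "protein-coding": 17_743,
    "non-coding": 2_518,
    "Transcribed pseudogenes": 1,
    "Non-transcribed pseudogenes": 264,
    "Immunoglobulin/T-cell receptor gene segments": 86,
    "other": 15,
}
reported_total = 20_627  # "Genes and pseudogenes"

category_sum = sum(gene_categories.values())
print(category_sum, category_sum == reported_total)
```

Under that assumption the categories add up to the reported total.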

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	20,276	30,551	11,784	61	1,141,085
All transcripts	44,656	3,487	2,790	61	98,988
mRNA	39,653	3,679	2,969	162	98,988
misc_RNA	1,615	3,336	2,583	147	16,437
tRNA	267	74	73	64	87
lncRNA	2,763	1,567	917	101	14,845
snoRNA	196	111	97	61	322
snRNA	43	148	149	62	193
rRNA	104	678	119	119	8,836
Single-exon transcripts	778	1,636	1,187	162	13,282
coding transcripts (NM_/XM_ )	778	1,636	1,187	162	13,282
CDSs	39,666	2,050	1,497	96	97,785
Exons	230,442	303	135	1	22,955
in coding transcripts (NM_/XM_ )	220,079	294	135	1	22,955
in non-coding transcripts (NR_/XR_ )	18,440	352	136	9	13,889
Introns	207,632	3,469	927	30	592,400
in coding transcripts (NM_/XM_ )	200,330	3,449	928	30	497,241
in non-coding transcripts (NR_/XR_ )	15,165	3,277	892	30	592,400

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.22	1	1	50
Number of exons per transcript	12.4	10	1	293

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein per gene and the passeriformes_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit, and presented in BUSCO notation.
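
For readers unfamiliar with BUSCO notation, the sketch below parses a one-line summary into its fields (C: complete, S: single-copy, D: duplicated, F: fragmented, M: missing, n: number of BUSCOs searched). The summary string used here is a made-up placeholder, not the result for this annotation, and the function name `parse_busco` is hypothetical.

```python
import re

# Pattern for the one-line BUSCO summary notation.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary: str) -> dict:
    """Return the percentages and BUSCO count from a summary line."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not in BUSCO notation: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match["n"])
    return fields

# PLACEHOLDER values for illustration only; not taken from this report.
PLACEHOLDER_SUMMARY = "C:90.0%[S:88.0%,D:2.0%],F:4.0%,M:6.0%,n:1000"
busco = parse_busco(PLACEHOLDER_SUMMARY)
print(busco)
```

In this notation the single-copy and duplicated percentages together make up the complete percentage.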

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 17,730 coding genes, 17,115 genes had a protein with an alignment covering 50% or more of the query and 11,715 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
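
The two definitions can be illustrated with a short sketch. It assumes a single contiguous alignment described by 1-based, inclusive start and end coordinates on each sequence; the coordinates and lengths are hypothetical, and the pipeline's own handling of gapped or multi-segment alignments may differ.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length

# Hypothetical alignment: residues 11-410 of a 500-residue annotated protein
# (query) aligned to residues 1-400 of a 420-residue Swiss-Prot protein (target).
query_cov = coverage_percent(11, 410, 500)
target_cov = coverage_percent(1, 400, 420)
print(f"query coverage {query_cov:.1f}%, target coverage {target_cov:.1f}%")
```

The same alignment can therefore give different query and target coverage when the two proteins differ in length.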

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins
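
A cumulative curve of this kind counts, for each threshold, the genes whose best alignment reaches at least that coverage. The sketch below shows the counting step on a hypothetical list of per-gene coverages, then expresses the two counts quoted in the text as fractions of the 17,730 coding genes.

```python
def genes_at_or_above(coverages, thresholds):
    """Count values that meet or exceed each coverage threshold."""
    return {t: sum(1 for c in coverages if c >= t) for t in thresholds}

# Hypothetical per-gene query coverages (percent), for illustration only.
hypothetical_coverages = [100.0, 97.5, 80.0, 62.0, 49.9, 12.0]
counts = genes_at_or_above(hypothetical_coverages, [50, 95])
print(counts)

# Counts reported in this section, as a share of all coding genes.
coding_genes = 17_730
share_50 = 100.0 * 17_115 / coding_genes
share_95 = 100.0 * 11_715 / coding_genes
print(f"{share_50:.2f}% at >= 50% coverage, {share_95:.2f}% at >= 95% coverage")
```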

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
bPoeAtr1.hap1	GCF_030490865.1	23.30%
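
A masked percentage of this kind can be approximated from a soft-masked FASTA file, in which masked bases are written in lowercase. The sketch below is an illustration under that assumption only: the function name and toy sequence are hypothetical, and the denominator used by the pipeline (for example, whether gap bases are counted) is not stated in this report.

```python
import io

def masked_percent(fasta_handle) -> float:
    """Percentage of lowercase (soft-masked) bases in a FASTA stream."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip header lines
        sequence = line.strip()
        total += len(sequence)
        masked += sum(1 for base in sequence if base.islower())
    if total == 0:
        raise ValueError("no sequence data found")
    return 100.0 * masked / total

# Toy input: 4 lowercase bases out of 20.
toy_fasta = io.StringIO(">chr_toy\nACGTacgtAC\nGTACGTACGT\n")
toy_percent = masked_percent(toy_fasta)
print(f"{toy_percent:.2f}%")
```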

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	1	1 (100.00%)	1 (100.00%)	100.00%	100.00%
Aves known RefSeq (NM_/NR_)	11,281	9,481 (84.04%)	3,760 (33.33%)	91.21%	85.66%
Aves Genbank	44,624	28,855 (64.66%)	13,844 (31.02%)	91.36%	91.56%
Aves EST	756,952	254,644 (33.64%)	176,364 (23.30%)	91.80%	97.23%
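
The percentages in this table can be reproduced from the counts. In the sketch below both the "aligned by Splign" and the "passed to Gnomon" percentages are taken relative to the number of sequences retrieved from Entrez, which is the denominator that matches the reported values.

```python
# (retrieved, aligned by Splign, passed to Gnomon), copied from the table.
transcript_rows = {
    "Same-species Genbank": (1, 1, 1),
    "Aves known RefSeq (NM_/NR_)": (11_281, 9_481, 3_760),
    "Aves Genbank": (44_624, 28_855, 13_844),
    "Aves EST": (756_952, 254_644, 176_364),
}

def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to two decimals."""
    return round(100.0 * part / whole, 2)

for source, (retrieved, aligned, passed) in transcript_rows.items():
    print(
        f"{source}: aligned {percent(aligned, retrieved):.2f}%, "
        f"passed to Gnomon {percent(passed, retrieved):.2f}%"
    )
```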

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,921,591,425	67%	40%	311,652
SAMN07838899	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838899)	19,611,086	73%	32%	133,103
SAMN07838900	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838900)	20,947,568	78%	31%	139,946
SAMN07838901	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838901)	24,529,220	74%	33%	141,725
SAMN07838902	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838902)	19,116,662	72%	33%	132,695
SAMN07838903	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838903)	19,256,991	74%	33%	132,082
SAMN07838904	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838904)	17,333,139	71%	32%	126,765
SAMN07838905	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838905)	19,603,423	70%	31%	128,277
SAMN07838906	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838906)	19,989,906	70%	31%	133,054
SAMN07838907	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838907)	18,419,188	72%	31%	135,997
SAMN07838908	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838908)	20,459,563	72%	32%	132,629
SAMN07838909	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838909)	24,149,528	71%	35%	134,777
SAMN07838910	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838910)	16,759,036	62%	33%	117,803
SAMN07838911	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838911)	20,820,595	68%	35%	128,598
SAMN07838912	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838912)	46,864,513	64%	33%	145,050
SAMN07838913	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838913)	24,053,262	59%	33%	126,469
SAMN07838914	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838914)	18,958,318	65%	33%	125,679
SAMN07838915	pectoral muscle (Poecile atricapillus, adult, not collected, SAMN07838915)	17,447,049	63%	33%	120,679
SAMN09104427	Cardiac muscle (Poecile palustris, Adult, female, SAMN09104427)	62,963,128	66%	50%	172,295
SAMN09104428	Liver (Poecile palustris, Adult, female, SAMN09104428)	73,078,576	79%	53%	175,286
SAMN09104429	Flight muscle (Poecile palustris, Adult, female, SAMN09104429)	57,297,720	70%	54%	159,993
SAMN09104430	Lung (Poecile palustris, Adult, female, SAMN09104430)	62,299,106	83%	37%	170,099
SAMN09104431	Cardiac muscle (Poecile palustris, Adult, female, SAMN09104431)	87,783,300	56%	48%	167,908
SAMN09104432	Kidney (Poecile palustris, Adult, female, SAMN09104432)	60,908,404	67%	42%	156,048
SAMN09104433	Liver (Poecile palustris, Adult, female, SAMN09104433)	91,720,684	69%	49%	149,584
SAMN09104434	Flight muscle (Poecile palustris, Adult, female, SAMN09104434)	89,089,754	45%	47%	137,531
SAMN09104435	Lung (Poecile palustris, Adult, female, SAMN09104435)	62,159,298	82%	37%	173,308
SAMN09104436	Cardiac muscle (Poecile palustris, Adult, female, SAMN09104436)	92,303,224	56%	49%	145,765
SAMN09104437	Kidney (Poecile palustris, Adult, female, SAMN09104437)	60,427,980	73%	41%	158,718
SAMN09104438	Liver (Poecile palustris, Adult, female, SAMN09104438)	113,241,732	67%	47%	149,611
SAMN09104439	Flight muscle (Poecile palustris, Adult, female, SAMN09104439)	90,076,686	44%	50%	141,631
SAMN09104440	Lung (Poecile palustris, Adult, male, SAMN09104440)	62,596,176	81%	45%	156,827
SAMN09104441	Cardiac muscle (Poecile palustris, Adult, male, SAMN09104441)	86,123,126	56%	49%	160,994
SAMN09104442	Kidney (Poecile palustris, Adult, male, SAMN09104442)	61,182,730	71%	42%	159,702
SAMN09104443	Liver (Poecile palustris, Adult, male, SAMN09104443)	96,651,514	71%	49%	132,900
SAMN09104444	Flight muscle (Poecile palustris, Adult, male, SAMN09104444)	92,851,036	43%	50%	132,765
SAMN13755300	blood (Poecile montanus, SAMN13755300)	67,348,076	91%	9%	100,251
SAMN13755301	liver (Poecile palustris, SAMN13755301)	83,170,128	81%	11%	123,204
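
The "All" row can be checked against the per-sample rows by summing the read counts. The sketch below copies the "Number of reads" column from the table above in row order.

```python
# "Number of reads" per sample, copied from the table in row order.
reads_per_sample = [
    # SAMN07838899 to SAMN07838915
    19_611_086, 20_947_568, 24_529_220, 19_116_662, 19_256_991, 17_333_139,
    19_603_423, 19_989_906, 18_419_188, 20_459_563, 24_149_528, 16_759_036,
    20_820_595, 46_864_513, 24_053_262, 18_958_318, 17_447_049,
    # SAMN09104427 to SAMN09104444
    62_963_128, 73_078_576, 57_297_720, 62_299_106, 87_783_300, 60_908_404,
    91_720_684, 89_089_754, 62_159_298, 92_303_224, 60_427_980, 113_241_732,
    90_076_686, 62_596_176, 86_123_126, 61_182_730, 96_651_514, 92_851_036,
    # SAMN13755300 and SAMN13755301
    67_348_076, 83_170_128,
]
reported_aggregate = 1_921_591_425  # "All" row

total_reads = sum(reads_per_sample)
print(len(reads_per_sample), total_reads, total_reads == reported_aggregate)
```

The 37 per-sample counts sum to the aggregate read count reported in the "All" row.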

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR6255691	SRX3362170	SRP123632	SAMN07838899	19,611,086	73%	32%
SRR6255694	SRX3362167	SRP123632	SAMN07838900	20,947,568	78%	31%
SRR6255693	SRX3362168	SRP123632	SAMN07838901	24,529,220	74%	33%
SRR6255703	SRX3362158	SRP123632	SAMN07838902	19,116,662	72%	33%
SRR6255704	SRX3362157	SRP123632	SAMN07838903	19,256,991	74%	33%
SRR6255705	SRX3362156	SRP123632	SAMN07838904	17,333,139	71%	32%
SRR6255706	SRX3362155	SRP123632	SAMN07838905	19,603,423	70%	31%
SRR6255699	SRX3362162	SRP123632	SAMN07838906	19,989,906	70%	31%
SRR6255700	SRX3362161	SRP123632	SAMN07838907	18,419,188	72%	31%
SRR6255701	SRX3362160	SRP123632	SAMN07838908	20,459,563	72%	32%
SRR6255702	SRX3362159	SRP123632	SAMN07838909	24,149,528	71%	35%
SRR6255707	SRX3362154	SRP123632	SAMN07838910	16,759,036	62%	33%
SRR6255708	SRX3362153	SRP123632	SAMN07838911	20,820,595	68%	35%
SRR6255696	SRX3362165	SRP123632	SAMN07838912	46,864,513	64%	33%
SRR6255695	SRX3362166	SRP123632	SAMN07838913	24,053,262	59%	33%
SRR6255698	SRX3362163	SRP123632	SAMN07838914	18,958,318	65%	33%
SRR6255697	SRX3362164	SRP123632	SAMN07838915	17,447,049	63%	33%
SRR7244688	SRX4149503	SRP149501	SAMN09104427	62,963,128	66%	50%
SRR7244687	SRX4149504	SRP149501	SAMN09104428	73,078,576	79%	53%
SRR7244686	SRX4149505	SRP149501	SAMN09104429	57,297,720	70%	54%
SRR7244685	SRX4149506	SRP149501	SAMN09104430	62,299,106	83%	37%
SRR7244684	SRX4149507	SRP149501	SAMN09104431	87,783,300	56%	48%
SRR7244683	SRX4149508	SRP149501	SAMN09104432	60,908,404	67%	42%
SRR7244682	SRX4149509	SRP149501	SAMN09104433	91,720,684	69%	49%
SRR7244681	SRX4149510	SRP149501	SAMN09104434	89,089,754	45%	47%
SRR7244680	SRX4149511	SRP149501	SAMN09104435	62,159,298	82%	37%
SRR7244679	SRX4149512	SRP149501	SAMN09104436	92,303,224	56%	49%
SRR7244711	SRX4149480	SRP149501	SAMN09104437	60,427,980	73%	41%
SRR7244712	SRX4149479	SRP149501	SAMN09104438	113,241,732	67%	47%
SRR7244709	SRX4149482	SRP149501	SAMN09104439	90,076,686	44%	50%
SRR7244710	SRX4149481	SRP149501	SAMN09104440	62,596,176	81%	45%
SRR7244715	SRX4149476	SRP149501	SAMN09104441	86,123,126	56%	49%
SRR7244716	SRX4149475	SRP149501	SAMN09104442	61,182,730	71%	42%
SRR7244713	SRX4149478	SRP149501	SAMN09104443	96,651,514	71%	49%
SRR7244714	SRX4149477	SRP149501	SAMN09104444	92,851,036	43%	50%
SRR10852963	SRX7523385	SRP240625	SAMN13755300	67,348,076	91%	9%
SRR10852962	SRX7523386	SRP240625	SAMN13755301	83,170,128	81%	11%

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pseudopodoces humilis high-quality model RefSeq (XP_)	10,444	10,376 (99.35%)	10,376 (99.35%)	83.72%	88.32%
Xenopus known RefSeq (NP_)	19,250	18,002 (93.52%)	18,002 (93.52%)	70.06%	79.35%
Aves GenBank	15,616	14,709 (94.19%)	14,709 (94.19%)	71.83%	84.02%
Aves known RefSeq (NP_)	10,026	9,821 (97.96%)	9,821 (97.96%)	77.29%	85.37%
Columba livia high-quality model RefSeq (XP_)	8,292	8,186 (98.72%)	8,186 (98.72%)	77.65%	85.48%
Gallus gallus high-quality model RefSeq (XP_)	9,972	9,652 (96.79%)	9,652 (96.79%)	76.49%	83.03%
Homo sapiens known RefSeq (NP_)	67,119	58,358 (86.95%)	58,358 (86.95%)	71.36%	76.84%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
