NCBI Humulus lupulus Annotation Release GCF_963169125.1-RS_2024_01

The genome sequence records for Humulus lupulus RefSeq assembly GCF_963169125.1 (drHumLupu1.1) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

The annotation products are available in the sequence databases and on the FTP site.
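
As a convenience for locating the files, the sketch below builds the directory URL for this assembly. It assumes the usual layout of the NCBI genomes FTP area (accession digits split into three-digit subdirectories under genomes/all); the function name refseq_ftp_dir is hypothetical and the layout is an assumption to verify against the FTP site, not something stated in this report.

```python
def refseq_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the assumed genomes/all directory URL for an assembly accession."""
    prefix, rest = accession.split("_", 1)        # "GCF", "963169125.1"
    digits = rest.split(".", 1)[0]                # "963169125"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession}")
    # Assumption: the nine digits map to three nested 3-digit directories.
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    name = assembly_name.replace(" ", "_")        # spaces are not expected in directory names
    base = "https://ftp.ncbi.nlm.nih.gov/genomes/all"
    return f"{base}/{prefix}/{'/'.join(triplets)}/{accession}_{name}/"


url = refseq_ftp_dir("GCF_963169125.1", "drHumLupu1.1")
print(url)
```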

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_963169125.1-RS_2024_01".

Date of Entrez queries for transcripts and proteins: Jan 4 2024
Date of submission of annotation to the public databases: Jan 9 2024
Software version: 10.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
drHumLupu1.1	GCF_963169125.1	WELLCOME SANGER INSTITUTE	08-19-2023	Reference	10 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	drHumLupu1.1
Genes and pseudogenes	57,465
  protein-coding	35,582
  non-coding	19,381
  Transcribed pseudogenes	7
  Non-transcribed pseudogenes	2,495
  genes with variants	7,832
  Immunoglobulin/T-cell receptor gene segments	0
  other	0
mRNAs	47,669
  fully-supported	34,903
  with > 5% ab initio	11,664
  partial	224
  with filled gap(s)	22
  known RefSeq (NM_)	0
  model RefSeq (XM_)	47,669
non-coding RNAs	25,462
  fully-supported	12,235
  with > 5% ab initio	0
  partial	7
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	24,937
pseudo transcripts	7
  fully-supported	6
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	7
CDSs	47,669
  fully-supported	34,903
  with > 5% ab initio	11,848
  partial	219
  with major correction(s)	67
  known RefSeq (NP_)	0
  model RefSeq (XP_)	47,669
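
The gene categories in the table above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the published counts; the variable names are illustrative.

```python
# Gene-level counts copied from the Feature counts table.
gene_counts = {
    "protein-coding": 35_582,
    "non-coding": 19_381,
    "transcribed pseudogenes": 7,
    "non-transcribed pseudogenes": 2_495,
}
reported_total = 57_465  # "Genes and pseudogenes"

computed_total = sum(gene_counts.values())

# Removing pseudogenes should give the "Genes" count used in the
# Feature lengths table, which excludes pseudogenes.
pseudogenes = (gene_counts["transcribed pseudogenes"]
               + gene_counts["non-transcribed pseudogenes"])
genes_without_pseudogenes = reported_total - pseudogenes

print(computed_total, genes_without_pseudogenes)
```

The four categories add up to the reported total, and the remainder after removing pseudogenes equals 54,963, the gene count in the Feature lengths table.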

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	54,963	3,077	1,808	60	466,706
All transcripts	73,131	1,551	1,277	60	36,119
  mRNA	47,669	1,818	1,534	144	36,119
  misc_RNA	3,408	2,371	1,899	174	19,676
  tRNA	525	74	73	71	88
  lncRNA	8,847	1,319	810	85	31,096
  snoRNA	5,361	107	107	60	241
  snRNA	96	141	119	98	197
  rRNA	7,225	881	119	114	3,528
Single-exon transcripts	7,171	1,172	970	144	9,684
  coding transcripts (NM_/XM_)	7,171	1,172	970	144	9,684
CDSs	47,669	1,332	1,092	93	16,596
Exons	214,747	344	177	1	34,543
  in coding transcripts (NM_/XM_)	186,395	342	177	1	34,543
  in non-coding transcripts (NR_/XR_)	34,998	334	158	11	25,789
Introns	166,021	722	182	30	127,346
  in coding transcripts (NM_/XM_)	146,447	657	171	30	127,346
  in non-coding transcripts (NR_/XR_)	25,875	1,099	296	30	93,667

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.33	1	1	50
Number of exons per transcript	4.83	3	1	79
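
The mean number of transcripts per gene follows directly from the counts in the Feature lengths table. A minimal check, using only numbers from this report:

```python
# Counts from the Feature lengths table (pseudogenes excluded).
n_genes = 54_963
n_transcripts = 73_131

mean_transcripts_per_gene = n_transcripts / n_genes
print(round(mean_transcripts_per_gene, 2))
```

The same shortcut does not work for exons per transcript: 214,747 / 73,131 is about 2.94, not 4.83, presumably because exons shared between transcripts are counted once in the Exons row.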

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode on the annotated gene set, using the longest protein per gene and the eudicots_odb10 lineage dataset. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
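
BUSCO notation is a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.. (complete, single-copy, duplicated, fragmented, missing, number of BUSCOs searched). The sketch below parses such a line. The example string contains made-up placeholder values, not the results for this release, and the function name parse_busco is hypothetical.

```python
import re

# Pattern for the one-line BUSCO summary notation.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> dict:
    """Return the percentages (floats) and n (int) from a BUSCO summary line."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


# Placeholder values for illustration only (NOT this annotation's results).
example = "C:90.0%[S:80.0%,D:10.0%],F:4.0%,M:6.0%,n:1000"
parsed = parse_busco(example)
print(parsed)
```

In a well-formed line, S and D add up to C, and C, F and M add up to 100%.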

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 35,582 coding genes, 27,370 genes had a protein with an alignment covering 50% or more of the query and 10,669 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
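
The coverage definitions and the gene counts quoted above can be expressed as follows. The protein and alignment lengths in the first part are hypothetical values chosen for illustration; the gene counts are taken from this report.

```python
def coverage_pct(aligned_len: int, seq_len: int) -> float:
    """Percentage of a sequence's length that is included in the alignment."""
    return 100.0 * aligned_len / seq_len


# Hypothetical alignment: 300 residues aligned between a 400-aa annotated
# protein (query) and a 500-aa Arabidopsis protein (target).
query_cov = coverage_pct(300, 400)
target_cov = coverage_pct(300, 500)

# Counts from this report.
coding_genes = 35_582
genes_cov50 = 27_370   # query coverage >= 50%
genes_cov95 = 10_669   # query coverage >= 95%

pct_cov50 = round(100 * genes_cov50 / coding_genes, 1)
pct_cov95 = round(100 * genes_cov95 / coding_genes, 1)
print(query_cov, target_cov, pct_cov50, pct_cov95)
```

About 76.9% of coding genes have a protein aligned over at least half its length, and about 30.0% over at least 95% of its length.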

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker (if calculated), for each assembly. RepeatMasker results are only calculated for organisms with complete Dfam HMM model collections.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with WindowMasker
drHumLupu1.1	GCF_963169125.1	60.68%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez Nucleotide, Entrez Protein, and SRA, and aligned to the genome.

Transcript alignments

The alignments of the following transcripts with Splign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	174	168 (96.55%)	157 (90.23%)	99.07%	98.88%
Same-species TSA	562,867	456,128 (81.04%)	319,336 (56.73%)	98.77%	96.11%
Same-species EST	25,677	20,945 (81.57%)	19,530 (76.06%)	98.85%	98.80%
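
Cells of the form "count (percent)" in the table above can be parsed and checked against the number of sequences retrieved. The sketch below is a minimal example; the helper name parse_count_cell is hypothetical, and the check assumes both percentages are relative to the number of sequences retrieved from Entrez.

```python
import re

CELL_RE = re.compile(r"^([\d,]+) \(([\d.]+)%\)$")


def parse_count_cell(cell: str) -> tuple:
    """Split a cell such as '456,128 (81.04%)' into (456128, 81.04)."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# (retrieved, aligned by Splign, passed to Gnomon), copied from the table.
rows = {
    "Same-species GenBank": (174, "168 (96.55%)", "157 (90.23%)"),
    "Same-species TSA": (562_867, "456,128 (81.04%)", "319,336 (56.73%)"),
    "Same-species EST": (25_677, "20,945 (81.57%)", "19,530 (76.06%)"),
}

max_dev = 0.0
for source, (retrieved, aligned_cell, passed_cell) in rows.items():
    for cell in (aligned_cell, passed_cell):
        count, reported_pct = parse_count_cell(cell)
        recomputed_pct = 100 * count / retrieved
        max_dev = max(max_dev, abs(recomputed_pct - reported_pct))

print(f"largest deviation from reported percentages: {max_dev:.4f}")
```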

RNA-Seq alignments

The alignments of the following RNA-Seq reads with STAR were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	5,577,208,123	78%	27%	193,867
SAMN01180278	23347725	Hop lupulin glands from Taurus (Mainburg, Germany collection) (Humulus lupulus, SAMN01180278)	17,807,934	87%	23%	111,080
SAMN01180281	23347725	Hop lupulin glands from Taurus (Saskatoon, Canada collection) (Humulus lupulus, SAMN01180281)	16,141,990	70%	22%	106,084
SAMN01180282	23347725	Hop lupulin glands from Apollo (Yakima, USA collection) (Humulus lupulus, SAMN01180282)	19,569,670	88%	21%	115,299
SAMN01180283	23347725	Hop lupulin glands from Nugget (Mainburg, Germany collection) (Humulus lupulus, SAMN01180283)	25,584,550	84%	21%	113,585
SAMN01180284	23347725	Hop lupulin glands from Magnum (Mainburg, Germany collection) (Humulus lupulus, SAMN01180284)	19,990,336	83%	20%	105,895
SAMN01180285	23347725	Hop cone with lupulin glands removed from Taurus (Mainburg, Germany collection) (Humulus lupulus, SAMN01180285)	15,955,976	67%	22%	110,526
SAMN01180286	23347725	Hop cone with lupulin glands removed from Taurus (Saskatoon, Canada collection) (Humulus lupulus, SAMN01180286)	20,008,764	85%	24%	121,429
SAMN01180287	23347725	Hop cone with lupulin glands removed from Apollo (Yakima, USA collection) (Humulus lupulus, SAMN01180287)	19,408,882	89%	23%	119,048
SAMN01180288	23347725	Hop leaves from Taurus (Mainburg, Germany collection) (Humulus lupulus, SAMN01180288)	24,814,798	90%	25%	119,147
SAMN01180289	23347725	Hop leaves from Taurus (Saskatoon, Canada collection) (Humulus lupulus, SAMN01180289)	20,389,098	89%	24%	109,840
SAMN01180290	23347725	Hop leaves from Apollo (Yakima, USA collection) (Humulus lupulus, SAMN01180290)	17,451,774	91%	25%	112,572
SAMN05767836	NA	roots, sprouts, leaves, stems, flowers, cones (Humulus lupulus, female, SAMN05767836)	348,065,384	83%	23%	156,195
SAMN11244946	NA	leaves (Humulus lupulus, 24 months post inoculation, female, SAMN11244946)	53,996,194	39%	16%	113,731
SAMN11244947	NA	leaves (Humulus lupulus, 24 months post inoculation, female, SAMN11244947)	63,767,432	50%	20%	127,315
SAMN11244948	NA	leaves (Humulus lupulus, 24 months post inoculation, female, SAMN11244948)	94,699,602	53%	19%	130,907
SAMN11244949	NA	leaves (Humulus lupulus, 24 months post inoculation, female, SAMN11244949)	46,546,661	63%	18%	120,881
SAMN13071557	NA	stem (Humulus lupulus, SAMN13071557)	291,420,284	55%	24%	128,331
SAMN13071558	NA	stem (Humulus lupulus, SAMN13071558)	296,255,192	55%	25%	128,238
SAMN13071559	NA	meristem (Humulus lupulus, SAMN13071559)	202,833,224	85%	25%	130,939
SAMN13071560	NA	meristem (Humulus lupulus, SAMN13071560)	206,925,644	86%	25%	130,536
SAMN13071561	NA	leaf (Humulus lupulus, SAMN13071561)	229,741,820	90%	23%	111,549
SAMN13071562	NA	leaf (Humulus lupulus, SAMN13071562)	235,509,952	91%	23%	111,784
SAMN15647434	NA	female inflorescence (Humulus lupulus, SAMN15647434)	111,207,624	90%	21%	135,480
SAMN15647435	NA	female inflorescence (Humulus lupulus, SAMN15647435)	110,929,928	91%	22%	137,030
SAMN15647436	NA	female inflorescence (Humulus lupulus, SAMN15647436)	110,835,388	91%	23%	137,149
SAMN15647437	NA	female inflorescence (Humulus lupulus, SAMN15647437)	110,499,150	67%	17%	131,075
SAMN15647438	NA	female inflorescence (Humulus lupulus, SAMN15647438)	110,744,894	90%	20%	132,399
SAMN15647439	NA	female inflorescence (Humulus lupulus, SAMN15647439)	109,570,634	87%	19%	137,153
SAMN15647440	NA	female inflorescence (Humulus lupulus, SAMN15647440)	109,279,414	92%	24%	140,331
SAMN15647441	NA	female inflorescence (Humulus lupulus, SAMN15647441)	110,684,604	91%	23%	136,637
SAMN15647442	NA	female inflorescence (Humulus lupulus, SAMN15647442)	109,998,416	91%	23%	138,806
SAMN17526021	NA	leaves (Humulus lupulus, SAMN17526021)	1,057,429,800	85%	33%	172,399
SAMN32518834	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518834)	86,187,206	89%	33%	129,894
SAMN32518835	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518835)	85,689,158	90%	34%	128,426
SAMN32518836	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518836)	92,676,214	89%	33%	131,686
SAMN32518837	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518837)	97,593,182	87%	36%	131,813
SAMN32518838	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518838)	83,827,778	88%	35%	131,992
SAMN32518839	NA	Leaf (Humulus lupulus, 19 months, female, SAMN32518839)	87,152,972	88%	37%	131,063
SAMN32518840	NA	Lupulin gland (Humulus lupulus, 29 months, female, SAMN32518840)	92,718,620	73%	31%	120,766
SAMN32518841	NA	Lupulin gland (Humulus lupulus, 29 months, female, SAMN32518841)	84,801,686	73%	31%	121,736
SAMN32518842	NA	Lupulin gland (Humulus lupulus, 29 months, female, SAMN32518842)	95,230,870	72%	31%	122,444
SAMN32518843	NA	Lupulin gland (Humulus lupulus, 27 months, female, SAMN32518843)	95,036,132	48%	30%	121,662
SAMN32518844	NA	Lupulin gland (Humulus lupulus, 27 months, female, SAMN32518844)	255,818,736	48%	30%	130,049
SAMN32518845	NA	Lupulin gland (Humulus lupulus, 27 months, female, SAMN32518845)	82,410,556	48%	30%	119,982

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR575189	SRX188987	SRP015829	SAMN01180278	17,807,934	87%	23%
SRR575191	SRX188988	SRP015829	SAMN01180281	16,141,990	70%	22%
SRR575193	SRX188989	SRP015829	SAMN01180282	19,569,670	88%	21%
SRR575195	SRX188990	SRP015829	SAMN01180283	25,584,550	84%	21%
SRR575197	SRX188991	SRP015829	SAMN01180284	19,990,336	83%	20%
SRR575199	SRX188992	SRP015829	SAMN01180285	15,955,976	67%	22%
SRR575201	SRX188993	SRP015829	SAMN01180286	20,008,764	85%	24%
SRR575203	SRX188994	SRP015829	SAMN01180287	19,408,882	89%	23%
SRR575205	SRX188995	SRP015829	SAMN01180288	24,814,798	90%	25%
SRR575207	SRX188996	SRP015829	SAMN01180289	20,389,098	89%	24%
SRR575209	SRX188997	SRP015829	SAMN01180290	17,451,774	91%	25%
SRR4242068	SRX2162946	SRP089843	SAMN05767836	348,065,384	83%	23%
SRR8775478	SRX5565566	SRP189269	SAMN11244946	18,874,544	47%	17%
SRR8775477	SRX5565567	SRP189269	SAMN11244946	19,274,640	32%	16%
SRR8775467	SRX5565577	SRP189269	SAMN11244946	15,847,010	38%	16%
SRR8775475	SRX5565569	SRP189269	SAMN11244947	16,569,472	61%	20%
SRR8775468	SRX5565576	SRP189269	SAMN11244947	47,197,960	46%	20%
SRR8775474	SRX5565570	SRP189269	SAMN11244948	36,186,216	49%	18%
SRR8775473	SRX5565571	SRP189269	SAMN11244948	30,757,861	56%	21%
SRR8775471	SRX5565573	SRP189269	SAMN11244948	27,755,525	55%	18%
SRR8775472	SRX5565572	SRP189269	SAMN11244949	24,778,032	62%	18%
SRR8775470	SRX5565574	SRP189269	SAMN11244949	21,768,629	63%	18%
SRR10320796	SRX7031667	SRP226517	SAMN13071557	291,420,284	55%	24%
SRR10320795	SRX7031666	SRP226517	SAMN13071558	296,255,192	55%	25%
SRR10320794	SRX7031665	SRP226517	SAMN13071559	202,833,224	85%	25%
SRR10320793	SRX7031664	SRP226517	SAMN13071560	206,925,644	86%	25%
SRR10320792	SRX7031663	SRP226517	SAMN13071561	229,741,820	90%	23%
SRR10320791	SRX7031662	SRP226517	SAMN13071562	235,509,952	91%	23%
SRR12329023	SRX8829243	SRP273665	SAMN15647434	111,207,624	90%	21%
SRR12329022	SRX8829244	SRP273665	SAMN15647435	110,929,928	91%	22%
SRR12329021	SRX8829245	SRP273665	SAMN15647436	110,835,388	91%	23%
SRR12329020	SRX8829246	SRP273665	SAMN15647437	110,499,150	67%	17%
SRR12329019	SRX8829247	SRP273665	SAMN15647438	110,744,894	90%	20%
SRR12329027	SRX8829239	SRP273665	SAMN15647439	109,570,634	87%	19%
SRR12329026	SRX8829240	SRP273665	SAMN15647440	109,279,414	92%	24%
SRR12329025	SRX8829241	SRP273665	SAMN15647441	110,684,604	91%	23%
SRR12329024	SRX8829242	SRP273665	SAMN15647442	109,998,416	91%	23%
SRR13528971	SRX9937278	SRP303278	SAMN17526021	64,832,440	77%	33%
SRR13528970	SRX9937279	SRP303278	SAMN17526021	62,672,492	88%	33%
SRR13528969	SRX9937280	SRP303278	SAMN17526021	69,640,936	81%	34%
SRR13528968	SRX9937281	SRP303278	SAMN17526021	62,470,646	88%	33%
SRR13528966	SRX9937282	SRP303278	SAMN17526021	83,179,450	78%	35%
SRR13528965	SRX9937283	SRP303278	SAMN17526021	74,735,618	87%	35%
SRR13528964	SRX9937284	SRP303278	SAMN17526021	84,280,634	90%	33%
SRR13528967	SRX9937285	SRP303278	SAMN17526021	85,083,126	85%	32%
SRR13528963	SRX9937286	SRP303278	SAMN17526021	95,683,074	84%	33%
SRR13528962	SRX9937287	SRP303278	SAMN17526021	61,639,362	87%	35%
SRR13528961	SRX9937288	SRP303278	SAMN17526021	92,677,476	81%	35%
SRR13528960	SRX9937289	SRP303278	SAMN17526021	85,952,000	85%	34%
SRR13528959	SRX9937290	SRP303278	SAMN17526021	64,590,714	90%	29%
SRR13528958	SRX9937291	SRP303278	SAMN17526021	69,991,832	86%	35%
SRR22957819	SRX18914501	SRP415597	SAMN32518834	86,187,206	89%	33%
SRR22957818	SRX18914502	SRP415597	SAMN32518835	85,689,158	90%	34%
SRR22957815	SRX18914505	SRP415597	SAMN32518836	92,676,214	89%	33%
SRR22957814	SRX18914506	SRP415597	SAMN32518837	97,593,182	87%	36%
SRR22957813	SRX18914507	SRP415597	SAMN32518838	83,827,778	88%	35%
SRR22957812	SRX18914508	SRP415597	SAMN32518839	87,152,972	88%	37%
SRR22957811	SRX18914509	SRP415597	SAMN32518840	92,718,620	73%	31%
SRR22957810	SRX18914510	SRP415597	SAMN32518841	84,801,686	73%	31%
SRR22957809	SRX18914511	SRP415597	SAMN32518842	95,230,870	72%	31%
SRR22957808	SRX18914512	SRP415597	SAMN32518843	95,036,132	48%	30%
SRR22957817	SRX18914503	SRP415597	SAMN32518844	255,818,736	48%	30%
SRR22957816	SRX18914504	SRP415597	SAMN32518845	82,410,556	48%	30%
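
Per-sample figures can be recovered from the per-run rows by summing reads and taking a read-weighted mean of the aligned percentage. The sketch below does this for two samples from project SRP189269. Because the per-run percentages are already rounded, the weighted mean is only an approximation of the per-sample value.

```python
from collections import defaultdict

# (run, sample, number of reads, percent aligned), copied from the by-run table.
runs = [
    ("SRR8775478", "SAMN11244946", 18_874_544, 47),
    ("SRR8775477", "SAMN11244946", 19_274_640, 32),
    ("SRR8775467", "SAMN11244946", 15_847_010, 38),
    ("SRR8775475", "SAMN11244947", 16_569_472, 61),
    ("SRR8775468", "SAMN11244947", 47_197_960, 46),
]

total_reads = defaultdict(int)
aligned_reads = defaultdict(float)
for run, sample, reads, pct_aligned in runs:
    total_reads[sample] += reads
    aligned_reads[sample] += reads * pct_aligned / 100  # estimated aligned reads

# Map each sample to (total reads, read-weighted percent aligned).
per_sample = {
    sample: (total_reads[sample],
             round(100 * aligned_reads[sample] / total_reads[sample]))
    for sample in total_reads
}
print(per_sample)
```

For these two samples the totals (53,996,194 and 63,767,432 reads) and the rounded percentages (39% and 50%) agree with the by-sample table.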

SRA Long Read Alignment Statistics

The alignments of the following long RNA-Seq reads (PacBio, Oxford Nanopore, 454, or other long-read sequencing technologies) from the Sequence Read Archive with minimap2 were used for gene prediction:

Run	Sample	Number of reads	Number (%) of sequences aligned by Minimap2	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
All	NA	801662	733826 (91.53%)	534497 (66.67%)	99.13	98.96
SRR546165	SAMN01121758	326354	299107 (91.65%)	221512 (67.87%)	99.09	98.91
SRR546168	SAMN01121759	137295	124995 (91.04%)	90884 (66.19%)	99.18	98.79
SRR546170	SAMN01121760	166934	151990 (91.04%)	108703 (65.11%)	99.13	99.12
SRR546172	SAMN01121761	171079	157734 (92.19%)	113398 (66.28%)	99.16	99.04
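
The "All" row of the table above is the column-wise sum of the four runs, which the following sketch verifies. It also recomputes the percentage passed to Gnomon, which appears to be relative to the total number of reads rather than to the aligned reads.

```python
# (number of reads, aligned by minimap2, passed to Gnomon) per run.
long_read_runs = {
    "SRR546165": (326_354, 299_107, 221_512),
    "SRR546168": (137_295, 124_995, 90_884),
    "SRR546170": (166_934, 151_990, 108_703),
    "SRR546172": (171_079, 157_734, 113_398),
}
all_row = (801_662, 733_826, 534_497)

# Sum each column across runs.
summed = tuple(sum(column) for column in zip(*long_read_runs.values()))

total_reads, _, passed = summed
pct_passed = 100 * passed / total_reads
print(summed, round(pct_passed, 2))
```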

Protein alignments

The alignments of the following proteins with ProSplign were used for gene prediction:

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	144	144 (100.00%)	144 (100.00%)	77.20%	90.44%
Theobroma cacao high-quality model RefSeq (XP_)	13,536	13,066 (96.53%)	13,066 (96.53%)	69.44%	79.93%
Cucurbita maxima high-quality model RefSeq (XP_)	18,976	18,562 (97.82%)	18,562 (97.82%)	69.80%	78.57%
Arabidopsis thaliana known RefSeq (NP_)	48,147	41,910 (87.05%)	41,910 (87.05%)	67.10%	72.03%
Rosales GenBank	9,553	9,134 (95.61%)	9,134 (95.61%)	70.24%	82.37%
Rosales known RefSeq (NP_)	956	939 (98.22%)	939 (98.22%)	69.66%	80.62%
Malus domestica high-quality model RefSeq (XP_)	28,188	26,165 (92.82%)	26,165 (92.82%)	69.36%	78.50%
Nelumbo nucifera high-quality model RefSeq (XP_)	14,295	13,855 (96.92%)	13,855 (96.92%)	69.59%	78.98%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
