NCBI Sorex fumeus Annotation Release GCF_029834395.1-RS_2023_10

The genome sequence records for Sorex fumeus RefSeq assembly GCF_029834395.1 (SorFum_2.1) were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies.

Annotation Release GCF_029834395.1-RS_2023_10 is an update of GCF_029834395.1-RS_2023_05. The known RefSeq transcripts (with NM_ and NR_ prefixes) that were current on Oct 4 2023 were placed on the genome and used to update the annotated features. In addition, model RefSeq sequences that were predicted in the last full annotation (GCF_029834395.1-RS_2023_05) and were still current on Oct 4 2023 were included in the updated annotation. These models were not recalculated for this update. For more information on the evidence used to generate the model RefSeq sequences, please consult the report for GCF_029834395.1-RS_2023_05.

The annotation products are available in the sequence databases and on the FTP site.
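As a rough illustration of where the annotation products might be found, the sketch below builds a directory URL from the assembly accession and name. The URL layout (genomes/all/ followed by the accession prefix, the nine digits split into groups of three, and an accession_name directory) is an assumption based on the usual layout of the NCBI genomes FTP area and is not stated in this report; the function name refseq_ftp_dir is hypothetical.

```python
def refseq_ftp_dir(accession: str, assembly_name: str) -> str:
    """Build the ASSUMED NCBI genomes directory URL for an assembly.

    The layout used here is an assumption, not taken from this report.
    """
    prefix, rest = accession.split("_", 1)   # e.g. "GCF", "029834395.1"
    digits = rest.split(".", 1)[0]           # e.g. "029834395"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Split the nine digits into three groups of three: 029 / 834 / 395
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    path = "/".join([prefix, *chunks])
    return (
        "https://ftp.ncbi.nlm.nih.gov/genomes/all/"
        f"{path}/{accession}_{assembly_name}/"
    )


url = refseq_ftp_dir("GCF_029834395.1", "SorFum_2.1")
print(url)
```

The resulting URL should be checked in a browser before it is used in any script, since the layout is assumed rather than documented here.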

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
BUSCO results: Annotation completeness assessed with BUSCO

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as "GCF_029834395.1-RS_2023_10".

Date of Entrez queries for transcripts and proteins: Oct 4 2023
Date of submission of annotation to the public databases: Oct 9 2023
Software version: 10.2
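
The release name can be split into its parts when releases need to be compared programmatically. The sketch below assumes that the name has the form accession, "-RS_", a four-digit year and a two-digit month, which matches the two release names in this report but is not formally specified here. The names AnnotationRelease and parse_release_name are hypothetical.

```python
import re
from typing import NamedTuple


class AnnotationRelease(NamedTuple):
    accession: str  # RefSeq assembly accession, e.g. "GCF_029834395.1"
    year: int       # assumed: year of the annotation release
    month: int      # assumed: month of the annotation release


# Assumed pattern: <accession>-RS_<YYYY>_<MM>
_RELEASE_RE = re.compile(r"^(GCF_\d{9}\.\d+)-RS_(\d{4})_(\d{2})$")


def parse_release_name(name: str) -> AnnotationRelease:
    match = _RELEASE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized release name: {name!r}")
    accession, year, month = match.groups()
    return AnnotationRelease(accession, int(year), int(month))


current = parse_release_name("GCF_029834395.1-RS_2023_10")
previous = parse_release_name("GCF_029834395.1-RS_2023_05")

# Same assembly, and the current release is later than the previous one.
print(current.accession == previous.accession)
print((current.year, current.month) > (previous.year, previous.month))
```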

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
SorFum_2.1	GCF_029834395.1	Trent University	04-17-2023	Reference	unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	SorFum_2.1
Genes and pseudogenes	30,483
  protein-coding	20,380
  non-coding	4,534
  Transcribed pseudogenes	0
  Non-transcribed pseudogenes	5,216
  genes with variants	5,404
  Immunoglobulin/T-cell receptor gene segments	278
  other	75
mRNAs	32,646
  fully-supported	28,314
  with > 5% ab initio	2,378
  partial	296
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	32,646
non-coding RNAs	5,050
  fully-supported	757
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	4,519
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	32,924
  fully-supported	28,314
  with > 5% ab initio	2,543
  partial	322
  with major correction(s)	715
  known RefSeq (NP_)	0
  model RefSeq (XP_)	32,646
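
The gene counts above can be checked for internal consistency. The sketch below assumes that "genes with variants" is an overlapping property rather than a separate category, so it is left out of the sum; under that assumption the remaining categories add up to the reported total of genes and pseudogenes.

```python
# Counts copied from the "Genes and pseudogenes" rows of the table above.
# "genes with variants" (5,404) is excluded on the assumption that it
# describes genes already counted in the other categories.
gene_categories = {
    "protein-coding": 20_380,
    "non-coding": 4_534,
    "Transcribed pseudogenes": 0,
    "Non-transcribed pseudogenes": 5_216,
    "Immunoglobulin/T-cell receptor gene segments": 278,
    "other": 75,
}
reported_total = 30_483  # "Genes and pseudogenes"

category_sum = sum(gene_categories.values())
print(category_sum, category_sum == reported_total)
```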

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	24,989	39,781	9,085	51	2,391,823
All transcripts	37,696	2,418	1,824	51	103,779
  mRNA	32,646	2,737	2,084	108	103,779
  misc_RNA	380	2,465	2,099	156	15,702
  tRNA	531	74	73	66	91
  lncRNA	377	721	520	149	7,291
  snoRNA	1,115	130	130	51	321
  snRNA	2,208	114	107	59	199
  rRNA	364	369	119	118	4,123
Single-exon transcripts	2,738	1,057	942	108	20,076
  coding transcripts (NM_/XM_)	2,738	1,057	942	108	20,076
CDSs	32,646	1,948	1,410	99	102,543
Exons	211,183	235	132	2	20,076
  in coding transcripts (NM_/XM_)	209,417	234	132	2	20,076
  in non-coding transcripts (NR_/XR_)	4,991	201	124	10	6,960
Introns	189,997	5,779	1,354	30	646,083
  in coding transcripts (NM_/XM_)	188,779	5,739	1,353	30	646,083
  in non-coding transcripts (NR_/XR_)	4,354	7,507	1,595	30	511,258
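
For downstream use, the tab-separated rows of this table can be read into integers by stripping the thousands separators. The sketch below parses a few rows copied from the table and cross-checks them against the feature counts: the difference between all transcripts and mRNAs equals the reported number of non-coding RNAs (5,050), and the gene count equals the sum of protein-coding, non-coding and other genes. That the Genes row excludes immunoglobulin/T-cell receptor gene segments as well as pseudogenes is an inference from the arithmetic, not a statement in the report; parse_feature_table is a hypothetical helper.

```python
import csv
import io

# A few rows copied from the "Feature lengths" table.
HEADER = ["Feature", "Count", "Mean length (bp)", "Median length (bp)",
          "Min length (bp)", "Max length (bp)"]
ROWS = [
    ["Genes", "24,989", "39,781", "9,085", "51", "2,391,823"],
    ["All transcripts", "37,696", "2,418", "1,824", "51", "103,779"],
    ["mRNA", "32,646", "2,737", "2,084", "108", "103,779"],
    ["CDSs", "32,646", "1,948", "1,410", "99", "102,543"],
]
TABLE_TEXT = "\n".join("\t".join(row) for row in [HEADER, *ROWS])


def parse_feature_table(text: str) -> dict:
    """Map each feature name to a dict of integer column values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        name = row.pop("Feature").strip()
        # Remove thousands separators before converting to int.
        table[name] = {key: int(value.replace(",", ""))
                       for key, value in row.items()}
    return table


lengths = parse_feature_table(TABLE_TEXT)

# Cross-checks against the "Feature counts" table.
noncoding_transcripts = (lengths["All transcripts"]["Count"]
                         - lengths["mRNA"]["Count"])
expected_genes = 20_380 + 4_534 + 75  # protein-coding + non-coding + other

print(noncoding_transcripts == 5_050)
print(lengths["Genes"]["Count"] == expected_genes)
```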

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.52	1	1	50
Number of exons per transcript	10.91	8	1	314

BUSCO analysis of gene annotation

BUSCO v4.1.4 was run in "protein" mode with the laurasiatheria_odb10 lineage dataset on the annotated gene set, using only the longest protein for each gene. Results are reported for the gene set from the primary assembly unit and are presented in BUSCO notation.
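
BUSCO notation packs the percentages of complete (C), single-copy (S), duplicated (D), fragmented (F) and missing (M) BUSCOs, plus the number of BUSCOs searched (n), into one string. The sketch below parses such a string. The example string contains made-up placeholder values chosen only to illustrate the format; it is not the result for this annotation release, and parse_busco_notation is a hypothetical helper.

```python
import re

# Pattern for the one-line BUSCO summary, e.g. "C:..%[S:..%,D:..%],F:..%,M:..%,n:..."
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_notation(summary: str) -> dict:
    """Return the percentages as floats and n as an int."""
    match = _BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary string: {summary!r}")
    fields = match.groupdict()
    parsed = {key: float(fields[key]) for key in ("C", "S", "D", "F", "M")}
    parsed["n"] = int(fields["n"])
    return parsed


# PLACEHOLDER values for illustration only, not results from this report.
example = "C:90.0%[S:88.0%,D:2.0%],F:4.0%,M:6.0%,n:1000"
busco = parse_busco_notation(example)
print(busco)
```

In this notation the complete percentage is the sum of the single-copy and duplicated percentages, which gives a simple sanity check after parsing.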

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
BUSCO: Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. Molecular Biology and Evolution 2021, 38(10):4647-4654
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
STAR: Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29(1):15-21
Minimap2: Li H. Bioinformatics 2018, 34(18):3094-3100
