Prokaryote gene location report

Prokaryote gene location record identifiers, organism, and genomic locations

The downloaded prokaryote package contains a prokaryote gene location data report in JSON Lines format in the file:

ncbi_dataset/data/annotation_report.jsonl

Each line of the prokaryote gene location data report file is a hierarchical JSON object that represents a single prokaryote gene location record. The schema of the record is defined in the tables below, where each row describes either a single field in the report or a sub-structure (a collection of fields). The outermost structure of the report is ProkaryoteGeneLocation.
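
Because the report holds one JSON object per line, it can be read with a streaming loop rather than loaded as a single JSON document. The following is a minimal sketch using only the Python standard library; the function name `read_gene_locations` is hypothetical and is not part of any NCBI tool.

```python
import json
from typing import Iterator


def read_gene_locations(path: str) -> Iterator[dict]:
    """Yield one ProkaryoteGeneLocation record (as a dict) per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                yield json.loads(line)
```

Calling `read_gene_locations("ncbi_dataset/data/annotation_report.jsonl")` from the directory where the package was unpacked should then yield the records one at a time, which keeps memory use flat for large reports.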

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform prokaryote gene location data reports from JSON Lines to tabular formats.

Sample report

{
  "chromosomeName": "plasmid:p2017-45-35",
  "genbankGenomicLocation": {
    "assemblyAccession": "GCA_030179155.1",
    "sequenceRange": {
      "accessionVersion": "CP109734.1",
      "range": [
        {
          "begin": "37360",
          "end": "37773",
          "orientation": "minus"
        }
      ]
    }
  },
  "organism": {
    "organismName": "Providencia thailandensis",
    "strain": "2017-45-35",
    "taxId": 990144
  },
  "proteinAccession": "WP_001435165.1",
  "refseqGenomicLocation": {
    "assemblyAccession": "GCF_030179155.1",
    "sequenceRange": {
      "accessionVersion": "NZ_CP109734.1",
      "range": [
        {
          "begin": "37360",
          "end": "37773",
          "orientation": "minus"
        }
      ]
    }
  }
}
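
The nesting in the sample report can be flattened into one row per interval. The sketch below is illustrative only: `flatten_location` and the output key names are hypothetical, and `SAMPLE_RECORD` is an abridged copy of the sample above. Two details come from the sample itself: `begin` and `end` appear as JSON strings even though the schema types them as uint64 (consistent with how protocol buffers serialize 64-bit integers to JSON), and schema fields such as `completeness` may be absent from a record, so the code reads fields defensively.

```python
from typing import List

# Abridged copy of the sample report (RefSeq location only).
SAMPLE_RECORD = {
    "chromosomeName": "plasmid:p2017-45-35",
    "proteinAccession": "WP_001435165.1",
    "refseqGenomicLocation": {
        "assemblyAccession": "GCF_030179155.1",
        "sequenceRange": {
            "accessionVersion": "NZ_CP109734.1",
            "range": [
                {"begin": "37360", "end": "37773", "orientation": "minus"}
            ],
        },
    },
}


def flatten_location(record: dict, source: str = "refseqGenomicLocation") -> List[dict]:
    """Return one flat row per interval of the chosen SeqRangeWithAssembly field."""
    location = record.get(source, {})
    seq_range = location.get("sequenceRange", {})
    rows = []
    for interval in seq_range.get("range", []):
        rows.append({
            "protein_accession": record.get("proteinAccession"),
            "assembly_accession": location.get("assemblyAccession"),
            "sequence_accession": seq_range.get("accessionVersion"),
            # Coordinates arrive as strings; convert them to integers.
            # Assumption: a missing coordinate is treated as 0.
            "begin": int(interval.get("begin", 0)),
            "end": int(interval.get("end", 0)),
            "orientation": interval.get("orientation"),
        })
    return rows


rows = flatten_location(SAMPLE_RECORD)
```

Passing `source="genbankGenomicLocation"` would read the GenBank mapping instead; for the abridged record above that returns an empty list because the field was left out.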

ProkaryoteGeneLocation Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| proteinAccession | protein-accession | Protein Accession | string | The RefSeq WP_ prefixed accession for the protein sequence. | WP_000443665.1 |
| refseqGenomicLocation | refseq-genomic-location- | RefSeq Genomic Location | SeqRangeWithAssembly | The RefSeq nucleotide mapping for this protein | |
| genbankGenomicLocation | genbank-genomic-location- | GenBank Genomic Location | SeqRangeWithAssembly | The equivalent GenBank nucleotide mapping for this protein | |
| organism | organism- | Organism | Organism | The species level taxonomy information | |
| completeness | completeness | Completeness | ProkaryoteGeneLocation.Completeness | Whether the assembly is complete or partial | |
| chromosomeName | chromosome_name | Chromosome | string | The name of the chromosome, if there is one. | |

InfraspecificNames Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer |
| cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
| ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
| isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09 |
| sex | sex | Sex | string | Male or female | female |
| strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |

LineageOrganism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
| name | coming soon | coming soon | string | Scientific name | Coronaviridae |

Organism Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606; 2697049 |
| organismName | name | Name | string | Scientific name | Homo sapiens; Severe acute respiratory syndrome coronavirus 2 |
| commonName | common-name | Common Name | string | Common name | human; pangolin; MERS; SARS2 |
| lineage (repeated) | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
| pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
| infraspecificNames | infraspecific- | Infraspecific Names | InfraspecificNames | | |

Range Structure

A 1-based range on a sequence record.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| begin | start | Start | uint64 | | |
| end | stop | Stop | uint64 | | |
| orientation | orientation | Orientation | Orientation | | |
| order | order | Order | uint32 | | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
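
As a worked example of the 1-based coordinates, the length of an interval can be computed as `end - begin + 1`. This is a sketch under one stated assumption: that `end` is inclusive, which the schema does not say explicitly. The sample report is consistent with it, since 37773 - 37360 + 1 = 414 bases, a whole number of codons. The helper names are hypothetical.

```python
from typing import Iterable


def interval_length(interval: dict) -> int:
    """Length in bases of one Range, assuming 1-based inclusive coordinates."""
    begin = int(interval["begin"])  # coordinates may arrive as JSON strings
    end = int(interval["end"])
    if end < begin:
        raise ValueError(f"end ({end}) precedes begin ({begin})")
    return end - begin + 1


def total_length(intervals: Iterable[dict]) -> int:
    """Sum of interval lengths for a repeated range field."""
    return sum(interval_length(interval) for interval in intervals)


sample_interval = {"begin": "37360", "end": "37773", "orientation": "minus"}
length = interval_length(sample_interval)  # 414 under the inclusive-end assumption
```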

SeqRangeSet Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
| range (repeated) | range- | | Range | Series of intervals on above accession_version | |

SeqRangeWithAssembly Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| assemblyAccession | assembly-accession | Assembly Accession | string | The genomic assembly associated with the sequence location of this protein | GCF_000010385.1 |
| sequenceRange | seq-range- | | SeqRangeSet | The genomic sequence location of this protein | |

Orientation Enumeration

| Name | Number | Description |
| --- | --- | --- |
| none | 0 | |
| plus | 1 | |
| minus | 2 | |
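
One way to work with this enumeration in client code is to mirror it as a Python `IntEnum`. The sketch below copies the names and numbers from the table; everything else is an assumption made for illustration, including the `parse_orientation` helper, the choice to treat a missing field as `none`, and the GFF-style strand symbols, none of which are defined by the report schema.

```python
from enum import IntEnum
from typing import Union


class Orientation(IntEnum):
    """Names and numbers copied from the Orientation enumeration table."""
    none = 0
    plus = 1
    minus = 2


# Assumption: GFF-style strand symbols; these are not part of the report schema.
STRAND_SYMBOL = {
    Orientation.none: ".",
    Orientation.plus: "+",
    Orientation.minus: "-",
}


def parse_orientation(value: Union[str, int, None]) -> Orientation:
    """Accept an enum name ("minus"), an enum number (2), or a missing value."""
    if value is None:
        return Orientation.none  # assumption: a missing field means "none"
    if isinstance(value, str):
        return Orientation[value]  # raises KeyError for an unknown name
    return Orientation(value)  # raises ValueError for an unknown number
```

With these definitions, `STRAND_SYMBOL[parse_orientation("minus")]` evaluates to `"-"`.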

ProkaryoteGeneLocation.Completeness Enumeration

| Name | Number | Description |
| --- | --- | --- |
| complete | 0 | |
| partial | 1 | |

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
Generated May 16, 2024