Use Entrez and Python to search, retrieve, and parse dbVar records.
Objectives:
1. Search dbVar using Entrez eSearch
2. Retrieve results using eSummary
3. Parse eSummary XML results and print tab-delimited output
#### General details on the eUtils tools and options, along with tutorials and examples, are available on the NCBI Bookshelf: https://www.ncbi.nlm.nih.gov/books/NBK25499/
first part of Python script:
# First test on the web (https://www.ncbi.nlm.nih.gov/dbvar) that your
# search returns results before using it
# in your Python script. Available dbVar search terms are listed on the help page
# (https://www.ncbi.nlm.nih.gov/dbvar/content/help/#entrezsearch).
# For general Entrez help and Boolean searching, see the online book
# (https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options).
# This example uses three eUtils History server parameters:
# usehistory, WebEnv, and query_key. Using them in your own pipelines and
# scripts is highly recommended.
# /usehistory=/
# When usehistory is set to 'y', ESearch will post the UIDs resulting from the
# search operation onto the History server so that they can be used directly in
# a subsequent E-utility call. Also, usehistory must be set to 'y' for ESearch
# to interpret query key values included in term or to accept a WebEnv as input.
# /WebEnv=/
# Web environment string returned from a previous ESearch, EPost or ELink call.
# When provided, ESearch will post the results of the search operation to this
# pre-existing WebEnv, thereby appending the results to the existing
# environment. In addition, providing WebEnv allows query keys to be used in
# term so that previous search sets can be combined or limited. As described
# above, if WebEnv is used, usehistory must be set to 'y' (e.g.,
# esearch.fcgi?db=dbvar&term=asthma&WebEnv=<webenv string>&usehistory=y).
# /query_key=/
# Integer query key returned by a previous ESearch, EPost or ELink call. When
# provided, ESearch will find the intersection of the set specified by query_key
# and the set retrieved by the query in term (i.e. joins the two with AND). For
# query_key to function, WebEnv must be assigned an existing WebEnv string and
# usehistory must be set to 'y'.
# load python modules
# May require a one-time install of biopython and xmltodict.
from Bio import Entrez
import xmltodict
# initialize some default parameters
Entrez.email = 'myemail@ncbi.nlm.nih.gov' # provide your email address
db = 'dbvar' # set search to dbVar database
paramEutils = {'usehistory': 'y'} # use the Entrez History server to cache results
# generate query to Entrez eSearch
eSearch = Entrez.esearch(db=db, term='("variant"[Object Type] AND nstd102)', **paramEutils)
# get eSearch result as dict object
res = Entrez.read(eSearch)
# take a peek at what's in the result (i.e. WebEnv, Count, etc.)
for k in res:
    print(k, "=", res[k])
# add WebEnv and query_key to the eUtils parameters so that eSummary reads the
# cached search results from the History server instead of using IdList
paramEutils['WebEnv'] = res['WebEnv']
paramEutils['query_key'] = res['QueryKey']
paramEutils['retmode'] = 'xml' # get report as XML (eSummary takes retmode, not rettype)
paramEutils['retstart'] = 0 # start at result 0, the top of the result set
paramEutils['retmax'] = 5 # get the next five results
# generate request to Entrez eSummary
result = Entrez.esummary(db=db, **paramEutils)
# get xml result
xml = result.read()
# take a peek at xml
print(xml)
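The script above requests only the first five summaries (retstart=0, retmax=5). Because the search results are cached on the History server, the remaining records can be fetched in batches by advancing retstart while reusing the same WebEnv and query_key. The following is a minimal sketch of that idea, not part of the original script: the helper names `batch_offsets` and `fetch_summaries` and the default batch size of 500 are assumptions chosen for illustration, and `Entrez.email` is assumed to have been set beforehand.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield start, min(batch_size, total - start)


def fetch_summaries(webenv, query_key, total, batch_size=500):
    """Yield the raw eSummary XML of each batch (performs network requests).

    Hypothetical helper: webenv and query_key come from a previous eSearch
    made with usehistory='y'; total is that search's Count.
    """
    # Imported here so batch_offsets stays usable without Biopython installed.
    from Bio import Entrez

    for retstart, retmax in batch_offsets(total, batch_size):
        handle = Entrez.esummary(db='dbvar', WebEnv=webenv, query_key=query_key,
                                 retstart=retstart, retmax=retmax, retmode='xml')
        try:
            yield handle.read()
        finally:
            handle.close()
```

With the eSearch result from the script, a call might look like `fetch_summaries(res['WebEnv'], res['QueryKey'], int(res['Count']))`; each yielded XML string can then be parsed on its own.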
peek at xml:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary dbvar 20170523//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20170523/esummary_dbvar.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build190602-0830.1</DbBuild>
<DocumentSummary uid="48583463">
<OBJ_TYPE>VARIANT</OBJ_TYPE>
<ST>nstd102</ST>
<SV>nsv3972446</SV>
<Study_type></Study_type>
<Variant_count>0</Variant_count>
<dbVarPublicationList>
</dbVarPublicationList>
<dbVarStudyOrgList>
</dbVarStudyOrgList>
<Method_type_weight>Non-BAC</Method_type_weight>
<Tax_ID>9606</Tax_ID>
<Organism>human</Organism>
<dbVarSubmittedAssemblyList>
</dbVarSubmittedAssemblyList>
<dbVarRemappedAssemblyList>
</dbVarRemappedAssemblyList>
<dbVarPlacementList>
<dbVarPlacement>
<Chr>3</Chr>
<Chr_accession_version>NC_000003.12</Chr_accession_version>
<Contig_accession_version></Contig_accession_version>
<Chr_start>37017508</Chr_start>
<Chr_end>37017509</Chr_end>
<Chr_inner_start>0</Chr_inner_start>
<Chr_inner_end>0</Chr_inner_end>
<Chr_outer_start>0</Chr_outer_start>
<Chr_outer_end>0</Chr_outer_end>
<Assembly>GRCh38 (hg38)</Assembly>
<Assembly_accession>GCF_000001405.26</Assembly_accession>
<Assembly_tax_ID>9606</Assembly_tax_ID>
<Placement_type>Submitted genomic</Placement_type>
</dbVarPlacement>
<dbVarPlacement>
<Chr>3</Chr>
<Chr_accession_version>NC_000003.11</Chr_accession_version>
<Contig_accession_version></Contig_accession_version>
<Chr_start>37058999</Chr_start>
<Chr_end>37059000</Chr_end>
<Chr_inner_start>0</Chr_inner_start>
<Chr_inner_end>0</Chr_inner_end>
<Chr_outer_start>0</Chr_outer_start>
<Chr_outer_end>0</Chr_outer_end>
<Assembly>GRCh37.p13</Assembly>
<Assembly_accession>GCF_000001405.25</Assembly_accession>
<Assembly_tax_ID>9606</Assembly_tax_ID>
<Placement_type>Remapped</Placement_type>
</dbVarPlacement>
<dbVarPlacement>
<Chr>3</Chr>
<Chr_accession_version>NC_000003.10</Chr_accession_version>
<Contig_accession_version></Contig_accession_version>
<Chr_start>37034003</Chr_start>
<Chr_end>37034004</Chr_end>
<Chr_inner_start>0</Chr_inner_start>
<Chr_inner_end>0</Chr_inner_end>
<Chr_outer_start>0</Chr_outer_start>
<Chr_outer_end>0</Chr_outer_end>
<Assembly>NCBI36 (hg18)</Assembly>
<Assembly_accession>GCF_000001405.12</Assembly_accession>
<Assembly_tax_ID>9606</Assembly_tax_ID>
<Placement_type>Remapped</Placement_type>
</dbVarPlacement>
</dbVarPlacementList>
<dbVarGeneList>
<dbVarGene>
<id>4292</id>
<name>MLH1</name>
</dbVarGene>
<dbVarGene>
<id>100131713</id>
<name>RPL29P11</name>
</dbVarGene>
</dbVarGeneList>
<dbVarMethodList>
<string>Multiple</string>
</dbVarMethodList>
<dbVarClinicalSignificanceList>
<string>Pathogenic</string>
</dbVarClinicalSignificanceList>
<dbVarVariantTypeList>
<string>indel</string>
</dbVarVariantTypeList>
<Validation_status_weight>0</Validation_status_weight>
<Variant_call_count>1</Variant_call_count>
<Validation_status></Validation_status>
</DocumentSummary>
...
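The XML above nests one or more dbVarPlacement and dbVarGene elements inside each DocumentSummary. As an alternative to xmltodict, the standard-library `xml.etree.ElementTree` module can pull those fields out directly. The sketch below is an illustration only: `SAMPLE` is a shortened copy of the record shown above, and the function name `summarize` and the shape of the returned dictionaries are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the eSummary record shown above.
SAMPLE = """<eSummaryResult>
<DocumentSummarySet status="OK">
<DocumentSummary uid="48583463">
<ST>nstd102</ST>
<SV>nsv3972446</SV>
<dbVarPlacementList>
<dbVarPlacement>
<Chr>3</Chr>
<Chr_start>37017508</Chr_start>
<Chr_end>37017509</Chr_end>
<Assembly>GRCh38 (hg38)</Assembly>
</dbVarPlacement>
<dbVarPlacement>
<Chr>3</Chr>
<Chr_start>37058999</Chr_start>
<Chr_end>37059000</Chr_end>
<Assembly>GRCh37.p13</Assembly>
</dbVarPlacement>
</dbVarPlacementList>
<dbVarGeneList>
<dbVarGene><id>4292</id><name>MLH1</name></dbVarGene>
<dbVarGene><id>100131713</id><name>RPL29P11</name></dbVarGene>
</dbVarGeneList>
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>"""


def summarize(xml_text):
    """Return one dict per DocumentSummary with its genes and placements."""
    root = ET.fromstring(xml_text)
    records = []
    for ds in root.iter('DocumentSummary'):
        placements = [
            (p.findtext('Assembly'), p.findtext('Chr'),
             int(p.findtext('Chr_start')), int(p.findtext('Chr_end')))
            for p in ds.iter('dbVarPlacement')
        ]
        records.append({
            'uid': ds.get('uid'),                 # XML attribute
            'study': ds.findtext('ST'),
            'variant': ds.findtext('SV'),
            'genes': [g.findtext('name') for g in ds.iter('dbVarGene')],
            'placements': placements,
        })
    return records


for record in summarize(SAMPLE):
    print(record['uid'], record['variant'], ','.join(record['genes']), sep='\t')
```

Because `iter()` returns an empty sequence when no matching element exists, records with an empty dbVarPlacementList or dbVarGeneList simply produce empty lists here.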
second part of Python script:
# convert XML to a Python dict object for convenient parsing
dsdocs = xmltodict.parse(xml)
# get the set of dbVar DocumentSummary records (dsdocs) and, for each record (ds),
# print one tab-delimited line per placement (p)
for ds in dsdocs['eSummaryResult']['DocumentSummarySet']['DocumentSummary']:
    for p in ds['dbVarPlacementList']['dbVarPlacement']:
        print(ds['@uid'], ds['ST'], ds['SV'], p['Chr'], p['Chr_start'], p['Chr_end'], p['Chr_inner_start'], p['Chr_inner_end'], sep='\t')
output:
Count = 55722
RetMax = 20
RetStart = 0
QueryKey = 1
WebEnv = NCID_1_77655541_130.14.22.76_9001_1560183999_561741772_0MetA0_S_MegaStore
IdList = ['48583463', '48583462', '48583461', '48583460', '48583459', '48583458', '48583457', '48583456', '48583455', '48583454', '48583453', '48583452', '48583451', '48583450', '48583449', '48583448', '48583447', '48583446', '48583445', '48583444']
TranslationSet = []
TranslationStack = [DictElement({'Term': '"variant"[Object Type]', 'Field': 'Object Type', 'Count': '5535034', 'Explode': 'N'}, attributes={}), DictElement({'Term': 'nstd102[All Fields]', 'Field': 'All Fields', 'Count': '55723', 'Explode': 'N'}, attributes={}), 'AND', 'GROUP']
QueryTranslation = "variant"[Object Type] AND nstd102[All Fields]
48583463 nstd102 nsv3972446 3 37017508 37017509 0 0
48583463 nstd102 nsv3972446 3 37058999 37059000 0 0
48583463 nstd102 nsv3972446 3 37034003 37034004 0 0
48583462 nstd102 nsv3972445 12 6022792 6022793 0 0
48583462 nstd102 nsv3972445 12 6131958 6131959 0 0
48583462 nstd102 nsv3972445 12 6002219 6002220 0 0
48583461 nstd102 nsv3972444 2 219570775 219570776 0 0
48583461 nstd102 nsv3972444 2 220435497 220435498 0 0
48583461 nstd102 nsv3972444 2 220143741 220143742 0 0
48583460 nstd102 nsv3972443 1 109610052 109610058 0 0
48583460 nstd102 nsv3972443 1 110152674 110152680 0 0
48583460 nstd102 nsv3972443 1 109954197 109954203 0 0
48583459 nstd102 nsv3972442 9 35092494 35092495 0 0
48583459 nstd102 nsv3972442 9 35092491 35092492 0 0
48583459 nstd102 nsv3972442 9 35082491 35082492 0 0
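One caveat about the nested loop in the second part of the script: xmltodict returns a list only when an element occurs more than once. A result with a single DocumentSummary, or a record with a single dbVarPlacement, is parsed as a plain dict, and an empty dbVarPlacementList is parsed as None, so the loop can fail or iterate over dictionary keys. The sketch below shows one way to guard against that on the already parsed dict; `as_list` and `placement_rows` are hypothetical helper names, not part of xmltodict or of the original script.

```python
def as_list(value):
    """Normalize a parsed value: None -> [], single item -> [item], list unchanged."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def placement_rows(dsdocs):
    """Yield one tab-delimited line per placement in a parsed eSummary result."""
    doc_set = dsdocs['eSummaryResult']['DocumentSummarySet']
    for ds in as_list(doc_set.get('DocumentSummary')):
        # An empty <dbVarPlacementList> is parsed as None, so fall back to {}.
        placement_list = ds.get('dbVarPlacementList') or {}
        for p in as_list(placement_list.get('dbVarPlacement')):
            fields = (ds.get('@uid'), ds.get('ST'), ds.get('SV'),
                      p.get('Chr'), p.get('Chr_start'), p.get('Chr_end'),
                      p.get('Chr_inner_start'), p.get('Chr_inner_end'))
            # Empty XML elements are parsed as None; print them as empty fields.
            yield '\t'.join('' if f is None else str(f) for f in fields)
```

In the script, the loop would then become `for line in placement_rows(dsdocs): print(line)`. xmltodict also offers a `force_list` argument to `parse()` that addresses the single-element case at parse time.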