Use Entrez and Python to search, retrieve, and parse dbVar records.

Objectives:

1. Search dbVar using Entrez eSearch

2. Retrieve results using eSummary

3. Parse eSummary XML results and print tab-delimited output

#### General details on eUtils tools and options, along with tutorials and examples, are available on the NCBI Bookshelf: https://www.ncbi.nlm.nih.gov/books/NBK25499/

first part of Python script:

# You should first test on the web (https://www.ncbi.nlm.nih.gov/dbvar) that
# your search returns results before using it in your Python script.
# Available dbVar search terms are listed on the help page
# (https://www.ncbi.nlm.nih.gov/dbvar/content/help/#entrezsearch).
# For general Entrez help and Boolean searching, see the online book
# (https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options).

# This example uses the eUtils History Server parameters usehistory, WebEnv,
# and query_key. Using them in your own pipelines and scripts is highly
# recommended.

# /usehistory=/
# When usehistory is set to 'y', ESearch will post the UIDs resulting from the
# search operation onto the History server so that they can be used directly in
# a subsequent E-utility call. Also, usehistory must be set to 'y' for ESearch
# to interpret query key values included in term or to accept a WebEnv as input.

# /WebEnv=/
# Web environment string returned from a previous ESearch, EPost or ELink call.
# When provided, ESearch will post the results of the search operation to this
# pre-existing WebEnv, thereby appending the results to the existing
# environment. In addition, providing WebEnv allows query keys to be used in
# term so that previous search sets can be combined or limited. As described
# above, if WebEnv is used, usehistory must be set to 'y' (e.g.,
# esearch.fcgi?db=dbvar&term=asthma&WebEnv=<webenv string>&usehistory=y).

# /query_key=/
# Integer query key returned by a previous ESearch, EPost or ELink call. When
# provided, ESearch will find the intersection of the set specified by query_key
# and the set retrieved by the query in term (i.e. joins the two with AND). For
# query_key to function, WebEnv must be assigned an existing WebEnv string and
# usehistory must be set to 'y'.

# load Python modules
# May require a one-time install of biopython and xmltodict.
from Bio import Entrez
import xmltodict

# initialize some default parameters
Entrez.email = 'myemail@ncbi.nlm.nih.gov' # provide your email address
db = 'dbvar'                              # set search to dbVar database
paramEutils = { 'usehistory':'y' }        # Use Entrez search history to cache results

# generate query to Entrez eSearch (study nstd102, matching the XML and output shown below)
eSearch = Entrez.esearch(db=db, term='("variant"[Object Type] AND nstd102)', **paramEutils)

# get eSearch result as dict object
res = Entrez.read(eSearch)

# take a peek at what's in the result (e.g., WebEnv, Count)
for k in res:
    print(k, "=", res[k])

# add WebEnv and query_key to the eUtils parameters so that eSummary uses the
# search history (cached results) instead of the IdList
paramEutils['WebEnv'] = res['WebEnv']
paramEutils['query_key'] = res['QueryKey']
paramEutils['rettype'] = 'xml'                # get report as xml
paramEutils['retstart'] = 0                   # start at result 0, the top of the result set
paramEutils['retmax'] = 5                     # retrieve five results

# generate request to Entrez eSummary
result = Entrez.esummary(db=db, **paramEutils)
# get xml result
xml = result.read()
# take a peek at xml
print(xml)
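
The script above retrieves only the first five summaries. One way to walk through a larger result set is to keep WebEnv and query_key fixed and advance retstart in steps of retmax. The sketch below only builds the parameter dictionaries for each request and makes no network calls. The function name `history_batches`, its arguments, and the placeholder WebEnv value are assumptions made for illustration; they are not part of Biopython or eUtils.

```python
def history_batches(web_env, query_key, count, batch_size=500):
    """Yield one eSummary parameter dict per batch of a History Server result set.

    Hypothetical helper: web_env, query_key and count are assumed to come from
    a previous eSearch result (res['WebEnv'], res['QueryKey'], int(res['Count'])).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, count, batch_size):
        yield {
            "WebEnv": web_env,
            "query_key": query_key,
            "retstart": start,
            # the last batch may be smaller than batch_size
            "retmax": min(batch_size, count - start),
        }


if __name__ == "__main__":
    # Placeholder values only; no request is sent to NCBI here.
    for params in history_batches("WEBENV_PLACEHOLDER", "1", count=12, batch_size=5):
        print(params["retstart"], params["retmax"])
```

Each dictionary could then be passed to `Entrez.esummary(db=db, **params)` in place of the fixed `retstart` and `retmax` values set above.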

peek at xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary dbvar 20170523//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20170523/esummary_dbvar.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build190602-0830.1</DbBuild>

<DocumentSummary uid="48583463">
        <OBJ_TYPE>VARIANT</OBJ_TYPE>
        <ST>nstd102</ST>
        <SV>nsv3972446</SV>
        <Study_type></Study_type>
        <Variant_count>0</Variant_count>
        <dbVarPublicationList>
        </dbVarPublicationList>
        <dbVarStudyOrgList>
        </dbVarStudyOrgList>
        <Method_type_weight>Non-BAC</Method_type_weight>
        <Tax_ID>9606</Tax_ID>
        <Organism>human</Organism>
        <dbVarSubmittedAssemblyList>
        </dbVarSubmittedAssemblyList>
        <dbVarRemappedAssemblyList>
        </dbVarRemappedAssemblyList>
        <dbVarPlacementList>
                <dbVarPlacement>
                        <Chr>3</Chr>
                        <Chr_accession_version>NC_000003.12</Chr_accession_version>
                        <Contig_accession_version></Contig_accession_version>
                        <Chr_start>37017508</Chr_start>
                        <Chr_end>37017509</Chr_end>
                        <Chr_inner_start>0</Chr_inner_start>
                        <Chr_inner_end>0</Chr_inner_end>
                        <Chr_outer_start>0</Chr_outer_start>
                        <Chr_outer_end>0</Chr_outer_end>
                        <Assembly>GRCh38 (hg38)</Assembly>
                        <Assembly_accession>GCF_000001405.26</Assembly_accession>
                        <Assembly_tax_ID>9606</Assembly_tax_ID>
                        <Placement_type>Submitted genomic</Placement_type>
                </dbVarPlacement>
                <dbVarPlacement>
                        <Chr>3</Chr>
                        <Chr_accession_version>NC_000003.11</Chr_accession_version>
                        <Contig_accession_version></Contig_accession_version>
                        <Chr_start>37058999</Chr_start>
                        <Chr_end>37059000</Chr_end>
                        <Chr_inner_start>0</Chr_inner_start>
                        <Chr_inner_end>0</Chr_inner_end>
                        <Chr_outer_start>0</Chr_outer_start>
                        <Chr_outer_end>0</Chr_outer_end>
                        <Assembly>GRCh37.p13</Assembly>
                        <Assembly_accession>GCF_000001405.25</Assembly_accession>
                        <Assembly_tax_ID>9606</Assembly_tax_ID>
                        <Placement_type>Remapped</Placement_type>
                </dbVarPlacement>
                <dbVarPlacement>
                        <Chr>3</Chr>
                        <Chr_accession_version>NC_000003.10</Chr_accession_version>
                        <Contig_accession_version></Contig_accession_version>
                        <Chr_start>37034003</Chr_start>
                        <Chr_end>37034004</Chr_end>
                        <Chr_inner_start>0</Chr_inner_start>
                        <Chr_inner_end>0</Chr_inner_end>
                        <Chr_outer_start>0</Chr_outer_start>
                        <Chr_outer_end>0</Chr_outer_end>
                        <Assembly>NCBI36 (hg18)</Assembly>
                        <Assembly_accession>GCF_000001405.12</Assembly_accession>
                        <Assembly_tax_ID>9606</Assembly_tax_ID>
                        <Placement_type>Remapped</Placement_type>
                </dbVarPlacement>
        </dbVarPlacementList>
        <dbVarGeneList>
                <dbVarGene>
                        <id>4292</id>
                        <name>MLH1</name>
                </dbVarGene>
                <dbVarGene>
                        <id>100131713</id>
                        <name>RPL29P11</name>
                </dbVarGene>
        </dbVarGeneList>
        <dbVarMethodList>
                <string>Multiple</string>
        </dbVarMethodList>
        <dbVarClinicalSignificanceList>
                <string>Pathogenic</string>
        </dbVarClinicalSignificanceList>
        <dbVarVariantTypeList>
                <string>indel</string>
        </dbVarVariantTypeList>
        <Validation_status_weight>0</Validation_status_weight>
        <Variant_call_count>1</Variant_call_count>
        <Validation_status></Validation_status>
</DocumentSummary>
...

second part of Python script:

# convert xml to python dict object for convenient parsing
dsdocs = xmltodict.parse(xml)

# get set of dbVar DocumentSummary (dsdocs) and print report for each (ds)
for ds in dsdocs['eSummaryResult']['DocumentSummarySet']['DocumentSummary']:
    # each DocumentSummary can carry several placements (one per assembly)
    for p in ds['dbVarPlacementList']['dbVarPlacement']:
        print(ds['@uid'], ds['ST'], ds['SV'], p['Chr'], p['Chr_start'], p['Chr_end'], p['Chr_inner_start'], p['Chr_inner_end'])
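
One caveat with the loop above: by default xmltodict returns a single dict rather than a list when an element has only one child of a given name, so a result with exactly one DocumentSummary or one dbVarPlacement would not iterate as expected. As an alternative, the sketch below uses the standard library's xml.etree.ElementTree instead of xmltodict and writes tab-delimited lines with the csv module. It is not the method used in this tutorial. The function name `placements_to_tsv` and the choice of columns are assumptions, and `SAMPLE_XML` is a trimmed excerpt of the eSummary record shown earlier.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed excerpt of the eSummary XML shown earlier (one record, one placement).
SAMPLE_XML = """<eSummaryResult>
<DocumentSummarySet status="OK">
<DocumentSummary uid="48583463">
  <ST>nstd102</ST>
  <SV>nsv3972446</SV>
  <dbVarPlacementList>
    <dbVarPlacement>
      <Chr>3</Chr>
      <Chr_start>37017508</Chr_start>
      <Chr_end>37017509</Chr_end>
      <Assembly>GRCh38 (hg38)</Assembly>
    </dbVarPlacement>
  </dbVarPlacementList>
</DocumentSummary>
</DocumentSummarySet>
</eSummaryResult>"""


def placements_to_tsv(xml_text):
    """Return one tab-delimited line per dbVarPlacement in an eSummary document."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # iter() yields every matching element, whether there is one or many.
    for ds in root.iter("DocumentSummary"):
        for p in ds.iter("dbVarPlacement"):
            writer.writerow([
                ds.get("uid"),
                ds.findtext("ST", default=""),
                ds.findtext("SV", default=""),
                p.findtext("Chr", default=""),
                p.findtext("Chr_start", default=""),
                p.findtext("Chr_end", default=""),
                p.findtext("Assembly", default=""),
            ])
    return out.getvalue()


if __name__ == "__main__":
    print(placements_to_tsv(SAMPLE_XML), end="")
```

Including the Assembly column makes it easier to tell apart the several rows reported for each variant, since each row corresponds to a placement on a different assembly.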

output:

Count = 55722
RetMax = 20
RetStart = 0
QueryKey = 1
WebEnv = NCID_1_77655541_130.14.22.76_9001_1560183999_561741772_0MetA0_S_MegaStore
IdList = ['48583463', '48583462', '48583461', '48583460', '48583459', '48583458', '48583457', '48583456', '48583455', '48583454', '48583453', '48583452', '48583451', '48583450', '48583449', '48583448', '48583447', '48583446', '48583445', '48583444']
TranslationSet = []
TranslationStack = [DictElement({'Term': '"variant"[Object Type]', 'Field': 'Object Type', 'Count': '5535034', 'Explode': 'N'}, attributes={}), DictElement({'Term': 'nstd102[All Fields]', 'Field': 'All Fields', 'Count': '55723', 'Explode': 'N'}, attributes={}), 'AND', 'GROUP']
QueryTranslation = "variant"[Object Type] AND nstd102[All Fields]
48583463 nstd102 nsv3972446 3 37017508 37017509 0 0
48583463 nstd102 nsv3972446 3 37058999 37059000 0 0
48583463 nstd102 nsv3972446 3 37034003 37034004 0 0
48583462 nstd102 nsv3972445 12 6022792 6022793 0 0
48583462 nstd102 nsv3972445 12 6131958 6131959 0 0
48583462 nstd102 nsv3972445 12 6002219 6002220 0 0
48583461 nstd102 nsv3972444 2 219570775 219570776 0 0
48583461 nstd102 nsv3972444 2 220435497 220435498 0 0
48583461 nstd102 nsv3972444 2 220143741 220143742 0 0
48583460 nstd102 nsv3972443 1 109610052 109610058 0 0
48583460 nstd102 nsv3972443 1 110152674 110152680 0 0
48583460 nstd102 nsv3972443 1 109954197 109954203 0 0
48583459 nstd102 nsv3972442 9 35092494 35092495 0 0
48583459 nstd102 nsv3972442 9 35092491 35092492 0 0
48583459 nstd102 nsv3972442 9 35082491 35082492 0 0

dbVar

Database of genomic structural variation
