Use Entrez and Python to search, retrieve, and parse dbVar records.


1. Search dbVar using Entrez eSearch
2. Retrieve results using eSummary
3. Parse eSummary XML results and print tab delimited output
#### General details on eUtils tools and options, along with tutorials and examples,
#### are available on the NCBI Bookshelf: https://www.ncbi.nlm.nih.gov/books/NBK25499/

first part of Python script:

# Test that your search returns results on the dbVar web site first
# (https://www.ncbi.nlm.nih.gov/dbvar) before using it in your Python
# script. Available dbVar search terms are listed on the help page
# (https://www.ncbi.nlm.nih.gov/dbvar/content/help/#entrezsearch).
# For general Entrez help and Boolean searching, see the online book
# (https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options).

# This example will make use of these eUtils History Server parameters
# usehistory, WebEnv, and query_key.  It is highly recommended you use them in
# your pipeline and script.

# /usehistory=/
# When usehistory is set to 'y', ESearch will post the UIDs resulting from the
# search operation onto the History server so that they can be used directly in
# a subsequent E-utility call. Also, usehistory must be set to 'y' for ESearch
# to interpret query key values included in term or to accept a WebEnv as input.

# /WebEnv=/
# Web environment string returned from a previous ESearch, EPost or ELink call.
# When provided, ESearch will post the results of the search operation to this
# pre-existing WebEnv, thereby appending the results to the existing
# environment. In addition, providing WebEnv allows query keys to be used in
# term so that previous search sets can be combined or limited. As described
# above, if WebEnv is used, usehistory must be set to 'y' (e.g.
# esearch.fcgi?db=dbvar&term=asthma&WebEnv=<webenv string>&usehistory=y).

# /query_key=/
# Integer query key returned by a previous ESearch, EPost or ELink call. When
# provided, ESearch will find the intersection of the set specified by query_key
# and the set retrieved by the query in term (i.e. joins the two with AND). For
# query_key to function, WebEnv must be assigned an existing WebEnv string and
# usehistory must be set to 'y'.

# load python modules
# May require a one-time install of biopython and xmltodict.
from Bio import Entrez
import xmltodict

# initialize some default parameters
Entrez.email = 'myemail@ncbi.nlm.nih.gov' # provide your email address
db = 'dbvar'                              # set search to dbVar database
paramEutils = { 'usehistory':'y' }        # Use Entrez search history to cache results

# generate query to Entrez eSearch
eSearch = Entrez.esearch(db=db, term='("variant"[Object Type] AND nstd102)', **paramEutils)

# get eSearch result as dict object
res = Entrez.read(eSearch)

# take a peek at what's in the result (e.g. WebEnv, Count)
for k in res:
    print(k, "=", res[k])

# add WebEnv and query_key to the eUtils parameters so that eSummary reads
# from the search history (cached results) instead of an explicit IdList
paramEutils['WebEnv'] = res['WebEnv']
paramEutils['query_key'] = res['QueryKey']
paramEutils['retmode'] = 'xml'                # get report as XML
paramEutils['retstart'] = 0                   # start at result 0, the top of the IdList
paramEutils['retmax'] = 5                     # return at most five results

# generate request to Entrez eSummary
result = Entrez.esummary(db=db, **paramEutils)
# get xml result
xml = result.read()
# take a peek at xml

peek at xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary dbvar 20170523//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20170523/esummary_dbvar.dtd">
<DocumentSummarySet status="OK">

<DocumentSummary uid="48583463">
                        <Assembly>GRCh38 (hg38)</Assembly>
                        <Placement_type>Submitted genomic</Placement_type>
                        <Assembly>NCBI36 (hg18)</Assembly>
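
The XML above comes from an eSummary request that reads from the History Server rather than from an explicit ID list. To make that request more concrete, the sketch below assembles the kind of URL such a call corresponds to. The helper name build_esummary_url and the placeholder WebEnv string are hypothetical, and the base URL is assumed to be the public E-utilities eSummary endpoint. Biopython builds its own request internally, so this is only an aid for inspection or debugging.

```python
from urllib.parse import urlencode

# Assumed public E-utilities endpoint for eSummary.
ESUMMARY_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def build_esummary_url(db, webenv, query_key, retstart=0, retmax=5):
    """Hypothetical helper: build an eSummary URL that reads from the History Server."""
    params = {
        "db": db,                # database searched by the earlier eSearch call
        "WebEnv": webenv,        # web environment string returned by eSearch
        "query_key": query_key,  # key of the result set stored in that WebEnv
        "retstart": retstart,    # index of the first record to return
        "retmax": retmax,        # maximum number of records to return
    }
    return ESUMMARY_BASE + "?" + urlencode(params)

# Placeholder values; in the script they would come from res['WebEnv'] and res['QueryKey'].
url = build_esummary_url("dbvar", "NCID_example_webenv", "1")
print(url)
```

Pasting a URL built this way into a browser is one quick way to check what a given WebEnv and query_key pair returns before running the full script.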

second part of Python script:

# convert XML to a Python dict object for convenient parsing
dsdocs = xmltodict.parse(xml)

# get the set of dbVar DocumentSummary records (dsdocs) and print a report
# line for each placement (p) of each record (ds)

for ds in dsdocs['eSummaryResult']['DocumentSummarySet']['DocumentSummary']:
    for p in ds['dbVarPlacementList']['dbVarPlacement']:
        print(ds['@uid'], ds['ST'], ds['SV'], p['Chr'], p['Chr_start'], p['Chr_end'], p['Chr_inner_start'], p['Chr_inner_end'])
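
Two details of the loop above may need attention in practice. xmltodict returns a plain dict rather than a list when an element occurs only once, so a record with a single placement would not iterate as expected, and print separates its arguments with spaces unless told otherwise. The sketch below is one possible way to handle both. The helper names as_list and placement_rows are hypothetical, and the example dict is a hand-built stand-in for xmltodict output that uses values from this page, not a real eSummary response.

```python
def as_list(value):
    """Wrap a single item in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def placement_rows(dsdocs):
    """Yield one tab-delimited line per placement of each DocumentSummary."""
    summaries = dsdocs['eSummaryResult']['DocumentSummarySet']['DocumentSummary']
    for ds in as_list(summaries):
        placements = (ds.get('dbVarPlacementList') or {}).get('dbVarPlacement')
        for p in as_list(placements):
            fields = [ds['@uid'], ds['ST'], ds['SV'], p['Chr'], p['Chr_start'],
                      p['Chr_end'], p['Chr_inner_start'], p['Chr_inner_end']]
            yield '\t'.join(str(f) for f in fields)

# Hand-built stand-in: one record with a single placement, so both levels
# are plain dicts instead of lists.
example = {
    'eSummaryResult': {
        'DocumentSummarySet': {
            'DocumentSummary': {
                '@uid': '48583463', 'ST': 'nstd102', 'SV': 'nsv3972446',
                'dbVarPlacementList': {
                    'dbVarPlacement': {
                        'Chr': '3', 'Chr_start': '37017508', 'Chr_end': '37017509',
                        'Chr_inner_start': '0', 'Chr_inner_end': '0',
                    }
                },
            }
        }
    }
}

for line in placement_rows(example):
    print(line)
```

With the real data, placement_rows(dsdocs) could replace the nested loop, and the same lines could be written to a file instead of printed.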


output:

Count = 55722
RetMax = 20
RetStart = 0
QueryKey = 1
WebEnv = NCID_1_77655541_130.14.22.76_9001_1560183999_561741772_0MetA0_S_MegaStore
IdList = ['48583463', '48583462', '48583461', '48583460', '48583459', '48583458', '48583457', '48583456', '48583455', '48583454', '48583453', '48583452', '48583451', '48583450', '48583449', '48583448', '48583447', '48583446', '48583445', '48583444']
TranslationSet = []
TranslationStack = [DictElement({'Term': '"variant"[Object Type]', 'Field': 'Object Type', 'Count': '5535034', 'Explode': 'N'}, attributes={}), DictElement({'Term': 'nstd102[All Fields]', 'Field': 'All Fields', 'Count': '55723', 'Explode': 'N'}, attributes={}), 'AND', 'GROUP']
QueryTranslation = "variant"[Object Type] AND nstd102[All Fields]
48583463 nstd102 nsv3972446 3 37017508 37017509 0 0
48583463 nstd102 nsv3972446 3 37058999 37059000 0 0
48583463 nstd102 nsv3972446 3 37034003 37034004 0 0
48583462 nstd102 nsv3972445 12 6022792 6022793 0 0
48583462 nstd102 nsv3972445 12 6131958 6131959 0 0
48583462 nstd102 nsv3972445 12 6002219 6002220 0 0
48583461 nstd102 nsv3972444 2 219570775 219570776 0 0
48583461 nstd102 nsv3972444 2 220435497 220435498 0 0
48583461 nstd102 nsv3972444 2 220143741 220143742 0 0
48583460 nstd102 nsv3972443 1 109610052 109610058 0 0
48583460 nstd102 nsv3972443 1 110152674 110152680 0 0
48583460 nstd102 nsv3972443 1 109954197 109954203 0 0
48583459 nstd102 nsv3972442 9 35092494 35092495 0 0
48583459 nstd102 nsv3972442 9 35092491 35092492 0 0
48583459 nstd102 nsv3972442 9 35082491 35082492 0 0
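
The search reports Count = 55722, but the script requests only five summaries (retmax = 5). Retrieving the full set would mean stepping retstart through the result set in batches. The sketch below shows one possible way to compute those offsets. The function names and the batch size of 500 are assumptions, and fetch_page is a hypothetical callable that would wrap the eSummary request.

```python
def batch_offsets(count, batch_size):
    """Return the retstart value for each page needed to cover count records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return list(range(0, count, batch_size))

def fetch_all(fetch_page, count, batch_size=500):
    """Call fetch_page(retstart, retmax) once per page and collect the results."""
    return [fetch_page(start, batch_size) for start in batch_offsets(count, batch_size)]

# Offsets needed to cover the 55722 records reported by the search.
offsets = batch_offsets(55722, 500)
print(len(offsets), offsets[0], offsets[-1])   # prints: 112 0 55500

# Stand-in fetch function that only records what would be requested.
pages = fetch_all(lambda start, size: (start, size), 1200, batch_size=500)
print(pages)   # prints: [(0, 500), (500, 500), (1000, 500)]
```

In the script, fetch_page could call Entrez.esummary with the same WebEnv and query_key and the given retstart and retmax, with a pause between requests to stay within NCBI's usage limits.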

Last updated: 2019-06-10T23:04:27Z