EUtils

EUtils provides an interface to the Entrez databases at NCBI. The API is defined in the BioServices.EUtils module, which exports nine functions to access these databases:

Function	Description
`einfo`	Retrieve a list of databases or statistics for a database.
`esearch`	Retrieve a list of UIDs matching a text query.
`epost`	Upload or append a list of UIDs to the Entrez History server.
`esummary`	Retrieve document summaries for a list of UIDs.
`efetch`	Retrieve formatted data records for a list of UIDs.
`elink`	Retrieve UIDs linked to an input set of UIDs.
`egquery`	Retrieve the number of records in each database that match a text query.
`espell`	Retrieve spelling suggestions.
`ecitmatch`	Retrieve PubMed IDs that correspond to a set of input citation strings.

"The Nine E-Utilities in Brief" summarizes all of the server-side programs corresponding to each function.

In this package, database queries are controlled by keyword parameters. For example, some functions take a db parameter to specify the target database. The functions listed above take these parameters as keyword arguments and return a Response object as follows:

julia> using BioServices.EUtils       # import the nine functions above

julia> res = einfo(db="pubmed")       # retrieve statistics of the PubMed database
Response(200 OK, 18 headers, 27360 bytes in body)

julia> write("pubmed.xml", res.body)  # save retrieved data into a file
27360

shell> head pubmed.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
        <DbInfo>
        <DbName>pubmed</DbName>
        <MenuName>PubMed</MenuName>
        <Description>PubMed bibliographic record</Description>
        <DbBuild>Build161024-2207m.1</DbBuild>
        <Count>26590895</Count>
        <LastUpdate>2016/10/25 02:06</LastUpdate>

Let's see a few more examples of parameters. The term parameter specifies a search string (e.g. esearch(db="gene", term="tumor AND human[ORGN]")). The id parameter specifies a UID (or accession number) or a list of UIDs (e.g. efetch(db="protein", id="NP_000537.3", rettype="fasta"), efetch(db="snp", id=["rs55863639", "rs587780067"])). The complete list of parameters can be found at "The E-utilities In-Depth: Parameters, Syntax and More".
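
To make the mapping from keyword arguments to a request concrete, here is a minimal sketch of how such parameters could be turned into an E-utilities query URL. It is an illustration only, not the implementation used by BioServices: `percent_encode`, `format_value` and `eutils_url` are hypothetical helpers, and the sketch only covers programs served as `<name>.fcgi` under the public E-utilities base URL.

```julia
# Public base URL of the E-utilities server-side programs.
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hypothetical helper: percent-encode every byte that is not an
# RFC 3986 unreserved character.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if ('A' <= c <= 'Z') || ('a' <= c <= 'z') || ('0' <= c <= '9') ||
           c in ('-', '_', '.', '~')
            write(io, b)
        else
            print(io, '%', uppercase(string(b, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Hypothetical helper: a list of UIDs is sent as one comma-separated value.
format_value(v) = string(v)
format_value(v::AbstractVector) = join(v, ",")

# Hypothetical helper: build "<base><program>.fcgi?key=value&..." from keywords.
function eutils_url(program::AbstractString; params...)
    query = join(("$(key)=$(percent_encode(format_value(value)))"
                  for (key, value) in params), "&")
    return string(EUTILS_BASE, program, ".fcgi?", query)
end

search_url = eutils_url("esearch", db="gene", term="tumor AND human[ORGN]")
fetch_url  = eutils_url("efetch", db="snp", id=["rs55863639", "rs587780067"])
println(search_url)
println(fetch_url)
```

Printing the URLs is a convenient way to check how a search term with spaces or brackets is escaped before it is sent to the server.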

When a request succeeds, the response object has a body field containing formatted data, which can be saved to a file as demonstrated above. However, users are often interested in only part of the response data and may want to extract some of its fields. In such cases, EzXML.jl is helpful because it offers many tools for handling XML documents. The first step is to convert the response data into an XML document with parsexml:

julia> res = efetch(db="nuccore", id="NM_001126.3", retmode="xml")
Response(200 OK, 19 headers, 41536 bytes in body)

julia> using EzXML

julia> doc = parsexml(res.body)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fdd4cc43770>))

Note that because res is an HTTP response message, its content can be read at most once. Accessing res.body again will return an empty array.
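
If the data is needed more than once, one option is to copy the bytes before converting them. The sketch below assumes that the body is a `Vector{UInt8}`, as in recent HTTP.jl versions; the `body` variable is a local stand-in for `res.body`, so no request is made. It relies on documented Julia behaviour: `String(v)` takes ownership of a `Vector{UInt8}` and truncates it to zero length, while `String(copy(v))` leaves `v` untouched.

```julia
# Stand-in for res.body (assumption: the body is a Vector{UInt8}).
body = Vector{UInt8}("<eInfoResult><DbName>pubmed</DbName></eInfoResult>")

# Copy first: the string gets its own buffer and body keeps its bytes.
text = String(copy(body))
n_after_copy = length(body)

# Convert directly: the string takes over the buffer and body is emptied.
consumed = String(body)
n_after_take = length(body)

println("bytes left after String(copy(body)): ", n_after_copy)
println("bytes left after String(body): ", n_after_take)
```

Keeping the copied string (here `text`) around lets you both save the data to a file and parse it later.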

Once the document is parsed, you can query the fields you want using XPath:

julia> seq = findfirst("/GBSet/GBSeq", doc)
EzXML.Node(<ELEMENT_NODE@0x00007fdd49f34b10>)

julia> nodecontent(findfirst("GBSeq_definition", seq))
"Homo sapiens adenylosuccinate synthase (ADSS), mRNA"

julia> nodecontent(findfirst("GBSeq_accession-version", seq))
"NM_001126.3"

julia> length(findall("//GBReference", seq))
10

julia> using Bio.Seq

julia> DNASequence(nodecontent(findfirst("GBSeq_sequence", seq)))
2791nt DNA Sequence:
ACGGGAGTGGCGCGCCAGGCCGCGGAAGGGGCGTGGCCT…TGATTAAAAGAACCAAATATTTCTAGTATGAAAAAAAAA

Every function can take a context dictionary as its first argument to set parameters for a query. Key-value pairs in the context are appended to the query in addition to the parameters passed as keyword arguments. The default context is an empty dictionary, which sets no parameters. A context dictionary is especially useful for temporarily caching query UIDs on the Entrez History server: a request to the Entrez system can be associated with cached data through the WebEnv and query_key parameters. In the following example, the search results of esearch are saved on the Entrez History server (note usehistory=true, which makes the server cache its search results), and their summaries are then retrieved by the next call to esummary:

julia> context = Dict()  # create an empty context
Dict{Any,Any} with 0 entries

julia> res = esearch(context, db="pubmed", term="asthma[mesh] AND leukotrienes[mesh] AND 2009[pdat]", usehistory=true)
Response(200 OK, 18 headers, 1574 bytes in body)

julia> context  # the context dictionary has been updated
Dict{Any,Any} with 2 entries:
  :query_key => "1"
  :WebEnv    => "NCID_1_9251987_130.14.22.215_9001_1477389145_1960133…

julia> res = esummary(context, db="pubmed")  # retrieve summaries in context
Response(200 OK, 18 headers, 135463 bytes in body)

julia> write("asthma_leukotrienes_2009.xml", res.body)  # save data into a file
135463

shell> head asthma_leukotrienes_2009.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20041029//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20041029/esummary-v1.dtd">
<eSummaryResult>
<DocSum>
        <Id>20113659</Id>
        <Item Name="PubDate" Type="Date">2009 Nov</Item>
        <Item Name="EPubDate" Type="Date"></Item>
        <Item Name="Source" Type="String">Zhongguo Dang Dai Er Ke Za Zhi</Item>
        <Item Name="AuthorList" Type="List">
                <Item Name="Author" Type="String">He MJ</Item>
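
Summaries like the one above can be reduced to a few fields with the same EzXML functions used earlier. The following is a minimal sketch that runs offline: the `xml` string is an abridged sample built from the lines shown above, with closing tags added for illustration, and the field names in the resulting tuples are arbitrary choices. With a live response you would pass `res.body` to `parsexml` instead.

```julia
using EzXML

# Abridged sample modelled on the esummary output above
# (closing tags added so that the snippet is well-formed).
xml = """
<eSummaryResult>
<DocSum>
    <Id>20113659</Id>
    <Item Name="PubDate" Type="Date">2009 Nov</Item>
    <Item Name="EPubDate" Type="Date"></Item>
    <Item Name="Source" Type="String">Zhongguo Dang Dai Er Ke Za Zhi</Item>
    <Item Name="AuthorList" Type="List">
        <Item Name="Author" Type="String">He MJ</Item>
    </Item>
</DocSum>
</eSummaryResult>
"""

doc = parsexml(xml)

# One named tuple per <DocSum>; items are selected by their Name attribute.
summaries = map(findall("/eSummaryResult/DocSum", doc)) do docsum
    (
        id      = nodecontent(findfirst("Id", docsum)),
        pubdate = nodecontent(findfirst("Item[@Name='PubDate']", docsum)),
        source  = nodecontent(findfirst("Item[@Name='Source']", docsum)),
        authors = nodecontent.(findall("Item[@Name='AuthorList']/Item[@Name='Author']", docsum)),
    )
end

for s in summaries
    println(s.id, "\t", s.pubdate, "\t", s.source, "\t", join(s.authors, ", "))
end
```

Selecting items by attribute keeps the code independent of the order in which the server lists them.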

XMLDict and LightXML

Besides EzXML, there are other packages for parsing XML documents. The Response object returned by the functions in the BioServices.EUtils module is compatible with all of them.
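
With LightXML, the response text can be parsed along the following lines. This is a minimal offline sketch: the `xml` string is a small sample modelled on the GBSet record used earlier, standing in for `String(res.body)`. LightXML is loaded with `import` and its functions are qualified, because several of its exported names (such as `root`) also exist in EzXML.

```julia
import LightXML

# Small sample standing in for String(res.body) from an efetch call.
xml = """
<GBSet>
  <GBSeq>
    <GBSeq_definition>Homo sapiens adenylosuccinate synthase (ADSS), mRNA</GBSeq_definition>
    <GBSeq_accession-version>NM_001126.3</GBSeq_accession-version>
  </GBSeq>
</GBSet>
"""

xdoc  = LightXML.parse_string(xml)             # parse the text into a document
gbset = LightXML.root(xdoc)                    # <GBSet> root element
gbseq = LightXML.find_element(gbset, "GBSeq")  # first <GBSeq> child

definition = LightXML.content(LightXML.find_element(gbseq, "GBSeq_definition"))
accession  = LightXML.content(LightXML.find_element(gbseq, "GBSeq_accession-version"))

LightXML.free(xdoc)  # LightXML documents are released explicitly

println(accession, ": ", definition)
```

XMLDict works on the response body in a similar way: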

julia> res = efetch(db="nuccore", id="NM_001126.3", retmode="xml")
Response(200 OK, 19 headers, 41536 bytes in body)

julia> using XMLDict

julia> doc = parse_xml(String(res.body))
XMLDict.XMLDictElement with 1 entry:
  "GBSeq" => EzXML.Node(<ELEMENT_NODE[GBSeq]@0x00007fef2271e770>)