EUtils
EUtils provide a interface to Entrez databases at NCBI. The APIs are defined in the BioServices.EUtils
module, which exports nine functions to access its databases:
Function | Description |
---|---|
einfo | Retrieve a list of databases or statistics for a database. |
esearch | Retrieve a list of UIDs matching a text query. |
epost | Upload or append a list of UIDs to the Entrez History server. |
esummary | Retrieve document summaries for a list of UIDs. |
efetch | Retrieve formatted data records for a list of UIDs. |
elink | Retrieve UIDs linked to an input set of UIDs. |
egquery | Retrieve the number of available records in all databases by a text query. |
espell | Retrieve spelling suggestions. |
ecitmatch | Retrieve PubMed IDs that correspond to a set of input citation strings. |
"The Nine E-Utilities in Brief" summarizes all of the server-side programs corresponding to each function.
In this package, queries for databases are controlled by keyword parameters. For example, some functions take db
parameter to specify the target database. Functions listed above take these parameters as keyword arguments and return a Response
object as follows:
julia> using BioServices.EUtils # import the nine functions above
julia> res = einfo(db="pubmed") # retrieve statistics of the PubMed database
Response(200 OK, 18 headers, 27360 bytes in body)
julia> write("pubmed.xml", res.body) # save retrieved data into a file
27360
shell> head pubmed.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<DbBuild>Build161024-2207m.1</DbBuild>
<Count>26590895</Count>
<LastUpdate>2016/10/25 02:06</LastUpdate>
Let's see a few more examples of parameters. The term
parameter specifies a search string (e.g. esearch(db="gene", term="tumor AND human[ORGN]")
). The id
parameter specifies a UID (or accession number) or a list of UIDs (e.g. efetch(db="protein", id="NP_000537.3", rettype="fasta")
, efetch(db="snp", id=["rs55863639", "rs587780067"])
). The complete list of parameters can be found at "The E-utilities In-Depth: Parameters, Syntax and More".
When a request succeeds the response object has a body
field containing formatted data, which can be saved to a file as demonstrated above. However, users are often interested in a part of the response data and may want to extract some fields in it. In such a case, EzXML.jl is helpful because it offers lots of tools to handle XML documents. The first thing you need to do is converting the response data into an XML document by parsexml
:
julia> res = efetch(db="nuccore", id="NM_001126.3", retmode="xml")
Response(200 OK, 19 headers, 41536 bytes in body)
julia> using EzXML
julia> doc = parsexml(res.body)
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fdd4cc43770>))
Note that because res
is an HTTP Response Message, its content can be read at most once. Running res.body
again will reveal an empty Array.
After that, you can query fields you want using XPath:
julia> seq = findfirst("/GBSet/GBSeq", doc)
EzXML.Node(<ELEMENT_NODE@0x00007fdd49f34b10>)
julia> nodecontent(findfirst("GBSeq_definition", seq))
"Homo sapiens adenylosuccinate synthase (ADSS), mRNA"
julia> nodecontent(findfirst("GBSeq_accession-version", seq))
"NM_001126.3"
julia> length(findall("//GBReference", seq))
10
julia> using Bio.Seq
julia> DNASequence(nodecontent(findfirst("GBSeq_sequence", seq)))
2791nt DNA Sequence:
ACGGGAGTGGCGCGCCAGGCCGCGGAAGGGGCGTGGCCT…TGATTAAAAGAACCAAATATTTCTAGTATGAAAAAAAAA
Every function can take a context dictionary as its first argument to set parameters for a query. Key-value pairs in a context are appended to the query in addition to other parameters passed by keyword arguments. The default context is an empty dictionary that sets no parameters. This context dictionary is especially useful when temporarily caching query UIDs into the Entrez History server. A request to the Entrez system can be associated with cached data using WebEnv
and query_key
parameters. In the following example, the search results of esearch
is saved in the Entrez History server (note usehistory=true
, which makes the server cache its search results) and then their summaries are retrieved in the next call of esummary
:
julia> context = Dict() # create an empty context
Dict{Any,Any} with 0 entries
julia> res = esearch(context, db="pubmed", term="asthma[mesh] AND leukotrienes[mesh] AND 2009[pdat]", usehistory=true)
Response(200 OK, 18 headers, 1574 bytes in body)
julia> context # the context dictionary has been updated
Dict{Any,Any} with 2 entries:
:query_key => "1"
:WebEnv => "NCID_1_9251987_130.14.22.215_9001_1477389145_1960133…
julia> res = esummary(context, db="pubmed") # retrieve summaries in context
Response(200 OK, 18 headers, 135463 bytes in body)
julia> write("asthma_leukotrienes_2009.xml", res.body) # save data into a file
135463
shell> head asthma_leukotrienes_2009.xml
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20041029//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20041029/esummary-v1.dtd">
<eSummaryResult>
<DocSum>
<Id>20113659</Id>
<Item Name="PubDate" Type="Date">2009 Nov</Item>
<Item Name="EPubDate" Type="Date"></Item>
<Item Name="Source" Type="String">Zhongguo Dang Dai Er Ke Za Zhi</Item>
<Item Name="AuthorList" Type="List">
<Item Name="Author" Type="String">He MJ</Item>
XMLDict and LightXML
Along with EZXml
there are also other packages for parsing XML objects. The HTTP object returned by the BioServices.EUtils
module is compatible with all of them.
julia> res = efetch(db="nuccore", id="NM_001126.3", retmode="xml")
Response(200 OK, 19 headers, 41536 bytes in body)
julia> using XMLDict
julia> doc = parse_xml(String(res.body))
XMLDict.XMLDictElement with 1 entry:
"GBSeq" => EzXML.Node(<ELEMENT_NODE[GBSeq]@0x00007fef2271e770>)