Usage Guide

This guide will teach you how to assemble SQL queries using FunSQL.

Test Database

To demonstrate database queries, we need a test database. The database we use here is a tiny 10 person sample of simulated patient data extracted from a much larger CMS DE-SynPuf dataset. For a database engine, we picked SQLite. Using SQLite in a guide is convenient because it does not require a database server to run and allows us to distribute the whole database as a single file. FunSQL supports SQLite and many other database engines. The techniques discussed here are not specific to SQLite and once you learn them, you will be able to apply them to any SQL database.

If you wish to follow along with the guide and run the examples, download the database file:

const URL = "https://github.com/MechanicalRabbit/ohdsi-synpuf-demo/releases/download/20210412/synpuf-10p.sqlite"
const DB = download(URL)

All examples in this guide are tested on each update using the NarrativeTest package. To avoid downloading the database file all the time, we registered the download URL as an artifact and use Pkg.Artifacts API to fetch it:

using Pkg.Artifacts, LazyArtifacts

const DB = joinpath(artifact"synpuf-10p", "synpuf-10p.sqlite")
#-> ⋮

Database Connection

To interact with a SQLite database from Julia code, we need to install the SQLite package:

using Pkg

Pkg.add("SQLite")

Once the package is installed, we can use it to connect to the database:

using SQLite

const conn = SQLite.DB(DB)

Later we will use the conn object to execute database queries.

Database Schema

The data in the test database is stored in the format of the OMOP Common Data Model, an open source database schema for observational healthcare data. In this guide, we will only use a small fragment of the Common Data Model.

Fragment of the OMOP Common Data Model

Before we can start assembling queries with FunSQL, we need to make FunSQL aware of the database schema. For each table in the database, we need to create a corresponding SQLTable object, which encapsulates the name of the table together with the names of the columns.

using FunSQL: SQLTable

The patient data, including basic demographic information, is stored in the table person:

const person =
    SQLTable(:person,
             columns = [:person_id, :year_of_birth, :location_id])

Patient addresses are stored in a separate table location, linked to person by the key column location_id:

const location =
    SQLTable(:location,
             columns = [:location_id, :city, :state])

The bulk of patient data consists of clinical events: visits to healthcare providers, recorded observations, diagnosed conditions, prescribed medications, etc. In this guide we only use two types of events, visits and conditions:

const visit_occurrence =
    SQLTable(:visit_occurrence,
             columns = [:visit_occurrence_id, :person_id,
                        :visit_concept_id,
                        :visit_start_date, :visit_end_date])

const condition_occurrence =
    SQLTable(:condition_occurrence,
             columns = [:condition_occurrence_id, :person_id,
                        :condition_concept_id,
                        :condition_start_date, :condition_end_date])

The specific type of the event (e.g., Inpatient visit or Essential hypertension condition) is indicated using a concept id column, which refers to the concept table:

const concept =
    SQLTable(:concept,
             columns = [:concept_id, :concept_name])

Different concepts may be related to each other. For instance, Essential hypertension is a Hypertensive disorder, which itself is a Disorder of cardiovascular system. Concept relationships are recorded in the corresponding table:

const concept_relationship =
    SQLTable(:concept_relationship,
             columns = [:concept_id_1, :concept_id_2, :relationship_id])

Why FunSQL?

Let us start with clarifying the purpose of FunSQL. Consider a problem:

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age (by the end of 2020).

The answer can be obtained with the following SQL query:

SELECT p.person_id, 2020 - p.year_of_birth AS age
FROM person p
JOIN location l ON (p.location_id = l.location_id)
WHERE (p.year_of_birth BETWEEN 1930 AND 1940) AND (l.state = 'IL')

The simplest way to incorporate this query into Julia code is to embed it as a string literal:

sql = """
SELECT p.person_id, 2020 - p.year_of_birth AS age
FROM person p
JOIN location l ON (p.location_id = l.location_id)
WHERE (p.year_of_birth BETWEEN 1930 AND 1940) AND (l.state = 'IL')
"""

Using the appropriate database engine API and the connection object created earlier, we can execute this query and get back the answer:

res = DBInterface.execute(conn, sql)
#-> SQLite.Query( … )

As an aside, it is convenient to use the DataFrame interface to show the output of a query in tabular form:

using DataFrames

DataFrame(res)
#=>
1×2 DataFrame
 Row │ person_id  age
     │ Int64      Int64
─────┼──────────────────
   1 │     72120     83
=#

FunSQL introduces an extra step to this workflow. Instead of embedding the SQL query directly into Julia code, we construct a query object:

using FunSQL: As, From, Fun, Get, Join, Select, Where

q = From(person) |>
    Where(Fun.between(Get.year_of_birth, 1930, 1940)) |>
    Join(From(location) |> Where(Get.state .== "IL") |> As(:location),
         on = Get.location_id .== Get.location.location_id) |>
    Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

The value of q is a composite object of type SQLNode. "Composite" means that q is assembled from components (also of type SQLNode), which themselves are either atomic or assembled from smaller components. Different kinds of components are created by SQLNode constructors such as From, Where, Fun, Get, etc.

The actual SQL query is generated by rendering the query object:

using FunSQL: render

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_2"."person_id",
  (2020 - "person_2"."year_of_birth") AS "age"
FROM (
  SELECT
    "person_1"."location_id",
    "person_1"."person_id",
    "person_1"."year_of_birth"
  FROM "person" AS "person_1"
  WHERE ("person_1"."year_of_birth" BETWEEN 1930 AND 1940)
) AS "person_2"
JOIN (
  SELECT "location_1"."location_id"
  FROM "location" AS "location_1"
  WHERE ("location_1"."state" = 'IL')
) AS "location_2" ON ("person_2"."location_id" = "location_2"."location_id")
=#

Notice that the render function takes a parameter called dialect. Although the SQL language is standardized, different implementations of SQL tend to deviate from the standard far enough to make them mutually incompatible. For this reason, FunSQL lets us select the target SQL dialect.

At this point, the job of FunSQL is done and, just as before, we can execute the query and display the result:

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
1×2 DataFrame
 Row │ person_id  age
     │ Int64      Int64
─────┼──────────────────
   1 │     72120     83
=#

Why, instead of embedding a complete SQL query, we prefer to generate it through a query object? To justify this extra step, consider that in a real Julia program, any query is likely going to be parameterized:

Find all patients born between $start_year and $end_year and living in $states, and for each patient show the $output_columns.

If this is the case, the SQL query cannot be prepared in advance and must be assembled on the fly. While it is possible to assemble a SQL query from string fragments, it is tedious, error-prone and definitely not fun. FunSQL provides a more robust and effective approach: build the query as a composite data structure.

Here is how a parameterized query may be constructed with FunSQL:

function FindPatients(; start_year = nothing,
                        end_year = nothing,
                        states = String[])
    q = From(person) |>
        Where(BirthRange(start_year, end_year))
    if !isempty(states)
        q = q |>
            Join(:location => From(location) |>
                              Where(Fun.in(Get.state, states...)),
                 on = Get.location_id .== Get.location.location_id)
    end
    q
end

function BirthRange(start_year, end_year)
    p = true
    if start_year !== nothing
        p = Fun.and(p, Get.year_of_birth .>= start_year)
    end
    if end_year !== nothing
        p = Fun.and(p, Get.year_of_birth .<= end_year)
    end
    p
end

The function FindPatients effectively becomes a new SQLNode constructor, which can be used directly or as a component of a larger query.

Show all patients.

q = FindPatients()

print(render(q, dialect = :sqlite))
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
=#

Show all patients born in or after 1930.

q = FindPatients(start_year = 1930) |>
    Select(Get.person_id)

print(render(q, dialect = :sqlite))
#=>
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" >= 1930)
=#

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age.

q = FindPatients(start_year = 1930, end_year = 1940, states = ["IL"]) |>
    Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

print(render(q, dialect = :sqlite))
#=>
SELECT
  "person_2"."person_id",
  (2020 - "person_2"."year_of_birth") AS "age"
FROM (
  SELECT
    "person_1"."location_id",
    "person_1"."person_id",
    "person_1"."year_of_birth"
  FROM "person" AS "person_1"
  WHERE
    ("person_1"."year_of_birth" >= 1930) AND
    ("person_1"."year_of_birth" <= 1940)
) AS "person_2"
JOIN (
  SELECT "location_1"."location_id"
  FROM "location" AS "location_1"
  WHERE ("location_1"."state" IN ('IL'))
) AS "location_2" ON ("person_2"."location_id" = "location_2"."location_id")
=#

Tabular Operations

Recall the query from the previous section:

Find all patients born between 1930 and 1940 and living in Illinois, and for each patient show their current age.

From(person) |>
Where(Fun.between(Get.year_of_birth, 1930, 1940)) |>
Join(From(location) |> Where(Get.state .== "IL") |> As(:location),
     on = Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

At the outer level, this query is constructed from tabular operations From, Where, Join, and Select arranged in a pipeline by the pipe (|>) operator. In SQL, a tabular operation takes a certain number of input datasets and produces an output dataset. It is helpful to visualize a tabular operation as a node with a certain number of input arrows and one output arrow.

From, Where, Select, and Join nodes

Then the whole query can be visualized as a pipeline diagram. Each arrow in this diagram represents a dataset, and each node represents an elementary data processing operation.

Query pipeline

The following tabular operations are available in FunSQL.

Constructor	Function
`Append`	concatenate datasets
`As`	wrap all columns in a nested record
`Bind`	correlate a subquery in a join expression
`Define`	add an output column
`From`	produce the content of a database table
`Group`	partition the dataset into disjoint groups
`Iterate`	iterate a query
`Join`	correlate two datasets
`Limit`	truncate the dataset
`Order`	sort the dataset
`Partition`	add a window to the dataset
`Select`	specify output columns
`Where`	filter the dataset by the given condition
`With`	assign a name to a temporary dataset

`From`, `Select`, and `Define`

The From node outputs the content of a database table. The constructor takes one argument, a SQLTable object (see the section Database Schema). In a query, a bare SQLTable object is automatically converted to a From node, so one could write more compactly:

For each patient, show their current age.

person |>
Select(Get.person_id, :age => 2020 .- Get.year_of_birth)

It is possible for a query not to have a From node:

Show the current date and time.

q = Select(Fun.current_timestamp())

sql = render(q)

print(sql)
#-> SELECT CURRENT_TIMESTAMP AS "current_timestamp"

In this query, the Select node is not connected to any source of data. In such a case, it is supplied with a unit dataset containing one row and no columns. Hence this query will generate one row of output.

The same effect could be achieved with the From(nothing) node.

q = From(nothing) |>
    Select(Fun.current_timestamp())

sql = render(q)

print(sql)
#-> SELECT CURRENT_TIMESTAMP AS "current_timestamp"

In general, the Select node is used to specify the output columns. The name of the column is either derived from the expression or set explicitly with As (or its shorthand, the arrow (=>) operator).

As opposed to SQL, FunSQL does not demand that all queries have an explicit Select. The following query will produce all columns of the table:

Show all patients.

q = From(person)

sql = render(q)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
=#

To add a new column while preserving existing output columns, we use the Define node.

Show the patient data together with their current age.

using FunSQL: Define

q = From(person) |>
    Define(:age => 2020 .- Get.year_of_birth)

sql = render(q)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id",
  (2020 - "person_1"."year_of_birth") AS "age"
FROM "person" AS "person_1"
=#

Define could also be used to replace an existing column.

Hide the year of birth of patients born before 1930.

q = From(person) |>
    Define(:year_of_birth => Fun.case(Get.year_of_birth .>= 1930,
                                      Get.year_of_birth,
                                      missing))

sql = render(q)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  (CASE WHEN ("person_1"."year_of_birth" >= 1930) THEN "person_1"."year_of_birth" ELSE NULL END) AS "year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
=#

`Join`

The Join node correlates the rows of two input datasets. Predominantly, Join is used for looking up table records by key. In the following example, Join associates each person record with their location using the key column location_id that uniquely identifies a location record:

Show all patients together with their state of residence.

person |>
Join(:location => location,
     Get.location_id .== Get.location.location_id,
     left = true) |>
Select(Get.person_id, Get.location.state)

The modifier left = true tells Join that it must output all person records including those without the corresponding location. Since this is a very common requirement, FunSQL provides an alias:

using FunSQL: LeftJoin

person |>
LeftJoin(:location => location,
         Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, Get.location.state)

Since Join needs two input datasets, it must be attached to two input pipelines. The first pipeline is attached using the |> operator and the second one is provided as an argument to the Join constructor. Alternatively, both input pipelines can be specified as keyword arguments:

Join(over = person,
     joinee = :location => location,
     on = Get.location_id .== Get.location.location_id,
     left = true) |>
Select(Get.person_id, Get.location.state)

The output of Join combines columns of both input datasets, which will cause ambiguity if both datasets have a column with the same name. Such is the case in the previous example since both tables, person and location, have a column called location_id. To disambiguate them, we can place all columns of one of the datasets into a nested record. This is the action of the arrow (=>) operator or its full form, the As node:

using FunSQL: As

From(person) |>
LeftJoin(From(location) |> As(:location),
         on = Get.location_id .== Get.location.location_id) |>
Select(Get.person_id, Get.location.state)

Alternatively, we could use bound column references, which are described in a later section.

Scalar Operations

Many tabular operations including Join, Select and Where are parameterized with scalar operations. A scalar operation acts on an individual row of a dataset and produces a scalar value. Scalar operations are assembled from literal values, column references, and applications of SQL functions and operators. Below is a list of scalar operations available in FunSQL.

Constructor	Function
`Agg`	apply an aggregate function
`As`	assign a column alias
`Bind`	correlate an inner subquery
`Fun`	apply a scalar function or a scalar operator
`Get`	produce the value of a column
`Lit`	produce a constant value
`Sort`	indicate the sort order
`Var`	produce the value of a query parameter

`Lit`

The Lit node creates a literal value, although we could usually omit the constructor:

using FunSQL: Lit

Select(Lit(42))
Select(42)

The SQL value NULL is represented by the Julia constant missing:

q = Select(missing)

sql = render(q, dialect = :sqlite)

print(sql)
#-> SELECT NULL AS "_"

`Get`

The Get node creates a column reference. The Get constructor admits several equivalent forms:

Get.year_of_birth
Get(:year_of_birth)
Get."year_of_birth"
Get("year_of_birth")

Such column references are resolved at the place of use against the input dataset. As we mentioned earlier, sometimes column references cannot be resolved unambiguously. To alleviate this problem, we can bind the column reference to the node that produces it:

Show all patients with their state of residence.

qₚ = From(person)
qₗ = From(location)
q = qₚ |>
    LeftJoin(qₗ, on = qₚ.location_id .== qₗ.location_id) |>
    Select(qₚ.person_id, qₗ.state)

The notation qₚ.location_id and qₗ.location_id is a syntax sugar for

Get(:location_id, over = qₚ)
Get(:location_id, over = qₗ)

`Fun`

SQL functions and operators are represented using the Fun node, which also has several equivalent forms:

Fun.between(Get.year_of_birth, 1930, 1940)
Fun(:between, Get.year_of_birth, 1930, 1940)
Fun."between"(Get.year_of_birth, 1930, 1940)
Fun("between", Get.year_of_birth, 1930, 1940)

Certain SQL operators, notably comparison operators, can be represented using Julia broadcasting notation:

Fun.">="(Get.year_of_birth, 1930)
Get.year_of_birth .>= 1930

We should note that FunSQL does not verify if a SQL function or an operator is used correctly or even whether it exists or not. In such a case, FunSQL will generate a SQL query that fails to execute:

q = From(person) |>
    Select(Fun.frobnicate(Get.year_of_birth))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT FROBNICATE("person_1"."year_of_birth") AS "frobnicate"
FROM "person" AS "person_1"
=#

DBInterface.execute(conn, sql)
#-> ERROR: SQLite.SQLiteException("no such function: FROBNICATE")

On the other hand, FunSQL will correctly serialize many SQL functions and operators that have irregular syntax including AND, OR, NOT, IN, EXISTS, CASE, and others:

Show the demographic cohort of each patient.

q = From(person) |>
    Select(Fun.case(Get.year_of_birth .<= 1960, "boomer", "millenial"))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT (CASE WHEN ("person_1"."year_of_birth" <= 1960) THEN 'boomer' ELSE 'millenial' END) AS "case"
FROM "person" AS "person_1"
=#

`Group` and Aggregate Functions

Group and aggregate functions are used for summarizing data to report totals, averages and so on. We start by applying the Group node to partition the input rows into disjoint groups. We can then use aggregate functions to calculate summary values from the rows of each group. In FunSQL, aggregate functions are created using the Agg node. In the following example, we use the aggregate function Agg.count, which simply counts the number of rows in each group.

Show the number of patients by the year of birth.

using FunSQL: Agg, Group

q = From(person) |>
    Group(Get.year_of_birth) |>
    Select(Get.year_of_birth, Agg.count())

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."year_of_birth",
  COUNT(*) AS "count"
FROM "person" AS "person_1"
GROUP BY "person_1"."year_of_birth"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
10×2 DataFrame
 Row │ year_of_birth  count
     │ Int64          Int64
─────┼──────────────────────
   1 │          1911      1
   2 │          1913      1
   3 │          1922      1
⋮
=#

To indicate that aggregate functions must be applied to the dataset as a whole, we create a Group node without arguments. This is the case where FunSQL notation deviates from SQL, where we would omit the GROUP BY clause to achieve the same effect.

Show the average year of birth.

q = From(person) |>
    Group() |>
    Select(Agg.avg(Get.year_of_birth))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT AVG("person_1"."year_of_birth") AS "avg"
FROM "person" AS "person_1"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
1×1 DataFrame
 Row │ avg
     │ Float64
─────┼─────────
   1 │  1935.4
=#

In general, the arguments of the Group node form the grouping key so that two rows of the input dataset belongs to the same group when they have the same value of the grouping key. The output of Group contains all distinct values of the grouping key.

Show the US states that are present in the location records.

q = From(location) |>
    Group(Get.state)

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT DISTINCT "location_1"."state"
FROM "location" AS "location_1"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
10×1 DataFrame
 Row │ state
     │ String
─────┼────────
   1 │ MI
   2 │ WA
   3 │ FL
⋮
=#

FunSQL has no lexical limitations on the use of aggregate functions. While in SQL aggregate functions can only be used in the SELECT or HAVING clauses, there is no such restriction in FunSQL: they could be used in any context where an ordinary expression is permitted. The only requirement is that for each aggregate function, FunSQL can determine the corresponding Group node. It is convenient to imagine that the output of Group contains the grouped rows, which cannot be observed directly, but whose presence in the output allows us to apply aggregate functions.

In particular, we use a regular Where node where SQL would require a HAVING clause.

Show patients who saw a doctor within the last year.

q = From(visit_occurrence) |>
    Group(Get.person_id) |>
    Where(Agg.max(Get.visit_end_date) .>= Fun.date("now", "-1 year"))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT "visit_occurrence_1"."person_id"
FROM "visit_occurrence" AS "visit_occurrence_1"
GROUP BY "visit_occurrence_1"."person_id"
HAVING (MAX("visit_occurrence_1"."visit_end_date") >= DATE('now', '-1 year'))
=#

When the output of Group is blocked by an As node, we need to traverse it with Get in order to use an aggregate function.

For each patient, show the date of their latest visit to a doctor.

q = From(person) |>
    LeftJoin(:visit_group => From(visit_occurrence) |> Group(Get.person_id),
             on = Get.person_id .== Get.visit_group.person_id) |>
    Select(Get.person_id,
           Get.visit_group |> Agg.max(Get.visit_start_date))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "visit_group_1"."max"
FROM "person" AS "person_1"
LEFT JOIN (
  SELECT
    "visit_occurrence_1"."person_id",
    MAX("visit_occurrence_1"."visit_start_date") AS "max"
  FROM "visit_occurrence" AS "visit_occurrence_1"
  GROUP BY "visit_occurrence_1"."person_id"
) AS "visit_group_1" ON ("person_1"."person_id" = "visit_group_1"."person_id")
=#

`Partition` and Window Functions

We can relate each row to other rows in the same dataset using the Partition node and window functions. We start by applying the Partition node to partition the input rows into disjoint groups. The rows in each group are reordered according to the given sort order. Unlike Group, which collapses each row group into a single row, the Partition node preserves the original rows, but allows us to relate each row to adjacent rows in the same partition. In particular, we can apply regular aggregate functions, which calculate the summary value of a subset of rows related to the current row.

In the following example, the rows visit_occurrence are partitioned per patient and ordered by the starting date of the visit. The frame clause specifies the subset of rows relative to the current row (the window frame) to be used by aggregate functions. In this example, the frame contains all rows prior to the current row.

For each visit, show the time passed since the previous visit.

using FunSQL: Partition

q = From(visit_occurrence) |>
    Partition(Get.person_id,
              order_by = [Get.visit_start_date],
              frame = (mode = :rows, start = -Inf, finish = -1)) |>
    Select(Get.person_id,
           Get.visit_start_date,
           Get.visit_end_date,
           :gap => Fun.julianday(Get.visit_start_date) .- Fun.julianday(Agg.max(Get.visit_end_date)))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "visit_occurrence_1"."person_id",
  "visit_occurrence_1"."visit_start_date",
  "visit_occurrence_1"."visit_end_date",
  (JULIANDAY("visit_occurrence_1"."visit_start_date") - JULIANDAY((MAX("visit_occurrence_1"."visit_end_date") OVER (PARTITION BY "visit_occurrence_1"."person_id" ORDER BY "visit_occurrence_1"."visit_start_date" ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)))) AS "gap"
FROM "visit_occurrence" AS "visit_occurrence_1"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
27×4 DataFrame
 Row │ person_id  visit_start_date  visit_end_date  gap
     │ Int64      String            String          Float64?
─────┼────────────────────────────────────────────────────────
   1 │      1780  2008-04-09        2008-04-13      missing
   2 │      1780  2008-04-10        2008-04-10           -3.0
   3 │      1780  2008-11-22        2008-11-22          223.0
   4 │      1780  2009-05-22        2009-05-22          181.0
⋮
=#

Query Parameters

A SQL query may include a reference to a query parameter. When we execute such a query, we must supply the actual values for all parameters used in the query. This is a restricted form of dynamic query construction directly supported by SQL syntax.

Show all patients born between $start_year and $end_year.

sql = """
SELECT p.person_id
FROM person p
WHERE p.year_of_birth BETWEEN ? AND ?
"""

res = DBInterface.execute(conn, sql, (1930, 1940))

DataFrame(res)
#=>
3×1 DataFrame
 Row │ person_id
     │ Int64
─────┼───────────
   1 │      1780
   2 │     30091
   3 │     72120
=#

FunSQL can be used to construct a query with parameters. Similar to Get, parameter references are created using the Var node.

using FunSQL: Var

q = From(person) |>
    Where(Fun.between(Get.year_of_birth, Var.start_year, Var.end_year)) |>
    Select(Get.person_id)

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT "person_1"."person_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" BETWEEN ?1 AND ?2)
=#

While we specified parameters by name, in the generated SQL query the same parameters are numbered. If we know the values of the parameters and we wish to execute the query with them, we need to pack the values in the order in which parameters appear in the SQL query:

using FunSQL: pack

params = pack(sql, (start_year = 1930, end_year = 1940))
#-> Any[1930, 1940]

res = DBInterface.execute(conn, sql, params)

DataFrame(res)
#=>
3×1 DataFrame
 Row │ person_id
     │ Int64
─────┼───────────
   1 │      1780
   2 │     30091
   3 │     72120
=#

Correlated Queries

An inner query is a SQL query that is included into the outer query a part of a scalar expression. An inner query must either produce a single value or be used as an argument of a query operator, such as IN or EXISTS, which transforms the query output to a scalar value.

It is easy to assemble an inner query with FunSQL.

Find the oldest patients.

qᵢ = From(person) |>
     Group() |>
     Select(Agg.min(Get.year_of_birth))

qₒ = From(person) |>
     Where(Get.year_of_birth .== qᵢ)

sql = render(qₒ, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
WHERE ("person_1"."year_of_birth" = (
  SELECT MIN("person_2"."year_of_birth") AS "min"
  FROM "person" AS "person_2"
))
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
1×3 DataFrame
 Row │ person_id  year_of_birth  location_id
     │ Int64      Int64          Int64
─────┼───────────────────────────────────────
   1 │    110862           1911          436
=#

Find patients with no visits to a healthcare provider.

qᵢ = From(visit_occurrence) |>
     Select(Get.person_id)

qₒ = From(person) |>
     Where(Fun."not in"(Get.person_id, qᵢ))

sql = render(qₒ, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
WHERE ("person_1"."person_id" NOT IN (
  SELECT "visit_occurrence_1"."person_id"
  FROM "visit_occurrence" AS "visit_occurrence_1"
))
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
0×3 DataFrame
=#

An inner query may depend on the data from the outer query. Such inner queries are called correlated. In FunSQL, correlated queries are created using the Bind node. Specifically, in the body of a correlated query we use query parameters to refer to the external data. The Bind node, which wrap the correlated query, binds each parameter to an expression evaluated in the context of the outer query.

Find all visits where at least one condition was diagnosed.

using FunSQL: Bind

CorrelatedCondition(person_id, start_date, end_date) =
    From(condition_occurrence) |>
    Where(Fun.and(Get.person_id .== Var.person_id,
                  Fun.between(Get.condition_start_date, Var.start_date, Var.end_date))) |>
    Bind(:person_id => person_id,
         :start_date => start_date,
         :end_date => end_date)

q = From(visit_occurrence) |>
    Where(Fun.exists(CorrelatedCondition(Get.person_id, Get.visit_start_date, Get.visit_end_date)))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "visit_occurrence_1"."visit_occurrence_id",
  "visit_occurrence_1"."person_id",
  "visit_occurrence_1"."visit_concept_id",
  "visit_occurrence_1"."visit_start_date",
  "visit_occurrence_1"."visit_end_date"
FROM "visit_occurrence" AS "visit_occurrence_1"
WHERE (EXISTS (
  SELECT NULL
  FROM "condition_occurrence" AS "condition_occurrence_1"
  WHERE
    ("condition_occurrence_1"."person_id" = "visit_occurrence_1"."person_id") AND
    ("condition_occurrence_1"."condition_start_date" BETWEEN "visit_occurrence_1"."visit_start_date" AND "visit_occurrence_1"."visit_end_date")
))
=#

`Order` and `Limit`

The database server emits the output rows in an arbitrary order. In fact, different runs of the same query may produce rows in a different order. To specify a particular order of output rows, we use the Order node.

Show patients ordered by the year of birth.

using FunSQL: Order

q = From(person) |>
    Order(Get.year_of_birth)

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth"
=#

The Asc and the Desc modifiers specify whether to sort the rows in an ascending or in a descending order.

Show patients ordered by the year of birth in the descending order.

using FunSQL: Desc

q = From(person) |>
    Order(Get.year_of_birth |> Desc())

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth" DESC
=#

The Limit node lets us take a slice of the input dataset. To make the output deterministic, Limit must be applied right after Order.

Show the top three oldest patients.

using FunSQL: Limit

q = From(person) |>
    Order(Get.year_of_birth) |>
    Limit(1:3)

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "person_1"."person_id",
  "person_1"."year_of_birth",
  "person_1"."location_id"
FROM "person" AS "person_1"
ORDER BY "person_1"."year_of_birth"
LIMIT 3
OFFSET 0
=#

`Append` and `Iterate`

The Append node concatenates two or more input datasets. Only the columns that are present in every input dataset will be included to the output.

Show all clinical events (visits and conditions) associated with each patient.

using FunSQL: Append

q = From(visit_occurrence) |>
    Define(:type => "visit", :date => Get.visit_start_date) |>
    Append(From(condition_occurrence) |>
           Define(:type => "condition", :date => Get.condition_start_date)) |>
    Order(Get.person_id, Get.date)

sql = render(q, dialect = :sqlite)

print(sql)
#=>
SELECT
  "union_1"."person_id",
  "union_1"."type",
  "union_1"."date"
FROM (
  SELECT
    "visit_occurrence_1"."person_id",
    'visit' AS "type",
    "visit_occurrence_1"."visit_start_date" AS "date"
  FROM "visit_occurrence" AS "visit_occurrence_1"
  UNION ALL
  SELECT
    "condition_occurrence_1"."person_id",
    'condition' AS "type",
    "condition_occurrence_1"."condition_start_date" AS "date"
  FROM "condition_occurrence" AS "condition_occurrence_1"
) AS "union_1"
ORDER BY
  "union_1"."person_id",
  "union_1"."date"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
53×3 DataFrame
 Row │ person_id  type       date
     │ Int64      String     String
─────┼──────────────────────────────────
   1 │      1780  visit      2008-04-09
   2 │      1780  visit      2008-04-10
   3 │      1780  condition  2008-04-10
   4 │      1780  visit      2008-11-22
   5 │      1780  condition  2008-11-22
   6 │      1780  visit      2009-05-22
   7 │      1780  condition  2009-05-22
   8 │     30091  visit      2008-11-12
  ⋮  │     ⋮          ⋮          ⋮
  47 │    110862  condition  2008-09-07
  48 │    110862  visit      2009-06-30
  49 │    110862  condition  2009-06-30
  50 │    110862  visit      2009-09-30
  51 │    110862  condition  2009-09-30
  52 │    110862  visit      2010-06-07
  53 │    110862  condition  2010-06-07
                         38 rows omitted
=#

For a second example, consider the table concept, which contains the vocabulary of medical concepts (such as Myocardial Infarction). These concepts may be related to each other (Myocardial Infarction has a subtype Acute Myocardial Infarction), and their relationships are stored in the table concept_relationship. We can encapsulate construction of a query that finds immediate subtypes as the function:

SubtypesOf(base) =
    From(concept) |>
    Join(From(concept_relationship) |>
         Where(Get.relationship_id .== "Is a"),
         on = Get.concept_id .== Get.concept_id_1) |>
    Join(:base => base,
         on = Get.concept_id_2 .== Get.base.concept_id)

Show the concept "Myocardial Infarction" and its immediate subtypes.

base = From(concept) |>
       Where(Get.concept_name .== "Myocardial infarction")

q = base |> Append(SubtypesOf(base))

sql = render(q, dialect = :sqlite)

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
2×2 DataFrame
 Row │ concept_id  concept_name
     │ Int64       String
─────┼─────────────────────────────────────────
   1 │    4329847  Myocardial infarction
   2 │     312327  Acute myocardial infarction
=#

But how can we fetch not just immediate, but all of the subtypes of a concept?

Show the concept "Myocardial Infarction" and all of its subtypes.

A good start is to repeatedly apply SubtypesOf and concatenate all the outputs:

base |>
Append(SubtypesOf(base),
       SubtypesOf(SubtypesOf(base)),
       SubtypesOf(SubtypesOf(SubtypesOf(base))),
       SubtypesOf(SubtypesOf(SubtypesOf(SubtypesOf(base)))))

However we do not know if 4 iterations of SubtypesOf is enough to fully traverse the concept hierarchy. Ideally, we should continue applying SubtypesOf until the last iteration produces an empty output. This is exactly the action of the Iterate node.

using FunSQL: Iterate

q = base |>
    Iterate(:subtype => SubtypesOf(From(:subtype)))

sql = render(q, dialect = :sqlite)

print(sql)
#=>
WITH RECURSIVE "subtype_1" ("concept_id", "concept_name") AS (
  SELECT
    "concept_1"."concept_id",
    "concept_1"."concept_name"
  FROM "concept" AS "concept_1"
  WHERE ("concept_1"."concept_name" = 'Myocardial infarction')
  UNION ALL
  SELECT
    "concept_2"."concept_id",
    "concept_2"."concept_name"
  FROM "concept" AS "concept_2"
  JOIN (
    SELECT
      "concept_relationship_1"."concept_id_1",
      "concept_relationship_1"."concept_id_2"
    FROM "concept_relationship" AS "concept_relationship_1"
    WHERE ("concept_relationship_1"."relationship_id" = 'Is a')
  ) AS "concept_relationship_2" ON ("concept_2"."concept_id" = "concept_relationship_2"."concept_id_1")
  JOIN "subtype_1" ON ("concept_relationship_2"."concept_id_2" = "subtype_1"."concept_id")
)
SELECT
  "subtype_1"."concept_id",
  "subtype_1"."concept_name"
FROM "subtype_1"
=#

res = DBInterface.execute(conn, sql)

DataFrame(res)
#=>
6×2 DataFrame
 Row │ concept_id  concept_name
     │ Int64       String
─────┼───────────────────────────────────────────────
   1 │    4329847  Myocardial infarction
   2 │     312327  Acute myocardial infarction
   3 │     434376  Acute myocardial infarction of a…
   4 │     438170  Acute myocardial infarction of i…
   5 │     438438  Acute myocardial infarction of a…
   6 │     444406  Acute subendocardial infarction
=#