Package Guide
Installation
To install the package, type ] in the Julia REPL to enter Pkg mode and run

pkg> add https://github.com/xKDR/CRRao.jl

Equivalently, from the Julia prompt:

julia> using Pkg
julia> Pkg.add(url="https://github.com/xKDR/CRRao.jl")
Tutorial: Frequentist Linear Regression
The goal of the CRRao package is to unify the calling of a variety of statistical models under the same API. Note that this is different from what something like StatsAPI.jl does: rather than introducing namespaces for package development, CRRao calls those packages through a uniform API. A very similar package exists in the R world: the Zelig Project.
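To make the idea concrete, here is a preview of the pattern: a frequentist and a Bayesian model are both trained through the same fit entry point, differing only in the model, link and prior objects passed in. Both calls are walked through in full in the tutorials below.

julia> # frequentist: ordinary least squares linear regression
julia> fit(@formula(MPG ~ HP + WT + Gear), mtcars, LinearRegression())

julia> # Bayesian: logistic regression with a Logit link and a Ridge prior
julia> fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge())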
To see how this API works, we will go through an example in which we train a linear regression model with the usual ordinary least squares method (which falls under the frequentist viewpoint of statistics). We will be working with the mtcars dataset.
We first import the required packages.
julia> using CRRao, RDatasets, StatsPlots, Plots, StatsModels
Then we import the dataset.
julia> mtcars = dataset("datasets", "mtcars")
32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
     │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
   3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
   5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
   6 │ Valiant               18.1      6    225.0    105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
  26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
  27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
  29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
  31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                    5 columns and 17 rows omitted
This dataset has 11 columns (barring the car model names). We want to train a linear regression model that predicts the MPG of a car from the attributes HP, WT and Gear. We can represent this as a StatsModels.jl formula term (of type StatsModels.FormulaTerm), which looks like

MPG ~ HP + WT + Gear

More information about such terms can be found in the StatsModels.jl documentation.
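Such a term can also be built and inspected on its own with the @formula macro; a small sketch (the display below is what recent StatsModels versions print for a formula whose variables have not yet been resolved against data, and the name f is ours):

julia> f = @formula(MPG ~ HP + WT + Gear)
FormulaTerm
Response:
  MPG(unknown)
Predictors:
  HP(unknown)
  WT(unknown)
  Gear(unknown)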
Next, we train a linear regression model.
julia> model = fit(@formula(MPG ~ HP + WT + Gear), mtcars, LinearRegression())
Model Class: Linear Regression
Likelihood Mode: Gauss
Link Function: Identity
Computing Method: Optimization
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  32.0137     4.63226      6.91    <1e-06  22.5249     41.5024
HP           -0.0367861  0.00989146  -3.72    0.0009  -0.0570478  -0.0165243
WT           -3.19781    0.846546    -3.78    0.0008  -4.93188    -1.46374
Gear          1.01998    0.851408     1.20    0.2410  -0.72405     2.76401
────────────────────────────────────────────────────────────────────────────
As we can see, a table of coefficients has been printed for us. We can now query further details of the fitted model through the various getter functions that apply to frequentist models. For example:
julia> coeftable(model)
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  32.0137     4.63226      6.91    <1e-06  22.5249     41.5024
HP           -0.0367861  0.00989146  -3.72    0.0009  -0.0570478  -0.0165243
WT           -3.19781    0.846546    -3.78    0.0008  -4.93188    -1.46374
Gear          1.01998    0.851408     1.20    0.2410  -0.72405     2.76401
────────────────────────────────────────────────────────────────────────────
julia> sigma(model)
2.5741691724978972
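The remaining getters follow the same pattern. Here is a sketch of a few more calls (function names as listed in the Frequentist Regression Models section of the API reference; outputs omitted):

julia> coef(model)           # vector of estimated coefficients
julia> loglikelihood(model)  # log-likelihood of the fitted model
julia> aic(model)            # Akaike information criterion
julia> bic(model)            # Bayesian information criterion
julia> r2(model)             # coefficient of determination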
We can also get the predicted response of the model, along with other measures like the vector of Cook's distances, using the predict and cooksdistance functions exported by CRRao. Here's a plot of the vector of Cook's distances.

julia> plot(cooksdistance(model))
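To relate the fitted model back to the data, one can compare predictions against the observed responses. A sketch, assuming predict(model) with no new data returns the in-sample predictions:

julia> ŷ = predict(model);  # predicted MPG for each car in mtcars

julia> scatter(mtcars.MPG, ŷ; xlabel = "observed MPG", ylabel = "predicted MPG", legend = false)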
To learn more about these functions, how frequentist models work in CRRao in general, and the complete set of getter functions available, please visit the Frequentist Regression Models section of the API reference.
Tutorial: Bayesian Logistic Regression
Next, let's see an example of Bayesian statistical inference with CRRao. In this example, we will perform Bayesian logistic regression on the turnout dataset from R's Zelig, using the Logit link function with a Ridge prior (Prior_Ridge).
With this example, we'll also showcase how to use random number generators to get reproducible results. For this, we will use the StableRNGs package (although any random number generator can be used). So, first we import the required modules.
julia> using CRRao, RDatasets, StableRNGs, StatsModels
Then, we use a StableRNG with random seed 123 as our random number generator.
julia> CRRao.set_rng(StableRNG(123))
StableRNGs.LehmerRNG(state=0x000000000000000000000000000000f7)
(Check the documentation of the CRRao.set_rng method for more details.) Next, let us load our dataset.
julia> turnout = dataset("Zelig", "turnout")
2000×5 DataFrame
  Row │ Race   Age    Educate  Income   Vote
      │ Cat…   Int32  Float64  Float64  Int32
──────┼───────────────────────────────────────
    1 │ white     60     14.0   3.3458      1
    2 │ white     51     10.0   1.8561      0
    3 │ white     24     12.0   0.6304      0
    4 │ white     38      8.0   3.4183      1
    5 │ white     25     12.0   2.7852      1
    6 │ white     67     12.0   2.3866      1
    7 │ white     40     12.0   4.2857      0
    8 │ white     56     10.0   9.3205      1
  ⋮   │   ⋮      ⋮        ⋮       ⋮       ⋮
 1994 │ white     58     12.0   0.1936      0
 1995 │ white     22      7.0   0.2364      0
 1996 │ white     26     16.0   3.3834      0
 1997 │ white     34     12.0   2.917       1
 1998 │ white     51     16.0   7.8949      1
 1999 │ white     22     10.0   2.4811      0
 2000 │ white     59     10.0   0.5523      0
                             1985 rows omitted
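Before fitting, it can help to glance at summary statistics of the columns; a sketch using describe from DataFrames.jl (loaded explicitly here, though RDatasets already depends on it):

julia> using DataFrames

julia> describe(turnout)  # column-wise mean, min, median, max and missing counts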
Finally, we perform the inference using our proposed model.
julia> model = fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge())
Chains MCMC chain (1000×18×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 89.47 seconds
Compute duration  = 89.47 seconds
parameters        = λ, β[1], β[2], β[3], β[4], β[5]
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           λ    1.5154    0.5936     0.0188    0.0232   650.1558    1.0005     ⋯
        β[1]   -2.8489    0.3136     0.0099    0.0128   526.0745    1.0013     ⋯
        β[2]    0.0270    0.0033     0.0001    0.0001   871.0469    1.0003     ⋯
        β[3]    0.2293    0.1431     0.0045    0.0053   629.1772    0.9993     ⋯
        β[4]    0.1777    0.0266     0.0008    0.0011   647.4307    1.0012     ⋯
        β[5]    0.1665    0.0203     0.0006    0.0008   648.0996    0.9993     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.7856    1.1200    1.3898    1.7379    3.0078
        β[1]   -3.4941   -3.0498   -2.8372   -2.6271   -2.2694
        β[2]    0.0212    0.0248    0.0269    0.0291    0.0338
        β[3]   -0.0362    0.1337    0.2326    0.3247    0.5048
        β[4]    0.1262    0.1601    0.1771    0.1959    0.2283
        β[5]    0.1295    0.1534    0.1650    0.1797    0.2091
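Because the sampler draws from the global RNG we configured with CRRao.set_rng, re-seeding before a second fit makes the run reproducible. A sketch (model2 is our name for the repeated fit):

julia> CRRao.set_rng(StableRNG(123))

julia> model2 = fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge());

The chain stored in model2 contains the same draws as the one shown above.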