Package Guide
Installation
To install the package, type ] in the Julia REPL to enter Pkg mode and run

pkg> add https://github.com/xKDR/CRRao.jl

Equivalently, from the Julia prompt:

julia> using Pkg
julia> Pkg.add(url="https://github.com/xKDR/CRRao.jl")
Tutorial: Frequentist Linear Regression
The goal of the CRRao package is to unify the calling of a variety of statistical models under the same API. Note that this is different from what something like StatsAPI.jl does: rather than introducing namespaces for package development, CRRao calls those packages through a uniform API. A very similar package exists in the R world: the Zelig Project.
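To make the idea concrete, here is a preview of the pattern: a frequentist and a Bayesian model are both trained through the same fit entry point, differing only in the model, link and prior objects passed in. Both calls are walked through in full in the tutorials below.

julia> # frequentist: ordinary least squares linear regression
julia> fit(@formula(MPG ~ HP + WT + Gear), mtcars, LinearRegression())

julia> # Bayesian: logistic regression with a Logit link and a Ridge prior
julia> fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge())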
To see how this API works, we will go through an example in which we train a linear regression model with the usual ordinary least squares method (which falls under the frequentist viewpoint of statistics). We will be working with the mtcars dataset.
We first import the required packages.
julia> using CRRao, RDatasets, StatsPlots, Plots, StatsModels
Then we import the dataset.
julia> mtcars = dataset("datasets", "mtcars")
32×12 DataFrame
 Row │ Model              MPG      Cyl    Disp     HP     DRat     WT       QS ⋯
     │ String31           Float64  Int64  Float64  Int64  Float64  Float64  Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Mazda RX4             21.0      6    160.0    110     3.9     2.62      ⋯
   2 │ Mazda RX4 Wag         21.0      6    160.0    110     3.9     2.875
   3 │ Datsun 710            22.8      4    108.0     93     3.85    2.32
   4 │ Hornet 4 Drive        21.4      6    258.0    110     3.08    3.215
   5 │ Hornet Sportabout     18.7      8    360.0    175     3.15    3.44      ⋯
   6 │ Valiant               18.1      6    225.0    105     2.76    3.46
   7 │ Duster 360            14.3      8    360.0    245     3.21    3.57
   8 │ Merc 240D             24.4      4    146.7     62     3.69    3.19
  ⋮  │         ⋮             ⋮       ⋮       ⋮       ⋮       ⋮        ⋮        ⋱
  26 │ Fiat X1-9             27.3      4     79.0     66     4.08    1.935     ⋯
  27 │ Porsche 914-2         26.0      4    120.3     91     4.43    2.14
  28 │ Lotus Europa          30.4      4     95.1    113     3.77    1.513
  29 │ Ford Pantera L        15.8      8    351.0    264     4.22    3.17
  30 │ Ferrari Dino          19.7      6    145.0    175     3.62    2.77      ⋯
  31 │ Maserati Bora         15.0      8    301.0    335     3.54    3.57
  32 │ Volvo 142E            21.4      4    121.0    109     4.11    2.78
                                                    5 columns and 17 rows omitted
This dataset has 11 columns (barring the car model names). We want to train a linear regression model that predicts the MPG of a car from the attributes HP, WT and Gear. We can represent this as a StatsModels.jl formula term (of type StatsModels.FormulaTerm), which looks like

MPG ~ HP + WT + Gear

More information about such terms can be found in the StatsModels.jl documentation.
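Such a term can also be built and inspected on its own with the @formula macro; a small sketch (the display below is what recent StatsModels versions print for a formula whose variables have not yet been resolved against data, and the name f is ours):

julia> f = @formula(MPG ~ HP + WT + Gear)
FormulaTerm
Response:
  MPG(unknown)
Predictors:
  HP(unknown)
  WT(unknown)
  Gear(unknown)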
Next, we train a linear regression model.
julia> model = fit(@formula(MPG ~ HP + WT + Gear), mtcars, LinearRegression())
Model Class: Linear Regression
Likelihood Mode: Gauss
Link Function: Identity
Computing Method: Optimization
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  32.0137     4.63226      6.91    <1e-06  22.5249     41.5024
HP           -0.0367861  0.00989146  -3.72    0.0009  -0.0570478  -0.0165243
WT           -3.19781    0.846546    -3.78    0.0008  -4.93188    -1.46374
Gear          1.01998    0.851408     1.20    0.2410  -0.72405     2.76401
────────────────────────────────────────────────────────────────────────────
As we can see, a table of coefficients has been printed for us. We can now query further details of the fitted model through the various getter functions that apply to frequentist models. For example:
julia> coeftable(model)
────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  32.0137     4.63226      6.91    <1e-06  22.5249     41.5024
HP           -0.0367861  0.00989146  -3.72    0.0009  -0.0570478  -0.0165243
WT           -3.19781    0.846546    -3.78    0.0008  -4.93188    -1.46374
Gear          1.01998    0.851408     1.20    0.2410  -0.72405     2.76401
────────────────────────────────────────────────────────────────────────────
julia> sigma(model)
2.5741691724978972
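The remaining getters follow the same pattern. Here is a sketch of a few more calls (function names as listed in the Frequentist Regression Models section of the API reference; outputs omitted):

julia> coef(model)           # vector of estimated coefficients
julia> loglikelihood(model)  # log-likelihood of the fitted model
julia> aic(model)            # Akaike information criterion
julia> bic(model)            # Bayesian information criterion
julia> r2(model)             # coefficient of determination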
We can also get the predicted response of the model, along with other measures like the vector of Cook's distances, using the predict and cooksdistance functions exported by CRRao. Here's a plot of the vector of Cook's distances.

julia> plot(cooksdistance(model))
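To relate the fitted model back to the data, one can compare predictions against the observed responses. A sketch, assuming predict(model) with no new data returns the in-sample predictions:

julia> ŷ = predict(model);  # predicted MPG for each car in mtcars

julia> scatter(mtcars.MPG, ŷ; xlabel = "observed MPG", ylabel = "predicted MPG", legend = false)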
To learn more about these functions, how frequentist models work in CRRao in general, and the complete set of getter functions available, please visit the Frequentist Regression Models section of the API reference.
Tutorial: Bayesian Logistic Regression
Next, let's see an example of Bayesian statistical inference with CRRao. In this example, we will perform Bayesian logistic regression on the turnout dataset from R's Zelig, using the Logit link function with a Ridge prior (Prior_Ridge).
With this example, we'll also showcase how to use random number generators to get reproducible results. For this, we will use the StableRNGs package (although any random number generator can be used). So, first we import the required modules.
julia> using CRRao, RDatasets, StableRNGs, StatsModels
Then, we use a StableRNG with random seed 123 as our random number generator.
julia> CRRao.set_rng(StableRNG(123))
StableRNGs.LehmerRNG(state=0x000000000000000000000000000000f7)
(Check the documentation of the CRRao.set_rng method for more details.) Next, let us load our dataset.
julia> turnout = dataset("Zelig", "turnout")
2000×5 DataFrame
  Row │ Race   Age    Educate  Income   Vote
      │ Cat…   Int32  Float64  Float64  Int32
──────┼───────────────────────────────────────
    1 │ white     60     14.0   3.3458      1
    2 │ white     51     10.0   1.8561      0
    3 │ white     24     12.0   0.6304      0
    4 │ white     38      8.0   3.4183      1
    5 │ white     25     12.0   2.7852      1
    6 │ white     67     12.0   2.3866      1
    7 │ white     40     12.0   4.2857      0
    8 │ white     56     10.0   9.3205      1
  ⋮   │   ⋮      ⋮        ⋮       ⋮       ⋮
 1994 │ white     58     12.0   0.1936      0
 1995 │ white     22      7.0   0.2364      0
 1996 │ white     26     16.0   3.3834      0
 1997 │ white     34     12.0   2.917       1
 1998 │ white     51     16.0   7.8949      1
 1999 │ white     22     10.0   2.4811      0
 2000 │ white     59     10.0   0.5523      0
                             1985 rows omitted
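Before fitting, it can help to glance at summary statistics of the columns; a sketch using describe from DataFrames.jl (loaded explicitly here, though RDatasets already depends on it):

julia> using DataFrames

julia> describe(turnout)  # column-wise mean, min, median, max and missing counts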
Finally, we perform the inference using our proposed model.
julia> model = fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge())
Chains MCMC chain (1000×18×1 Array{Float64, 3}):

Iterations        = 501:1:1500
Number of chains  = 1
Samples per chain = 1000
Wall duration     = 89.47 seconds
Compute duration  = 89.47 seconds
parameters        = λ, β[1], β[2], β[3], β[4], β[5]
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           λ    1.5154    0.5936     0.0188    0.0232   650.1558    1.0005     ⋯
        β[1]   -2.8489    0.3136     0.0099    0.0128   526.0745    1.0013     ⋯
        β[2]    0.0270    0.0033     0.0001    0.0001   871.0469    1.0003     ⋯
        β[3]    0.2293    0.1431     0.0045    0.0053   629.1772    0.9993     ⋯
        β[4]    0.1777    0.0266     0.0008    0.0011   647.4307    1.0012     ⋯
        β[5]    0.1665    0.0203     0.0006    0.0008   648.0996    0.9993     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.7856    1.1200    1.3898    1.7379    3.0078
        β[1]   -3.4941   -3.0498   -2.8372   -2.6271   -2.2694
        β[2]    0.0212    0.0248    0.0269    0.0291    0.0338
        β[3]   -0.0362    0.1337    0.2326    0.3247    0.5048
        β[4]    0.1262    0.1601    0.1771    0.1959    0.2283
        β[5]    0.1295    0.1534    0.1650    0.1797    0.2091
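Because the sampler draws from the global RNG we configured with CRRao.set_rng, re-seeding before a second fit makes the run reproducible. A sketch (model2 is our name for the repeated fit):

julia> CRRao.set_rng(StableRNG(123))

julia> model2 = fit(@formula(Vote ~ Age + Race + Income + Educate), turnout, LogisticRegression(), Logit(), Prior_Ridge());

The chain stored in model2 contains the same draws as the one shown above.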