# DataFitting.jl

`DataFitting`

is a general purpose data-driven model fitting framework for Julia.

(works both on Julia v0.6 and v0.7/1.0)

(previous name of this package was ModelFit.jl)

## Key features

- it handles data of any dimensionality;
- the fitting model is evaluated on a user provided (Cartesian or Linear) domain;
- the fitting model is built up by individual
*components*, either provided by the companion package`DataFittingBasics.jl`

or implemented by the user. All components are combined to evaluate the final model with a standard Julia mathematical expression; - all components results are cached, so that repeated evaluations with the same parameters do not involve further calculations. This is very important to speed up the fitting process when many components are involved;
- User provided components can pre-compute quantities based on the model domain before running the fitting process, and store them in a private structure for future re-use;
- Model parameters can be fixed to a specific value, limited in a interval, or be dynamically calculated using a mathematical expression involving the other parameters;
- it allows fitting multiple data sets;
- it easily allows to use different minimizers, and compare their results and performances. Currently two minimizers are supported (LsqFit and CMPFit);
- provides several facilities for interactive fitting and results displaying.

## Installation

On Julia v0.6:

Pkg.clone("https://github.com/gcalderone/DataFitting.jl.git")

On Julia v0.7/v1.0:

] dev https://github.com/gcalderone/DataFitting.jl

## Simple example

Assume the model to be compared with empirical data has 5 parameters and the foolowing analytical formula:

f(x, p1, p2, p3, p4, p5) = @. (p1 + p2 * x + p3 * x^2 + p4 * sin(p5 * x)) * cos(x)

To simulate a measurement process we'll evaluate the model on a domain and add some random noise to a model realization:

# Actual model parameters:
params = [1, 1.e-3, 1.e-6, 4, 5]
# Domain for model evaluation
x = 1:0.05:10000
# Evaluated model
y = f(x, params...);
# Random noise
rng = MersenneTwister(0);
noise = randn(rng, length(x));

In order to use the `DataFitting`

framework we must create a `Domain`

and a `Measures`

objects to encapsulate the model domain and empirical data:

using DataFitting
dom = Domain(x)
data = Measures(y + noise, 1.)

The second argument to the `Measures`

function is the (1 sigma Gaussian) uncertainty associated to each data sample. It can either be an array with the same shape as the first argument or a scalar. In the latter case all data samples are assumed to have the same uncertainty.

Also, we must create a `Model`

object containing a reference to the analytical formula, and prepare it for evaluation on the domain `dom`

model1 = Model(:comp1 => FuncWrap(f, params...))
prepare!(model1, dom, :comp1)

The `comp1`

symbol is the name we chose to identify the component in the model. Any other valid symbol could have been used.

The parameter initial values are those given in the component constructors. Such values can be changed as follows:

model1[:comp1].p[1].val = 1
model1[:comp1].p[2].val = 1.e-3

etc.

Finally, we are ready to fit the model against empirical data:

result1 = fit!(model1, data)

Note that the `fit!`

function modifies the `model1`

objects as follows:

- it updates the model parameter with the best fit ones;
- it updates the internal cache of component evaluations with those resulting from best fit parameters;
- it updates the evaluation counter for each component (see below);

The procedure outlined above may seem cumbersome at first, however it is key to define very complex models and to improve performances, as shown below.

## Multiple components

A model is typically made up of many *components*, joined toghether with a standard mathematical expression. The previous example can easily be re-implemented by splitting the analytical formula in 3 parts:

f1(x, p1, p2, p3) = @. p1 + p2 * x + p3 * x^2
f2(x, p4, p5) = @. p4 * sin(p5 * x)
f3(x) = cos.(x)
model2 = Model(:comp1 => FuncWrap(f1, params[1], params[2], params[3]),
:comp2 => FuncWrap(f2, params[4], params[5]),
:comp3 => FuncWrap(f3))
prepare!(model2, dom, :((comp1 + comp2) * comp3))
result2 = fit!(model2, data)

Now we used 3 components (named `comp1`

, `comp2`

and `comp3`

) and combined them with the mathematical expression in the last argument of the call to `prepare!`

. **Any** valid mathematical expression is allowed.

Note that we obtained exactly the same result as before, but we **gained a factor ~2** in execution time. Such perfromance improvement is due to the fact that the component evaluations are internally cached, and are re-evaluated only if necessary, i.e. when one of the parameter value is modified by the minimzer algorithm. In this particular case the `comp3`

component, having no free parameter, is evaluated only once.

To check how many time each component and the model are evaluated simply type the name of the `Model`

object in the REPL or call the `show`

method, i.e.: `show(model2)`

, and look at the `Counter`

column. To reset the counters type:

resetcounters!(model2)

## Fitting multiple data sets

`DataFitting`

allows to fit multiple data sets simultaneously.

Suppose you observe the same phenomenon with two different instruments and wish to use both data sets to constrain the model parameters. Here we will simulate a second data sets by adding random noise to the previously calculated model, and creating a second `Measures`

object:

noise = randn(rng, length(x));
data2 = Measures(1.3 * (y + noise), 1.3)

Note that we multiplied all data by a factor 1.3, to simulate a different calibration between the instruments. To take into account such calibration we push a further component into the model, named `calib`

:

push!(model2, :calib, SimpleParam(1))

and prepare the model to evaluate a second expression:

prepare!(model2, dom, :(calib * ((comp1 + comp2) * comp3)))

Note that a call to `prepare!`

modifies the `Model`

object by adding a new (possibly different) expression to be evaluated. If needed, also the `Domain`

object used for the second expression may be different from the first one. The latter posssibility allows, for instance, to fit multiple data sets each observed with different instruments.

Now we can fit both data sets as follows:

result2 = fit!(model2, [data, data2])

## Retrieve results

The results of the fitting are available as a `FitResult`

structure, as returned by the `fit!`

fuction. The structure dump for the `result2`

in the example above is as follows:

dump(result2)
DataFitting.FitResult
fitter: DataFitting.Minimizer DataFitting.Minimizer()
param: DataFitting.HashVector{DataFitting.FitParameter}
keys: Array{Symbol}((6,))
1: Symbol comp1__p1
2: Symbol comp1__p2
3: Symbol comp1__p3
4: Symbol comp2__p1
5: Symbol comp2__p2
6: Symbol calib__val
values: Array{DataFitting.FitParameter}((6,))
1: DataFitting.FitParameter
val: Float64 1.0034771773408524
unc: Float64 0.0029174421450895533
2: DataFitting.FitParameter
val: Float64 0.0010022369285940917
unc: Float64 1.3473265492873467e-6
3: DataFitting.FitParameter
val: Float64 9.996494571427457e-7
unc: Float64 1.3148179368867407e-10
4: DataFitting.FitParameter
val: Float64 4.005022216534921
unc: Float64 0.0013760967902191378
5: DataFitting.FitParameter
val: Float64 4.999999879104177
unc: Float64 5.998373321161513e-8
6: DataFitting.FitParameter
val: Float64 1.299996982192236
unc: Float64 4.979637254678212e-5
ndata: Int64 399962
dof: Int64 399956
cost: Float64 400117.9725389123
status: Symbol Optimal
elapsed: Float64 1.736617397

From this structure the user can retrieve the parameter best fit values and uncertainties, the number of data samples, the number of degrees of freedom, the total chi-squared (`cost`

) and the fitting elapsed time in seconds.

In particular, the best fit value and uncertanty for `p1`

can be retrieved as follows:

println(result2.param[:comp1__p1].val)
println(result2.param[:comp1__p1].unc)

Note that the parameter is identified by using both the component name and parameter name, separated by a double underscore `__`

.

To plot the results use the following arrays:

- Abscissa:
`dom[1]`

. For multi dimensional domains you can use`dom[2]`

,`dom[3]`

, etc.; - Ordinate:
- expression 1, i.e.
`(comp1 + comp2) * comp3`

:`model()`

or`model2(1)`

; - expression 2, i.e.
`calib * ((comp1 + comp2) * comp3)`

:`model2(2)`

; `comp1`

component:`model2(:comp1)`

;`comp2`

component:`model2(:comp2)`

;`comp3`

component:`model2(:comp3)`

;`calib`

component:`model2(:calib)`

;

- expression 1, i.e.

In the following example I will use the Gnuplot.jl package, but the user can choose any other package:

using Gnuplot
@gp xr=(1000, 1010) :-
@gp :- dom[1] model2(:comp1) "w l tit 'comp1'" :-
@gp :- dom[1] model2(:comp2) "w l tit 'comp2'" :-
@gp :- dom[1] model2() "w lines tit 'Model' lw 3"

To evaluate the model with a different parameter value you can pass one (or more) `Pair{Symbol,Number}`

to `model2()`

,i.e.:

@gp :- dom[1] model2(:comp2__p1=>0.) "w lines tit 'p4=0' lw 3"

## The `FuncWrap`

and `SimpleParam`

components

`FuncWrap`

and `SimpleParam`

are the only built-in components of the `DataFitting`

package. Further components are available in the `DataFittingBasics.jl`

package.

`Funcwrap`

The `FuncWrap`

is simply a wrapper to a user defined function of the form `f(x, [y], [z], [further dimensions...], p1, p2, [further parameters...])`

. The `x`

, `y`

, `z`

arguments will be `Vector{Float64}`

with the same number of elements, while `p1`

, `p2`

, etc. will be scalar floats. The function must return a `Vector{Float64}`

(regardless of thenumber of dimensions) with the same number of elements as `x`

. This components works with domains of any dimensionality.

The constructor is defined as follows:

FuncWrap(func::Function, args...)

where `args...`

is a list of numbers.

The parameters are:

`p::Vector{Parameter}`

: vector of parameters for the user defined function.

`SimpleParam`

The `SimpleParam`

represent a scalar component in the model, whose value is given by the `val`

parameter. This components works with domains of any dimensionality.

The constructor is defined as follows:

SimpleParam(val::Number)

The parameters are:

`val::Parameter`

: the scalar value.

## Profile a component

The `DataFitting`

framework provides an effective way to check a component performance by means of the `test_component`

function.

For instance, you may wonder if the `FuncWrap`

component used in the previous example as a wrapper to the `f`

function introduces any performance penalty. With `test_component`

you may simply compare the wrapper performance with that of a simple loop, i.e.:

test_component(dom, FuncWrap(f, params...), 1000)
@time for i in 1:1000
dummy = f(x, params...)
end

As you can see there is no significant difference in looping 1000 times on model evaluations or using a simple loop on `f`

.

## Using the `CMPFit`

minimizer

The `DataFitting`

package uses the LsqFit minimizer by default, but it allows to use the CMPFit as well. The latter typically provides better performances with respect to the former, but since `CMPFit`

is a wrapper to a C library it is not a pure-Julia solution. The better performances are due to a different minimization strategy, not to C vs. Julia differences.

To use the `CMPFit`

minimzer (after having installed the package):

using CMPFit
result2 = fit!(model2, [data, data2], minimizer=CMPFit.Minimizer())

## Multidimensional domains

**IMPORTANT NOTE**: by default `the `

DataFitting` package defines structure only for 1D and 2D fitting. To handle higher dimensional cases you should generate the appropriate code with, e.g.:

DataFitting.@code_ndim 3

`N`

-dimensional `Domain`

objects comes in two flavours: linear and cartesian ones:

- Linear domain: contains a
`N`

-dimensional tuple of coordinates, one for each data sample. It is analogous to`Vector{NTuple{N, Number}}`

; - Cartesian domain: contains
`N`

arrays of coordinates, one for each dimension. Optionally contains a 1D list of index to select a subset of all possible combinations of coordinates. It is analogous to`Vector{Vector{Number}}`

, whose length of outer vector is`N`

;

Linear domains are created using the `Domain`

function, providing as many arguments as the number of dimensions. Each argument can either be an integer (specifying how many samples are defined along each axis), or a vector of `float`

s. **The length of all dimensions must be exactly the same.** Examples:

# 1D
dom = Domain(5)
dom = Domain(1.:5)
dom = Domain([1,2,3,4,5.])
# 2D
dom = Domain(5, 5)
dom = Domain(1.:5, [1,2,3,4,5.])

Note that the length of all the above domains is 5.

Cartesian domains are created using the `CartesianDomain`

function, providing as many arguments as the number of dimensions. There is no 1-dimensional cartesian domain, hence `CartesianDomain`

requires at least two arguments. Each argument can either be an integer (specifying how many samples are defined along each axis), or a vector of `float`

s. **The length of dimensions may be different.** Examples:

# 2D
dom = CartesianDomain(5, 6)
dom = CartesianDomain(1.:5, [1,2,3,4,5,6.])

The length of all the above domains is 30, i.e. it is equal to the product of the lengths of all dimensions.

Typically, the model can be evaluated over the cartesian product of all dimensions. However, there can be specific locations of the domain for which there is not empirical data to compare with, making the model evaluation useless. In these cases it is possible to select a subset of the cartesian domain using a 1D linear index, e.g.:

dom = CartesianDomain(1.:5, [1,2,3,4,5,6.], index=collect(0:4) .* 6 .+ 1)

The length of such domain is 5, equal to the length of the vector passed as `index`

keyword.

A cartesian domain can always be transformed into a linear domain, while the inverse is usually not possible. To check how a "flattened" version of a cartesian domain looks like you can use the `DataFitting.flatten`

function, i.e.:

dom = CartesianDomain(1.:5, [1,2,3,4,5,6.], index=collect(0:4) .* 6 .+ 1)
lin = DataFitting.flatten(dom)

Actually, all models are always evaluated on "flattened", i.e. linear, domains.

To get the vector of coordinates for dimensions 1, 2, etc. of the `dom`

object use the `dom[1]`

, `dom[2]`

, etc. syntax. For linear domains all such vectors have the same length, for cartesian domains the lengths may differ.

## Multidimensional fitting

As an example we will fit a 2D plane. The analytic function will accept two vectors for coordinates x and y, and 2 parameters.

f(x, y, p1, p2) = @. p1 * x + p2 * y

This function will be called during model evaluation, where only linear domains are involved, hence we are sure that both the `x`

and `y`

vectors will have the same lengths.

To create a multidimensional `Measures`

object we will populate a 2D array and use it as argument to the `Measures`

function:

dom = CartesianDomain(30, 40)
d = fill(0., size(dom));
for i in 1:length(dom[1])
for j in 1:length(dom[2])
d[i,j] = f(dom[1][i], dom[2][j], 1.2, 2.4)
end
end
data = Measures(d + randn(rng, size(d)), 1.)

To fit the model proceed in the usual way:

model = Model(:comp1 => FuncWrap(f, 1, 2))
prepare!(model, dom, :comp1)
result = fit!(model, data)

## Interactive Use

`DataFitting.jl`

provides several facilities for interactive use on the REPL:

- all the types (i.e.
`Domain`

,`Measures`

,`Model`

and`FitResult`

) have a dedicated`show`

method to quickly and easily inspect their contents. Simply type the name of the object to run this method; - To get the list of currently defined components in a model simply type
`model[:<TAB>`

; - To get a list of parameter for a component simply type
`model[:<COMPONENT NAME>]`

, e.g.`model[:comp1]`

. Remember that the component parameter can either be scalar`Parameter`

or`Vector{Parameter}`

. In the latter case the parameter name shown in the REPL has an integer index appended; - To get the list of model parameter in a result simply type
`result.param[:<TAB>`

;

## Parameter settings

Each model parameter has a few settings that can be tweaked by the user before running the actul fit:

`.val`

(float): guess initial value;`.low`

(float): lower limit for the value (default:`-Inf`

);`.high`

(float): upper limit for the value (default:`+Inf`

);`.fixed`

(bool): false for free parameters, true for fixed ones (default:`false`

);`.expr`

(string): a mathematical expression to bind the parameter value to other parameters (default:`""`

);

**Important note**: the default minimizer (LsqFit do not supports bounded parameters, while CMPFit) supports them.

Considering the previous example we can limit the interval for `p1`

, and fix the value for `p2`

as follows:

model.comp[:comp1].p[1].val = 1 # guess initial value
model.comp[:comp1].p[1].low = 0.5 # lower limit
model.comp[:comp1].p[1].high = 1.5 # upper limit
model.comp[:comp1].p[2].val = 2.4
model.comp[:comp1].p[2].fixed = true
result = fit!(model, data, minimizer=CMPFit.Minimizer())

To remove the limits on `p1`

simply set their bounds to +/- Inf:

model.comp[:comp1].p[1].low = -Inf
model.comp[:comp1].p[1].high = +Inf

As another example we may constrain `p2`

to always be twice the value of `p1`

:

model.comp[:comp1].p[2].expr = "comp1__p1 + comp1__p2"
model.comp[:comp1].p[2].fixed = false

**Important note:** each time you change one (or more) parameter expression(s) you should call `prepare!`

passing just the model as argument. This is required to recompile the model:

prepare!(model)
result = fit!(model, data)

Note that in the example above we had to fix the `p2`

parameter otherwise the minizer will try to find a best fit for a parameter which has no influence on the final model, since its value will always be overwritten by the expression. The only situation where the parameter should be left free is when the expression involves the parameter itself, e.g.:

model.comp[:comp1].p2.expr = "comp1__p1 + comp1__p2"
model.comp[:comp1].p2.fixed = false
prepare!(model)
result = fit!(model, data)

## Component settings

A model component can be disabled altoghether and its evaluation fixed to a specific value by calling `setcompvalue!`

, e.g.:

setcompvalue!(model, :comp1, 10)

In further evaluation the `comp1`

component will always evaluate to 10. Note thet the component parameters are completely ignored in this case, so be sure to have at least one free parameter before running the fit.

To restore the normal component behaviour simply set its value to `NaN`

, i.e.:

setcompvalue!(model, :comp1, NaN)