[ANN] DataFitting.jl general purpose data fitting framework

Dear community,

I’m happy to announce that the first working (and reasonably complete) version of the DataFitting.jl package is available on GitHub.

The main purpose of DataFitting is to provide a general purpose data fitting framework for Julia, allowing users to fit observational data against theoretical models (even very complex ones) in a simple and very fast way.

The key points of DataFitting are:

  • it handles data of any dimensionality;
  • the fitting model is built up from individual components, either provided by DataFitting or implemented by the user; all components are combined to evaluate the final model with a standard Julia mathematical expression;
  • all component results are cached, so that repeated evaluations with the same parameters do not trigger further calculations (see the conceptual sketch after this list). This is very important to speed up the fitting process when many components are involved;
  • it makes it easy to use different minimizers and compare their results and performance. Currently two minimizers are supported (CMPFit and LsqFit);
  • it provides several facilities for data exploration, interactive fitting, and display of results.
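To illustrate the caching idea (a conceptual sketch only, not DataFitting’s actual internals):

mutable struct CachedComponent{F}
    f::F                          # component function of (domain, params)
    last_params::Vector{Float64}  # parameters used in the last evaluation
    last_result::Vector{Float64}  # cached result of the last evaluation
end
CachedComponent(f) = CachedComponent(f, Float64[], Float64[])

function evaluate!(c::CachedComponent, domain, params::Vector{Float64})
    if params != c.last_params    # recompute only when the parameters changed
        c.last_result = c.f(domain, params)
        c.last_params = copy(params)
    end
    return c.last_result
end

# A Gaussian component evaluated twice with the same parameters:
# the second call is served from the cache without recomputation.
gauss(x, p) = p[1] .* exp.(-0.5 .* ((x .- p[2]) ./ p[3]).^2)
c = CachedComponent(gauss)
y1 = evaluate!(c, 0:0.05:6, [1.0, 2.0, 0.2])
y2 = evaluate!(c, 0:0.05:6, [1.0, 2.0, 0.2])   # cache hit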

The development of DataFitting started a few weeks ago because I wanted to port my software for spectral data fitting (QSFit) to Julia, and it turned out that it could easily be adapted to handle similar problems in other research fields.

The package already provides all the basic functionality, and the first release is planned for Feb. 2018.

Any comments, suggestions, bug reports, contributions, and criticism are very welcome!


Interesting. From the source it looks like maximum likelihood estimation, but the concept is not mentioned at all in the package; can you clarify what the statistical methodology is?

Also, you seem to use likelihoods directly; aren’t you concerned about overflow/underflow?

The methodology actually depends on the minimizer: DataFitting only deals with the creation and evaluation of the model. However, the only minimizers currently supported (CMPFit and LsqFit) perform least squares minimization.

Hopefully more minimizers will be available in the future; e.g. one dealing with the Cash statistic (for count data) would be very useful. Adding them to DataFitting is also very simple: the wrapper for LsqFit is less than 40 lines.
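For instance, a hypothetical wrapper around LsqFit might look like the sketch below (the minimize function and the returned tuple are made-up names for illustration, and a recent LsqFit is assumed; the actual DataFitting wrapper differs in its details):

using LsqFit

# evalmodel(x, p) must evaluate the model on the domain x with parameters p.
function minimize(evalmodel::Function, x::AbstractVector, y::AbstractVector,
                  guess::Vector{Float64})
    fit = LsqFit.curve_fit(evalmodel, x, y, guess)  # least squares minimization
    return (best = fit.param,                       # best fit parameter values
            unc  = LsqFit.stderror(fit))            # 1σ parameter uncertainties
end

# Usage with a simple quadratic model:
f(x, p) = p[1] .+ p[2] .* x .+ p[3] .* x.^2
xdata = collect(0.0:0.1:10.0)
ydata = f(xdata, [1.0, 2.0, 0.5]) .+ 0.1 .* randn(length(xdata))
res = minimize(f, xdata, ydata, [0.0, 1.0, 1.0])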

I don’t think I understood the question on overflow/underflow, since this is handled by the minimizer. The quantities calculated by DataFitting (in the eval_residuals and eval_cost methods) are just for display purposes.
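For reference, under the least squares approach adopted by both supported minimizers the quantity being minimized is the usual weighted sum of squared residuals,

$$\chi^2 = \sum_i \left( \frac{d_i - m_i(\theta)}{\sigma_i} \right)^2 ,$$

where $d_i$ are the measured data, $\sigma_i$ their uncertainties, and $m_i(\theta)$ the model evaluated with parameters $\theta$.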

Are you sure? This looks like a likelihood to me.

If it under/overflows, it will do so before it gets to the minimizer. This can easily happen with exponentials; that’s why log-likelihoods are used in practice.

Does it fit an empirical copula? And then estimate the Gaussian copula from the empirical CDF?

What you linked is just the calculation of a Gaussian component, not the cost function to be minimized.
Moreover, the argument of the exponential is always ≤ 0, hence it cannot overflow.

Finally, for the purpose of model evaluation in data fitting, an overflow will result in an Inf or zero, which is actually desirable: the user will understand there is something wrong with the model formulation (and DataFitting tells you that something went wrong while displaying the results of the evaluation).
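To make this concrete with a quick REPL check: since the Gaussian exponent is always ≤ 0, exp can only shrink toward zero, and with Float64 it underflows to exactly 0.0 rather than overflowing:

exp(0.0)      # 1.0   (the largest value the Gaussian kernel can attain)
exp(-746.0)   # 0.0   (underflow: below the smallest subnormal Float64)
exp(746.0)    # Inf   (overflow only occurs for positive arguments)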

Also, an underflow is typically not an issue because you’re comparing a model with noisy data.

Yes, you can fit any data (of any dimensionality) against any model (however complex).
However, you need at least a clue about what the empirical copula looks like, in order to model it appropriately and fit it against the data.

Once fitted, you can use the resulting model parameters to calculate the CDF.

I am using the term for the exponent, which can easily underflow; e.g. the exponent of a Float64 has 11 bits. This is a well-known problem for likelihood-based methods, and it occurs with modest amounts of data. Getting a 0 over a large area will not guide your optimization very well. However, if the package is good enough for you as it is, I will not belabor this point further.
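To illustrate the point (a standalone example using Distributions.jl, not related to DataFitting itself):

using Distributions

d = Normal(0.0, 1.0)
x = rand(d, 1000)      # a modest sample of 1000 points
prod(pdf.(d, x))       # 0.0: the product of the individual likelihoods underflows
sum(logpdf.(d, x))     # ≈ -1420: the log-likelihood remains perfectly representable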

We’ll never get 0 over a large area, since the residuals to be minimized (i.e. empirical data − theoretical model) are always scattered by the instrumental uncertainties, which (for typical applications) are orders of magnitude larger than the machine epsilon.

Looks interesting, I’ll definitely check it out next time I need to fit some data. I really like the interface of LMFIT, and it looks like you have many of the same features. It’s hard to tell from the README, but is there a way to get a list/collection of all the parameters of a model before and/or after fitting?

Sorry… the documentation is yet to be written; it will be available with the first release in Feb. 2018.

However the answer is yes. You can:

  • fix a parameter value using mdesc.param[PARAMETER_NAME].fixed = true (see the example in the README);
  • tie a parameter value to a mathematical expression involving other parameters with mdesc.param[PARAMETER_NAME].tied = :(any valid mathematical expression) (see the example in the README);
  • read/write guess parameters before running the minimizer, e.g. in the “1D data fitting” example you may write model.param[:line1__sigma].value = 0.3;
  • retrieve all guess parameters (as an array) with getfield.(model.param.values, :value);
  • retrieve all best fit parameters and their uncertainties (as arrays) with getfield.(fitres.param.values, :value) and getfield.(fitres.param.values, :error) (a combined sketch of these operations follows below).
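Putting a few of these together (a sketch based on the “1D data fitting” example in the README; model and data are assumed to be defined as in that example):

model.param[:line1__sigma].value = 0.3          # set a guess value
model.param[:line1__sigma].fixed = false        # ensure the parameter is free
fitres = fit!(model, data)                      # run the minimizer
best = getfield.(fitres.param.values, :value)   # best fit values
unc  = getfield.(fitres.param.values, :error)   # parameter uncertainties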

…forgot to add: in the REPL you can simply type model.param[: and hit TAB to get a list of parameter names to choose from. The same applies to fitres.param.

Once you have chosen the parameter you’re interested in, you can access its values as shown above.


Looks like a nice package!

It works with LsqFit, but I can’t make it work with CMPFit, using the following (from the README):

using CMPFit
result = fit!(model, data2, minimizer=CMPFit.Minimizer())

This throws the following error (with CMPFit v0.3.1 installed):

UndefVarError: Minimizer not defined

It seems to be the only minimizer that supports bounded parameters.

Dear @Djoull, sorry for the late answer…
That package has changed its name: it is now called GFit.jl.

I’m still working on it, so you will likely find further errors in the documentation. However, the package already has all the functionality, and I’m already using it to fit very complex models.

I hope to release the first version in a few weeks. Stay tuned… :wink:


@gcalderone GFit.jl looks interesting. Would it be suitable for constructing a generalized system for fitting a model which consists of a continuum/background model plus N individual “Gaussian-like peaks” where N varies from data set to data set? So essentially there would be two sub-models - a continuum model and a peak model but N instances of the peak model with different parameters.

Definitely yes!
It is exactly how I am using it (fitting several emission lines + a background in the spectrum of an astronomical source).

However, I don’t yet have the documentation ready. I’ll try to produce a minimal working example here later this evening.

I’m really interested. I’ll check back.

Seems nice, I will give it a try!

Here’s a minimal working example showing GFit.jl’s capabilities:

using Pkg
Pkg.add("CMPFit")
Pkg.add(url="https://github.com/gcalderone/GFit.jl", rev="master")
Pkg.add(url="https://github.com/lnicastro/GFitViewer.jl", rev="master")

using GFit, GFitViewer

# Setup domain and model
λ = Domain(0:0.05:6)
model = Model(λ,
              :line1 => GFit.Gaussian(1, 2, 0.2),
              :line2 => GFit.Gaussian(1, 3, 0.5),
              :bkg => GFit.OffsetSlope(0.5, 1, 0.1))

# Generate mock measurements (with 10% noise)
noise = 0.1
true_model = model()
data = Measures(true_model .+ noise .* randn(length(λ)), noise)

# Fit model to data
bestfit = fit!(model, data)

# Display best fit model
viewer(model, data, bestfit)

# Print best fit values and uncertainties for a few parameters
println("line1.norm: ", bestfit[:line1].norm.val, " ± ", bestfit[:line1].norm.unc)
println("bkg.slope: ",  bestfit[:bkg].slope.val , " ± ", bestfit[:bkg].slope.unc)

# ...or loop through all free parameters
for cname in keys(model)
    for par in propertynames(bestfit[cname])
        getproperty(bestfit[cname], par).fixed && continue
        println(cname, ".", par, ": ", getproperty(bestfit[cname], par).val, " ± ", getproperty(bestfit[cname], par).unc)
    end
end

# Fix a parameter value
model[:line1].center.val = 2
model[:line1].center.fixed = true

# Link two parameters through a mathematical expression
patch!(model) do m
    m[:line2].center = m[:line1].center + 1
end
model[:line2].center.fixed = true  # no need to fit this parameter, hence we freeze it

# Fit updated model to data using CMPFit minimizer
bestfit = fit!(model, data, minimizer=GFit.cmpfit())

# Display best fit model along with all components
viewer(model, data, bestfit, showcomps=true)

As anticipated, GFit and GFitViewer still lack documentation, hence they are not yet ready to be registered and released. But most of their functionality is already in place.


@gcalderone Thanks! The linking of parameters is particularly interesting.