[ANN] DataFitting package


#1

Dear community,

I’m happy to announce that the first working (and reasonably complete) version of the DataFitting.jl package is available on GitHub.

The main purpose of DataFitting is to provide a general-purpose data fitting framework for Julia, allowing users to fit observational data against theoretical models (even very complex ones) in a simple and very fast way.

The key points of DataFitting are:

  • it handles data of any dimensionality;
  • the fitting model is built up from individual components, either provided by DataFitting or implemented by the user. All components are combined to evaluate the final model with a standard Julia mathematical expression;
  • all component results are cached, so that repeated evaluations with the same parameters do not involve further calculations. This is very important to speed up the fitting process when many components are involved;
  • it makes it easy to use different minimizers and to compare their results and performance. Currently two minimizers are supported (CMPFit and LsqFit);
  • it provides several facilities for data exploration, interactive fitting, and displaying of results.
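
To illustrate the caching idea from the list above, here is a minimal sketch. It is not DataFitting's actual implementation: the `CachedComponent` type and the `evaluate!` function are hypothetical names, and the sketch assumes the domain `x` stays the same between calls, so only the parameters need to be compared.

```julia
# Hypothetical cached component (illustration only, not the DataFitting API).
mutable struct CachedComponent{F}
    f::F                          # function (x, params) -> model values
    last_params::Vector{Float64}  # parameters used for the cached result
    last_result::Vector{Float64}  # cached model values
    ncalls::Int                   # number of real evaluations performed
end

CachedComponent(f) = CachedComponent(f, Float64[], Float64[], 0)

function evaluate!(c::CachedComponent, x::Vector{Float64}, params::Vector{Float64})
    # Re-evaluate only when the parameters differ from the cached ones.
    # Assumption: x does not change between calls.
    if c.ncalls == 0 || params != c.last_params
        c.last_result = c.f(x, params)
        c.last_params = copy(params)
        c.ncalls += 1
    end
    return c.last_result
end

x = collect(0.0:1.0:4.0)
# A Gaussian line with parameters p = [normalization, center, sigma].
line = CachedComponent((x, p) -> p[1] .* exp.(-0.5 .* ((x .- p[2]) ./ p[3]) .^ 2))

y1 = evaluate!(line, x, [1.0, 2.0, 0.5])
y2 = evaluate!(line, x, [1.0, 2.0, 0.5])   # same parameters: served from the cache
y3 = evaluate!(line, x, [1.0, 2.5, 0.5])   # center changed: evaluated again
```

With several such components, a minimizer that changes only one parameter at a time would trigger a real evaluation only for the component that owns that parameter.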

The development of DataFitting started a few weeks ago because I wanted to port my software for spectral data fitting (QSFit) to Julia, and it turned out that it can easily be adapted to handle similar problems in other research fields.

The package already provides all the basic functionality, and the first release is planned for Feb. 2018.

Any comments, suggestions, bug reports, contributions and criticism are very welcome!


#2

Interesting. From the source it looks like maximum likelihood estimation, but the concept is not mentioned at all in the package. Can you clarify what the statistical methodology is?

Also, you seem to use likelihoods directly, aren’t you concerned about overflow/underflow?


#3

The methodology actually depends on the minimizer: DataFitting only deals with the creation and evaluation of the model. However, the only minimizers currently supported (CMPFit and LsqFit) perform least squares minimization.

Hopefully more minimizers will be available in the future, e.g. one dealing with the Cash statistic (for count data) would be very useful. Adding them to DataFitting is also very simple: the wrapper for LsqFit is less than 40 lines.
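
For readers unfamiliar with the two cost functions mentioned here, the following sketch contrasts them. The function names `chi2` and `cash` are hypothetical and not part of DataFitting, CMPFit or LsqFit. The Cash statistic is written in its usual form, up to a constant that does not depend on the model, and assumes strictly positive model values.

```julia
# Weighted least squares cost: squared residuals normalized by the uncertainties.
chi2(data, model, err) = sum(abs2, (data .- model) ./ err)

# Cash statistic for Poisson-distributed counts, up to a model-independent constant.
# Assumption: all model values are strictly positive.
cash(counts, model) = 2 * sum(model .- counts .* log.(model))

counts = [3.0, 5.0, 2.0]

c_best = cash(counts, counts)          # model equal to the observed counts
c_high = cash(counts, 1.1 .* counts)   # model 10% too high
c_low  = cash(counts, 0.9 .* counts)   # model 10% too low

q = chi2([1.0, 2.0], [0.0, 0.0], [1.0, 2.0])   # each residual is exactly 1 sigma
```

Both functions reach their minimum when the model matches the data, so either could in principle be handed to a generic minimizer.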

I don’t think I understood the question on overflow/underflow, since this is handled by the minimizer. The residuals and cost calculated by DataFitting (in the eval_residuals and eval_cost methods) are just for display purposes.


#4

Are you sure? This looks like a likelihood to me.

If it under/overflows, it will do so before it gets to the minimizer. This can easily happen with exponentials, which is why log likelihoods are used in practice.
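
A small sketch of the underflow concern raised here, assuming Julia 1.x. The helper names `gauss_pdf` and `gauss_logpdf` and the data are made up for illustration and are not taken from the package.

```julia
# Gaussian density and its logarithm for a single observation.
gauss_pdf(x, mu, sigma) = exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2pi))
gauss_logpdf(x, mu, sigma) = -0.5 * ((x - mu) / sigma)^2 - log(sigma * sqrt(2pi))

# 2000 hypothetical observations, all exactly at the mean (the best possible case).
x = fill(0.0, 2000)

# Product of densities: each factor is about 0.4, so the product underflows to zero.
likelihood = prod(gauss_pdf.(x, 0.0, 1.0))

# Sum of log densities: the same information, but it stays finite.
loglikelihood = sum(gauss_logpdf.(x, 0.0, 1.0))
```

Here `likelihood` is exactly zero in Float64 although the model fits perfectly, while `loglikelihood` is a finite number of order -1838 that an optimizer can still work with.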


#5

Does it fit an empirical copula? And then estimate the Gaussian copula from the empirical CDF?


#6

What you linked is just the calculation of a Gaussian component, not the cost function to be minimized.
Moreover the argument of the exponential is always <= 0, hence it can’t overflow.

Finally, for the purpose of model evaluation for data fitting, an overflow will result in an Inf or zero result, which is desirable because the user will understand there is something wrong with the model formulation (and DataFitting tells you that something went wrong while displaying the results of evaluation).

Also, an underflow is typically not an issue because you’re comparing a model with noisy data.


#7

Yes, you can fit any data (of any dimensionality) against any model (however complex).
However, you need at least a clue about what the empirical copula looks like, in order to model it appropriately and fit it against the data.

Once fitted, you can use the resulting model parameters to calculate the CDF.


#8

I am using the term for the exponent, which can easily underflow, e.g. the exponent of a Float64 has 11 bits. This is a well-known problem for likelihood-based methods, and it occurs with modest amounts of data. Getting a 0 over a large area will not guide your optimization very well. However, if the package is good enough for you as it is, I will not belabor this point further.
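
The point about a flat zero region can be sketched as follows, assuming Julia 1.x (where `floatmin` replaces the older `realmin`). The helpers `gauss` and `loggauss` are hypothetical and only illustrate the numerical behaviour of an exponential with a large negative argument.

```julia
# Unnormalized Gaussian profile and its logarithm (hypothetical helper names).
gauss(x, mu, sigma) = exp(-0.5 * ((x - mu) / sigma)^2)
loggauss(x, mu, sigma) = -0.5 * ((x - mu) / sigma)^2

smallest_normal = floatmin(Float64)   # smallest positive normal Float64

g40 = gauss(40.0, 0.0, 1.0)   # exp(-800): below the representable range
g50 = gauss(50.0, 0.0, 1.0)   # exp(-1250): also below the representable range

l40 = loggauss(40.0, 0.0, 1.0)   # -800.0
l50 = loggauss(50.0, 0.0, 1.0)   # -1250.0
```

Both `g40` and `g50` evaluate to exactly zero, so they carry no information about which point is closer to the center, whereas the log values still differ.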


#9

We’ll never get 0 on a large area since the residuals to be minimized (i.e. the empirical data - theoretical model) are always scattered by instrumental uncertainties, which (for typical applications) are orders of magnitude larger than machine epsilon.


#10

Looks interesting, I’ll definitely check it out next time I need to fit some data. I really like the interface of LMFIT, and it looks like you have many of the same features. It’s hard to tell from the README, but is there a way to get a list/collection of all the parameters of a model before and/or after fitting?


#11

Sorry… the documentation is yet to be written, it will be available with the first release in Feb. 2018.

However the answer is yes. You can:

  • fix a parameter value using mdesc.param[PARAMETER_NAME].fixed = true (see example in the README);
  • tie a parameter value to a mathematical expression involving other parameters with mdesc.param[PARAMETER_NAME].tied = :(any valid mathematical expression) (see example in the README);
  • read/write guess parameters before running the minimizer. E.g. in the “1D data fitting” example you may write model.param[:line1__sigma].value = 0.3;
  • retrieve all guess parameters (as an array) with getfield.(model.param.values, :value);
  • retrieve all best fit parameters and their uncertainties (as arrays) with getfield.(fitres.param.values, :value) and getfield.(fitres.param.values, :error).
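
The `getfield.(…, :value)` idiom in the last two items is plain Julia broadcasting and works on any collection of structs. A minimal sketch follows, using a hypothetical `Param` type as a stand-in; the real parameter type in DataFitting may have different fields.

```julia
# Hypothetical stand-in for a parameter record (not the actual DataFitting type).
struct Param
    value::Float64
    error::Float64
    fixed::Bool
end

params = [Param(0.3, 0.02, false), Param(1.5, 0.0, true)]

# Broadcast getfield over the collection; the Symbol is treated as a scalar.
vals = getfield.(params, :value)
errs = getfield.(params, :error)

# Values of the parameters that are left free in the fit.
free_vals = [p.value for p in params if !p.fixed]
```

The result of each broadcast is an ordinary `Vector{Float64}`, in the same order as the parameters in the collection.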

#12

…forgot to add: in the REPL you can simply type:
model.param[: and hit TAB to get a list of parameter names to choose from. The same applies to fitres.param.

Once you have chosen the parameter you’re interested in, you can access its values as shown above.