Julia is great, but there are many limitations in putting it to work for doing statistics. With the help of Viral, Alan, and @bkamins, a bunch of us got going on a systematic attack on this problem. We have gone after this top-down and bottom-up:
Top down: We form opinions about important gaps in the Julia package ecosystem and set about filling these.
Bottom up: We have begun full implementations of applied statistics research papers in Julia, discovering limitations along the way. We file bug reports, tests, and feature requests, and we also try to build the requisite code, either as PRs or as new packages.
We have done:
Improvements to GLM: We have added one new distribution (Geometric), one new link function (Power), and some small improvements (PR, PR). Coming up: (1) a choice between the existing fast but imprecise Cholesky decomposition and a new slower but more precise QR decomposition in the iterative least squares, (2) improved handling of collinear data, and (3) a paper comparing the R/Julia/SAS GLM implementations on features, performance, and correctness.
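For a flavour of what the new pieces look like in use, here is a minimal sketch (it assumes a GLM.jl version that already includes the Geometric distribution and PowerLink; the data and column names are made up):

```julia
using DataFrames, Distributions, GLM

# made-up data: one predictor and a count response
df = DataFrame(x = rand(100), y = rand(0:10, 100))

# Geometric response with a log link
m1 = glm(@formula(y ~ x), df, Geometric(), LogLink())

# PowerLink with exponent 0.5 (a square-root link), here with a Poisson response
m2 = glm(@formula(y ~ x), df, Poisson(), PowerLink(0.5))

coeftable(m1)
```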
A framework for statistical models (CRRao.jl): Applied statisticians value a consistent API across a wide variety of statistical models. The package provides such a consistent API, along with a group of models that are ready to use. Coming up: We will write more models, and we hope others will also build new models in this framework. We will build out the API to support a sensible workflow for statistical modelling.
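To illustrate the idea of a single entry point across model families, here is a rough sketch of the intended usage (the function and type names here are illustrative and may not match the current CRRao.jl API exactly):

```julia
using CRRao, DataFrames, StatsModels

df = DataFrame(x1 = randn(50), x2 = randn(50))
df.y = 1 .+ 2 .* df.x1 .- df.x2 .+ 0.1 .* randn(50)

# one entry point; the model family is passed as an argument
m_freq = fit(@formula(y ~ x1 + x2), df, LinearRegression())

# same formula and data; a Bayesian variant is selected by an extra (prior) argument
m_bayes = fit(@formula(y ~ x1 + x2), df, LinearRegression(), Prior_Ridge())
```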
A time series class (“TSx”): Working with time series data requires a set of metaphors and operations. Our TSx package is syntactic sugar on top of the powerful capabilities of DataFrames.jl and is easily maintainable. In previous years we have been intensive users of zoo and xts in R for working with financial and macro data, and we have brought those experiences to bear on the design of TSx, while keeping it flexible enough to incorporate use cases from other fields.
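A minimal sketch of the intended feel, assuming the TS type wraps a DataFrame whose first column is the date index (the exact constructor signature may differ):

```julia
using DataFrames, Dates, TSx

# daily data as a plain DataFrame; the first column holds the date index
dates = Date(2021, 1, 1):Day(1):Date(2021, 12, 31)
df = DataFrame(Index = collect(dates), price = cumsum(randn(length(dates))))

ts = TS(df)      # wrap the DataFrame in a time series object

ts.coredata      # the underlying DataFrame stays accessible
```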
Working with survey data (Survey.jl): We have a small set of much-used functions for working with survey data.
Working with the VIIRS night lights data (NighttimeLights.jl): Satellite imagery of nighttime lights is a valuable path to observing economic prosperity at high frequency and high resolution. This package is a complete set of steps for cleaning and bias-correcting the raw data released by NASA/NOAA.
Distance-to-Default: We have implemented the Merton model in DtD.jl to measure the credit risk of a firm.
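The core quantity is the standard Merton distance-to-default. A minimal sketch of the formula is below; DtD.jl's actual interface may differ, and the step of backing out asset value and volatility from equity data is omitted here:

```julia
using Distributions  # for the standard normal CDF

# Merton distance-to-default, given (already estimated) asset value V,
# asset volatility σ, drift μ, face value of debt F, and horizon T.
function distance_to_default(V, σ, μ, F, T)
    dd = (log(V / F) + (μ - 0.5 * σ^2) * T) / (σ * sqrt(T))
    pd = cdf(Normal(), -dd)   # implied probability of default
    return (dd = dd, pd = pd)
end

distance_to_default(120.0, 0.25, 0.05, 100.0, 1.0)
```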
Coming up: Measuring the precision of simple statistical calculations (NISTTests.jl): We are packaging a group of test cases from the US NIST, as functions that measure the precision of a supplied Julia function.
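The idea, roughly, is to compare a user-supplied function against certified values using the log relative error (the number of correct digits). A hypothetical sketch, not the actual NISTTests.jl API:

```julia
using Statistics

# log relative error: roughly the number of correct significant digits,
# capped at ~machine precision
function log_relative_error(computed, certified)
    lre = -log10(abs(computed - certified) / abs(certified))
    return clamp(lre, 0.0, 15.0)
end

# e.g. check a standard-deviation implementation against the exact value
x = [1.0, 2.0, 3.0, 4.0, 5.0]
log_relative_error(std(x), sqrt(2.5))   # sqrt(2.5) is the exact sample s.d. here
```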
We are keen to make these good; please do criticise our work and help us make it better.
We welcome your interest and involvement in carrying this work forward. Folks in India: the team is located in Bombay, Delhi, Madras and Pune: we can readily meet up in any of these places. We are part of the 2022 Google Summer of Code.
Thanks for all of this! I was starting to suspect that a larger project underlay all these cool PRs.
I’m happy to discuss any gaps you identify in the existing API in StatsAPI, DataAPI or other packages so that we can develop a consistent interface across the ecosystem.
Re GLM.jl: The QR decomposition is already implemented, it’s just not exposed in a convenient user-facing way (e.g. via a keyword argument to glm()/lm()).
Several of the JuliaStats contributors (myself, @ararslan, @dave.f.kleinschmidt) have day jobs at Beacon Biosignals. In addition to our contributions to the JuliaStats organization, we’ve also open-sourced several statistical packages developed in our professional roles:
Effects.jl: similar in spirit to R’s effects package, but with a quite different (more Julian) interface.
StandardizedPredictors.jl: do scaling, centering, and other standardization as part of your model formula. It displays the relevant bits of info (e.g. the center point) nicely in your coefficient table and plays well with things like predict. (See the sketch below for how the two fit together.)
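A quick sketch of how these fit together with GLM (the column names and data are made up; the exact keyword and function names are best checked against each package’s docs):

```julia
using DataFrames, GLM, StandardizedPredictors, Effects

df = DataFrame(x = randn(100), g = rand(["a", "b"], 100))
df.y = 1 .+ 2 .* df.x .+ randn(100)

# center x as part of the model formula, via the contrasts mechanism
m = lm(@formula(y ~ x + g), df; contrasts = Dict(:x => Center()))

# effects-style predictions over a small reference grid for x
effects(Dict(:x => [-1.0, 0.0, 1.0]), m)
```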
We also have a few more contributions planned in the foreseeable future.
A few more specific comments on some of the topics you raised:
As @palday noted, GLM already supports the QR decomposition, but it’s not the default. We discussed this in the past, and I agree it would make sense to use it by default, as Cholesky seems to be less robust in ill-conditioned cases (including nearly-collinear variables) where R succeeds.
I don’t completely understand the goal of this package. Thanks to StatsModels (and StatsAPI), modeling packages are already supposed to use a unified API. This can certainly be improved, but I would think that requires adding new elements to the common API (keyword argument names…) and ensuring that each package implements them, rather than having One Package to Rule Them All. Is there something I’m missing? (I must say I don’t really get the principle of Zelig either.)
This is an area that is particularly useful for a social scientist like me, so a big +1!
Nice. Wrapping a DataFrame is indeed an interesting alternative to TimeArray when you need to store heterogeneous columns. It would be good to try using consistent interfaces across TSx, TimeSeries and DataFrames where it makes sense.
Yes, it would be good to have consistent interfaces across TSx and TimeSeries. However, we are not considering DataFrames because, IMHO, the DataFrames syntax isn’t the best for time-series-style data manipulation. If users do want the DataFrames syntax, they can always use the TS.coredata property to manipulate the underlying DataFrame directly, albeit with certain precautions.
As of now, the TSx interface somewhat resembles the zoo and xts syntax in R, but I think we can create method aliases to give users the option of using the TimeSeries interfaces.
Thanks a lot, @palday and @nalimilan.
Yes, we already have the QR decomposition in place; we need to add a few more delbeta! methods to accommodate multicollinearity/rank deficiency in lm and glm models.
So our plan is something like the following:
Move forward with PR #340 to add the multicollinearity feature in the existing GLM
I have personally noticed that a lot of features you would expect from a multivariate statistics course are missing, for instance a variety of normality tests and other tests that you can readily find in R.
I would love to be part of this effort to revamp Julia’s stats capabilities. Is there a way I or anyone else could join this concerted effort?
Can I ask: is there a reason why HypothesisTests.jl’s Anderson-Darling test doesn’t work for you? It’s probably the most common test for normality, alongside Shapiro-Wilk and Kolmogorov-Smirnov. I believe the AD test was chosen over Shapiro-Wilk because SW does poorly against thick tails and datasets with many ties, and is also not generalizable: it only works for testing the normal distribution, unlike Anderson-Darling, which can be used to test against any distribution. KS has very bad power in general.
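For reference, a minimal sketch with a recent HypothesisTests.jl (distributions and sample sizes here are arbitrary):

```julia
using Distributions, HypothesisTests

x = rand(Normal(0, 1), 200)

# one-sample Anderson-Darling goodness-of-fit test against N(0, 1)
pvalue(OneSampleADTest(x, Normal(0, 1)))

# Kolmogorov-Smirnov against the same distribution, for comparison
pvalue(ExactOneSampleKSTest(x, Normal(0, 1)))
```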
FWIW, I think AD needs a custom null distribution of the test statistic computed for each non-Normal distribution you want to test. So it can in theory test many distributions, but only a subset are supported.
The Anderson-Darling test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test and the disadvantage that critical values must be calculated for each distribution.
The Shapiro-Wilk test was developed specifically for testing normality: if you have a sample, say X1, X2, …, Xn ~ f(x), you test H0: f is Normal vs. Ha: f is not Normal.
On the other hand, the purpose of the two-sample Anderson-Darling test is completely different: it checks whether two samples are generated from the same distribution. That is, if you have one sample {X1, X2, …, Xn} ~ f(x) and another sample {Y1, Y2, …, Ym} ~ g(y), you would like to test:
H0: f(x) = g(y) vs. Ha: f(x) ≠ g(y)
This does not require any distributional assumption; it is a non-parametric test. Now, what you can do is the following: suppose you have only {X1, X2, …, Xn} ~ f(x), and you want to check
H0: f(x) = N(0,1) vs. Ha: f(x) ≠ N(0,1)
You can simulate {Y1, Y2, …, Ym} from N(0,1) and plug both samples into the AD test. This is the practice that is typically followed. I believe the statistical power varies significantly depending on the distributions involved.
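A minimal sketch of that practice, using the k-sample Anderson-Darling test from HypothesisTests.jl (the distributions and sample sizes here are arbitrary):

```julia
using Distributions, HypothesisTests

x = rand(TDist(5), 200)          # the sample whose distribution is in question
y = rand(Normal(0, 1), 10_000)   # a large sample simulated from the hypothesised N(0, 1)

# k-sample AD test of "both samples come from the same distribution"
pvalue(KSampleADTest(x, y))
```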