Julia is great, but it still has many limitations when put to work for statistics. With the help of Viral, Alan, and @bkamins, a bunch of us have started a systematic attack on this problem. We are going after it both top-down and bottom-up:
Top-down: We form opinions about important gaps in the Julia package ecosystem and set about filling them.
Bottom-up: We have begun full implementations of applied statistics research papers in Julia, discovering limitations along the way. We file bug reports, tests and feature requests, and also try to build the requisite code, either as PRs or as new packages.
Here is what we have done so far:
- Improvements to GLM: We have added one new distribution (Geometric) and one new link function (Power), along with some small improvements (PR, PR). Coming up: (1) A choice between the (existing) fast+imprecise Cholesky decomposition and (a new) slower+precise QR decomposition in the iterative least squares, (2) Improved handling of collinear data, and (3) A paper comparing the R, Julia and SAS GLM implementations on features, performance and correctness.
- A framework for statistical models (CRRao.jl): Applied statisticians value a consistent API across a wide variety of statistical models. The package provides a consistent API and a group of models that are ready to use. Coming up: We will write more models, and we hope others will also build new models in this framework. We will build out the API to support a sensible workflow for statistical modelling.
- A time series class (“TSx”): Working with time series data requires a set of metaphors and operations. Our TSx package is syntactic sugar on top of the powerful capabilities of DataFrames.jl and is easily maintainable. In previous years, we have been intensive users of zoo and xts in R for working with financial and macroeconomic data. We have brought these experiences to bear on the design of TSx, while keeping the design flexible enough to incorporate use cases from other fields.
- Working with survey data (Survey.jl): We have a small set of much-used functions for working with survey data.
- Working with the VIIRS night lights data (NighttimeLights.jl): Satellite imagery of nighttime lights is a valuable path to observing economic prosperity at high frequency and high resolution. This package is a complete set of steps for cleaning and bias-correcting the raw data that’s released by NASA/NOAA.
- Distance-to-Default: We have implemented the Merton Model in DtD.jl to measure the credit risk of a firm.
- Small improvements: We have a good Lowess.jl.
- Coming up: Measuring the precision of simple statistical calculations (NISTTests.jl): We are packaging a group of test cases from the US NIST, as functions that measure the precision of a supplied Julia function.
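To make "measuring the precision of a supplied function" concrete, here is a minimal sketch of one common approach: score a computed value against a certified value using the log relative error (LRE), which is roughly the number of correct significant digits. It is written in Python purely for illustration and is not the NISTTests.jl API. The function names, the 15-digit cap and the dataset are all assumptions made for this example. The data are synthetic (a large offset plus small integers), not an actual NIST dataset, and the "certified" value is obtained with exact rational arithmetic.

```python
import math
from fractions import Fraction

MAX_DIGITS = 15.0  # assumed cap, roughly the precision of a 64-bit float


def log_relative_error(computed, certified):
    """Approximate count of correct significant digits in `computed`."""
    if computed == certified:
        return MAX_DIGITS
    if certified == 0:
        lre = -math.log10(abs(computed))
    else:
        lre = -math.log10(abs(computed - certified) / abs(certified))
    return min(MAX_DIGITS, max(0.0, lre))


def variance_one_pass(xs):
    """Textbook 'sum of squares' formula; prone to cancellation."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return (ss - s * s / n) / (n - 1)


def variance_two_pass(xs):
    """Subtract the mean first, then sum the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def exact_variance(xs):
    """Reference value computed in exact rational arithmetic."""
    fx = [Fraction(x) for x in xs]
    n = len(fx)
    mean = sum(fx) / n
    return float(sum((x - mean) ** 2 for x in fx) / (n - 1))


# Synthetic, hypothetical test case: a large offset stresses cancellation.
data = [1e9 + k for k in range(1, 11)]
certified = exact_variance(data)

lre_one = log_relative_error(variance_one_pass(data), certified)
lre_two = log_relative_error(variance_two_pass(data), certified)
print(f"one-pass LRE: {lre_one:.1f}, two-pass LRE: {lre_two:.1f}")
```

On this input the two-pass variance should score near the cap, while the one-pass formula should lose nearly all of its digits. That kind of gap between two algebraically equivalent calculations is what a precision test suite is meant to expose.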
We are keen to make these good; please do criticise our work and help us make it better.
Our main page is https://github.com/xKDR . The people in this project are:
From XKDR Forum: Susan Thomas, Ayush Patnaik, Ajay Shah.
Independent researchers: Mousum Dutta, Chirag Anand.
From Chennai Mathematical Institute: Sourish Das, undergraduate students (Siddhant Chaudhary, Harsh Arora, Naman Kumar), master's students (Arnab Sen, Anisha Saha, Tanuj Sur, Sumeet Suley, Gudeet Siyan).
We welcome your interest and involvement in carrying this work forward. Folks in India: the team is located in Bombay, Delhi, Madras and Pune, and we can readily meet up in any of these places. We are part of the 2022 Google Summer of Code.