I’d like to share a presentation that a colleague interested in making Julia’s ecosystem more widely used shared with me. His conjecture is that Julia has remarkable capabilities, but that if we improved usability we would benefit a much wider group of users. I’d like to share this presentation publicly and perhaps start with slide 8, which has concrete thoughts on how to do this (as do other slides). I want to see if we can convert these into actionable issues - what would be a good way to get started? In some cases it is about filing issues in relevant packages and fixing things one at a time; in other cases it is about API design or even documentation (which, as you may know or can google, comes in four types).
My main purpose here is to get a conversation started.
However, there are some outliers, such as Clustering, which expect records in columns.
It appears that MultivariateStats sometimes wants records in rows and other times in columns.
Yup, I had a demo that was frustratingly difficult to write because of this. I don’t think the fix is more APIs, with things like MLJ papering over the issue; I think it needs to be solved at the source. JuliaStats should have a manifesto that says “statistics APIs should work like this”, and then all of its packages should enforce that form. And because JuliaStats is so much of the stats ecosystem, hopefully other people will then follow that choice. The issue is that it’s not even self-consistent right now, so there is no norm to follow. Once one way gets enough steam behind it, it’ll naturally start to flip the whole ecosystem.
I think following the convention of DataFrames is well-motivated, and so just saying “this is what we will always do for the JuliaStats standard libraries”, considering anything that doesn’t follow it a bug, and opening an issue for every non-conforming case is a good start.
“Records in rows” is the right choice, but it is ironic how non-obvious that question is: Julia uses column-major order, so reinterpreting a vector of homogeneous records (tuples) as a matrix naturally puts the records in columns. So part of why it is not uniform across Julia’s ecosystem is the mismatch between what everybody does and what would be natural given the matrix layout.
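A tiny demonstration of that mismatch, in plain Base Julia: the layout-preserving reinterpretation of a vector of records gives you records in columns for free.

records = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]  # three 3-field records
M = reshape(reinterpret(Float64, records), 3, :)               # layout-preserving view of the same memory
M[:, 1]   # the first record is a *column*: [1.0, 2.0, 3.0]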
I fought with this “records in rows” thing for a while when I was building TupleVectors.jl, and found that a combination of ArraysOfArrays and ElasticArrays works out really well. With that you can work with a vector of arrays that under the hood is really just a single flat array, and still be able to push! to the end, just as you’d expect.
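For anyone curious, here is a minimal sketch of that combination (written from memory of the ArraysOfArrays and ElasticArrays APIs, so treat the details as approximate):

using ArraysOfArrays, ElasticArrays

storage = ElasticArray{Float64}(undef, 3, 0)   # a 3×0 matrix that can grow along its last axis
records = VectorOfSimilarVectors(storage)      # view it as a vector of 3-element records

push!(records, [1.0, 2.0, 3.0])                # appends a column to `storage` under the hood
push!(records, [4.0, 5.0, 6.0])

records[2]          # the second record, as a vector
flatview(records)   # the underlying 3×2 matrix, contiguous in memory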
I think there’s a lot more of this sort of thing we can do. Find the most efficient storage for the data given typical access patterns, and optimize for that. But have a layer on top that makes things intuitive for the end-user.
Given the slides, the discussion, and personal experience in doing statistical analysis with Julia I would tend to think the following:
it would be great to develop a Tables.jl-based API that is used consistently across packages; this would cover most of the simple models that users need.
still, the packages should/can provide a low-level API that relies on data that is efficiently represented for the problem at hand (using the Tables.jl interface will not be efficient for many models); an example of this is GLM.jl, which provides both a high-level and a low-level API, and I think that makes sense (a sketch of this two-level pattern follows the list).
even given these assumptions the Tables.jl interface will, in my opinion, not be enough. The major case is n-dimensional arrays (dimensions higher than 3 are usually not needed), optionally with named dimensions; I think it would be good to agree on a common standard here as well.
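To make the two-level idea concrete, here is a hypothetical sketch (fit_mymodel is made up and the numerics are a placeholder; only the Tables.jl calls are real):

using Tables

# Low-level API: works directly on an efficient representation
# (a feature matrix with observations in rows, plus a target vector).
function fit_mymodel(X::AbstractMatrix, y::AbstractVector)
    return X \ y   # placeholder for the real numerics
end

# High-level API: accepts any Tables.jl-compatible source and converts once.
function fit_mymodel(table; target::Symbol)
    cols = Tables.columns(table)
    y = collect(Tables.getcolumn(cols, target))
    features = [n for n in Tables.columnnames(cols) if n != target]
    X = reduce(hcat, (collect(Tables.getcolumn(cols, n)) for n in features))
    return fit_mymodel(X, y)
end

fit_mymodel((x1 = randn(10), x2 = randn(10), y = randn(10)); target = :y)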
The slides talk about “data in rows” vs “data in columns”, but there’s not really any indication of what that means. It seems like “data” here really refers to independent observations. Is that right?
The one place I hit this that was initially confusing to me was in Distributions.jl, where taking multiple samples of an MvNormal gives you a Matrix with independent columns. This seems easily improved using ArraysOfArrays and ElasticArrays. But the slides also talk about “records”, and I’m not sure what that is supposed to mean. I guess this is the same thing, independent observations?
Could we talk some more about the problem? Maybe some examples of inconsistencies across packages, and some things that are confusing to newcomers?
Also, what about TableTraits.jl? I haven’t kept up with either of these in any detail, so I don’t have a sense of how they compare.
You hit another issue that indeed should be mentioned - that it is not only what packages take as an input but also what they produce as an output.
I assume - as you commented - that the slides mean that observations (not necessarily independent) should be stored in rows while features should be stored in columns.
Some examples, per your request (I do not try to give an exhaustive list; a short code sketch contrasting a few of these conventions follows the list):
Distributions.jl can generate:
single observation: a number, a vector of numbers, or a matrix of numbers
a set of observations: a vector of numbers, a matrix of numbers (observations in columns), a vector of matrices
Clustering.jl:
input: a matrix (observations in columns) or a matrix (distances)
output: vectors
GLM.jl:
input: formula + table OR target vector + feature matrix (observations in rows)
output: vectors
Statistics.jl and StatsBase.jl:
input: mixed API, in general not Tables.jl aware; if a matrix is passed you can specify which dimension holds the observations
HypothesisTests.jl:
input: mixed; sometimes raw observations are expected, sometimes contingency tables; not Tables.jl aware; if observations are passed they are generally assumed to be in rows
MultivariateStats.jl:
not Tables.jl aware, different methods assume different input, e.g. PCA assumes observations in columns, while linear regression assumes observations are in rows
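To make a couple of these concrete in code (this reflects current behavior as far as I know; corrections welcome):

using LinearAlgebra, Distributions, Clustering, GLM

d = MvNormal(zeros(3), Matrix(1.0I, 3, 3))
rand(d)            # a single draw: a length-3 Vector
X = rand(d, 100)   # 100 draws: a 3×100 Matrix, observations in *columns*

kmeans(X, 2)       # Clustering.jl also wants observations in columns

y = randn(100)
lm(hcat(ones(100), permutedims(X)), y)   # GLM.jl wants observations in *rows*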
So, if you want a producer of a stream of vectors of equal dimension and a way to collect n of them into a suitable container, there is not really a canonical way to do it in Base and no agreement on how to do it. That is something more fundamental than a statistics/data problem.
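For instance, all of the following are in common use and none is canonical - which is exactly the point:

stream = (randn(3) for _ in 1:5)   # some producer of equal-length records

vv = collect(stream)               # option 1: a Vector of Vectors
M  = reduce(hcat, vv)              # option 2: a 3×5 Matrix, records in columns
Mr = permutedims(M)                # option 3: a 5×3 Matrix, records in rows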
That’s a challenge in itself; the community seems to be spread thin in some core areas, like the JuliaStats organization on GitHub.
For instance, an issue with a test returning p-values higher than one was reported in January 2018 and the problem is still there (I just checked):
julia> using HypothesisTests

julia> a = [12, 10, 7, 6, 3, 1];

julia> b = [11, 9, 8, 5, 4, 2];

julia> pvalue(MannWhitneyUTest(a, b))
1.0627705627705626
That’s why I might not entirely agree with slide 8’s strategy to encourage the use of Julia:
Standardizing the format of data in Julia is essential to making data analysis in Julia easy and consistent.
(And to encouraging people to switch from other platforms).
What prevents me from truly recommending Julia core packages for serious work in statistics is not the fact that I need to transpose data from one package to the next; though I find some formats surprising and mildly annoying, I always try to assume that there is a technical reason for them. What truly prevents me from recommending Julia packages for production - as much as I love the language, which I do - is the light engagement the community has at this point in some of these core areas.
I don’t see an easy solution for this though; perhaps reaching out to academic institutions and asking them if they would like to adopt some of these semi-orphaned packages?
Really not sure. Columnar storage is widely agreed to be the best choice for analytical workloads. The row convention, I think, came from the need for ACID and the lack of RAM typical of DBMSs designed in the ’70s.
Today, the DB landscape is diverse. Some systems are row-based and some are columnar. So I can’t agree with the statement in the slides.
I just proposed a PR to fix that specific problem here. I think the discussion of incorrect results, or of problems with the statistics ecosystem at large, is a bit of a tangent, though, and should be split into its own thread if possible.
I agree broadly with the criticisms in these slides. Two gripes in particular about input types:
1. I find it difficult to use basic HypothesisTests.jl tests. It’s hard to know when functions want two vectors, or a matrix, or summary statistics (like counts), or something else. I’m not too experienced with “classical stats”, being an economist who just runs regressions, but it’s definitely frustrating. Something that combines HypothesisTests.jl with Tables.jl would be useful.
One particular part of this frustration is that different inputs are handled via dispatch, rather than keyword arguments, which makes errors hard to reason about.
2. Just to clarify, I think we should all agree that data[1, :] should return the first observation and data[:, 1] should return the first variable, and analysis packages shouldn’t care how things are stored internally. That is to say, records are rows.
The implementation of different table types can store observations however they see fit. Performance considerations should be the job of the analysis packages. If a package really works best with a vector of vectors, where each vector is an “observation” (record), then maybe it can check Tables.rowaccess and do a conversion as needed.
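A rough sketch of what that could look like inside an analysis package (analyze is hypothetical and I’m glossing over the Tables.rowaccess fast path; only the Tables.jl calls are real):

using Tables

function analyze(table)
    X = Tables.matrix(table)    # observations in rows, variables in columns
    obs = collect(eachrow(X))   # obs[i] is the i-th record, if that’s the layout the algorithm wants
    # ... do the actual work on `obs` ...
    return length(obs)
end

analyze((a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0]))   # any Tables.jl source works, including a DataFrame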
Absolutely. ArraysOfArrays + ElasticArrays solves this problem so nicely, I hope others will try it.
I think the problem here is that the language we’re using is still very imprecise. I’ve read the slides and above posts as being about the interface, while (if I’m reading your comment right) the benefit of columnar store is for implementation.
I hope we can all be more explicit about this, that we don’t have an implementation problem, but an interface problem.
This is a clear articulation - that our first problem is the interface design problem. As everyone points out - there are many different kinds of problems. I liked @bkamins’ Tables.jl suggestion as a standardization. Doesn’t mean everyone has to switch to rows - can always have two interfaces, I suppose.
If the broad group of people here come up with a concrete design and get it implemented in a couple of the core stats packages - then we can push for ecosystem wide adoption as well.
How common are non-tabular and n-dimensional (n > 2) algorithms in stats?
I’m wondering if we should be moving away from any table concept at all and just talk about observations, targets etc like in ML. ie keeping this very abstract.
Another benefit is that we’re further decoupled from implementation.
Is that too radical/unwieldy for more prosaic use cases?
I can only think of functional data analysis (e.g. the fdakma package on CRAN) as an example, but I don’t really work with exotic frequentist statistics.
I agree that standardizing around Tables.jl would be a good start.
Another issue that I mentioned elsewhere when we discussed this question before: one of the problems blocking standardization of output formats is that we haven’t standardized on a package to represent arrays with dimension names (AxisArrays, NamedArrays, AxisKeys.jl, AxisIndices.jl, DimensionalData.jl…). This is an issue for many functions that could take a Tables.jl table, as in this case the output often needs to include the column (variable) names: correlation matrix (see here for an example), distance matrix, PCA, model coefficients… Luckily there’s a lot of activity in that area, though AFAICT the BoF at JuliaCon didn’t reach a clear conclusion (see here).
What makes this problem particularly hard to solve is that defining a common API isn’t enough (contrary to what happens for input objects). Indeed, to return an object, you need to depend on a particular package, not just on a generic API.
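As a small illustration of why the output side forces that choice (NamedArrays is used here purely as one example of the candidate packages, not as the proposed standard):

using Tables, Statistics, NamedArrays

function named_cor(table)
    names = collect(string.(Tables.columnnames(Tables.columns(table))))
    C = cor(Tables.matrix(table))          # a plain Matrix: the variable names are lost
    return NamedArray(C, (names, names))   # re-attach them, but now we depend on NamedArrays
end

named_cor((x = randn(50), y = randn(50), z = randn(50)))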
I would like to raise a non-technical point here which is nonetheless potentially relevant:
Centrifugal forces: Composability in Julia is hugely decentralizing, as numerous small packages can be combined to achieve complex behavior otherwise only available in R base or Python mega projects like sklearn
In some ways the ‘code’ part of Julia is too easy, which means many wonderful contributors ‘stand up’ a project with great functionality and aren’t required to harness a broader community to get things built. One example here is DataFramesMeta vs DataFrameMacros… people may have different preferences here, but the latter is essentially single-contributor, yet hugely powerful and fun to use, and covers much (most?) of DataFramesMeta’s features. Achilles heel: the last commit to the package was on June 29th, and the docs are hosted in the readme.
A statistical story which Julia (by common consensus) got massively right is missing vs nothing: lots of ideation happened in ‘package space’, but alignment and standardization imho took place within the main language / language-dev community. We don’t have good examples of how similar ‘centripetal forces’ work within package space; package orgs might be a place for this to happen, but package/maintenance standards seem quite disparate within an org, so org membership is not necessarily a ‘quality / consistency indicator’.
tldr: Julia stats achieves surprising levels of coverage given the community size, thanks to technical attributes of the language (and entrepreneurial package devs), but we don’t yet have community norms, processes, and rituals that reproducibly drive convergence and consensus where they are necessary.
contingency tables (where counts and not observations are stored and you need names for all dimensions)
ML applications, where you very often process data in higher dimensions (e.g. a series of images)
Of those two examples, the former is core to many statistical analyses (both as input and as output). The latter can probably be left out as a domain of ML, not statistics.
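For the contingency-table case, here is a tiny example of why names on all dimensions matter (again using NamedArrays only as an illustration; any of the named-array packages mentioned above would do):

using NamedArrays

counts = NamedArray([30 10; 5 55],
                    (["treated", "control"], ["improved", "not improved"]),
                    ("group", "outcome"))

counts["treated", "improved"]   # 30 - counts are indexed by names, not positions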
This is a great point. In Julia it’s very easy to start a new package to solve some subproblem, and have it work well with other parts of the ecosystem. Turing didn’t meet my needs, so in 2015 I started working on Soss. And last year I finally gave up on Distributions and started working on MeasureTheory. The ability to do this sort of thing is a great asset to the language and the community, and I’d hate to see it go away.
If two packages have similar momentum, the one that makes better design choices will generally be adopted and further developed by the community. In many ways, I think momentum is a bad thing (e.g. without it, the world would already have moved on from Python). And standardization can only add to momentum.
The only real downside I can see to letting things naturally evolve (or even actively pushing against momentum) is when there are competing approaches with incompatible and arbitrary design choices.
Very nice! A couple of follow-ups on this:
Before the tidyverse, R was kind of a mess. There were three or four entirely different systems for doing OOP. It was very loose and Perl-like, with no language structure to guide development. This was a case where standardization really made a big difference; IMO it’s exactly that case of “competing approaches with incompatible and arbitrary design choices”, which is a large part of why it was so successful.
R and Python development are fairly well-funded; in particular I assume Hadley and Wes are both making a living doing this stuff. In Julia it’s very different. At least as it seems to me:
A few people are funded to work on Julia itself
Many people are funded and using Julia for R&D, and contribute to Julia and libraries as a way of furthering this
At least some people (e.g., me) are mostly not yet funded, but are actively developing Julia packages as a way to get there.
On this last point, between these three camps there are stark differences in goals and priorities. Hopefully this will stabilize as the ecosystem and community grow.
I’m not so sure. PPL, in particular, blurs the line between stats and ML. I’d expect this line to only get blurrier as time goes on.