I have spent a bit of time thinking about this. I don’t yet have any solutions, but there are a couple of things worth thinking about:

The scikit-learn approach of making an observation a real vector has definite drawbacks:

it makes it difficult to link variable names with their coefficients

not all algorithms need categorical variables to be encoded as vectors (e.g. decision trees can be applied directly to categories)

sometimes you need to attach more information to variables, e.g. in generalised additive models, you need a way to say which variables should be fit using splines, in Bayesian models you need priors, etc.

A lot of models involve nested fitting procedures, e.g. grid search over hyper parameters (often controlled via cross validation), with optimisation over internal parameters.

Having a convenient way to express this would be nice.

I wouldn’t advocate removing e.g. the ability to pass a dense matrix to these functions, when that is already optimal and easy-to-use.

I guess what I’m saying is if you’re chasing one generic interface that will work across different problems, data types, models, training algorithms, etc, you need to pin down what that interface is pretty strongly. So while I wouldn’t suggest removing an interface for e.g. matrix-style inputs, I might be inclined to suggest having distinct interfaces (perhaps different generic functions) for nested-style and matrix-style inputs.

This one is an interesting one; we do PCA on 3D point clouds with nested arrays of (static) arrays and the generated code is optimal (and the written code is extremely readable). I’m curious what using something like SplitApplyCombine.splitdimsview to create a nested input out of dense storage would do in terms of performance (I suspect some operations would be Julia loops rather than BLAS code so it may be a bit slower, but hopefully better than Vector{Vector{T}}).

Of course, it would be a dystopian world if users created a matrix, were forced to pass it through splitdimsview, then into the the model, which uses special logic to unwrap the nested view, and so-on, This seems cumbersome for both the user and library writer. So, I’ll admit there may be limitations on the utility of the philosophy I was pushing.

However, I do feel strongly about consistent, predictable and well-documented interfaces. If PCA had two entry points - one for matrices, and one for nested observations… well that would be fine - we just document the former is faster for PCA. I would guess many models could simply transform inputs one way or the other to suite their internal style. But explicit is better than implicit, I feel. Does that make sense?

I really like the idea of a dataset being any iterable over observations. We could also have simple wrapper types (e.g. RowObs(X) and ColObs(X)) for directly passing numeric matrices.

On top of this we could add traits, e.g.

availibility (e.g. streaming or multiple passes available)

extra interfaces (e.g. columnar interfaces, chunked data, etc.)

data indexes (e.g. unordered, sequential or time series, hierarchical)

There is some overlap: in my view an “observation” is up to the model (or some preprocessing step). That is, it could just be scalar (e.g. for fitting a univariate distribution), a vector, a named tuple, etc.

One issue which strikes me as tricky is how to treat dynamic models, where time t observations of some variables are explained using some number of their own lags, along with current and lagged values of other variables. If y(t) is explained as a function of y(t-1)…y(t-p), then one might define x(t)=[y(t-1) … y(t-p)] and then we have an “observation” [y(t), x(t)]. One can put these observations in the rows of an array, but that’s not an efficient use of memory. I don’t know how a general purpose design that avoids duplication of information can take this sort of thing into account.

ML spectator here, but I think this is the kind of data set that would benefit from an iterable of the observations more than a matrix form. One can simply define an iterator that hovers over the vector y and returns a view into it, e.g. @view y[t-1:-1:t-p] as x and y[t] as y. Of course, you can also define a custom matrix type to index directly into the vector, so that’s another arguably Julian approach.

I haven’t been following along closely but wanted to chime in that I’d like to make StatsModels.jl a generally useful solution to this problem of mapping from various input data types to a format consumable by a particular kind of model. I think the re-write of the formula language implementation I described in my juliacon talk is a good start at this (see this PR), and we’ve semi-independently converged on using both information from the source (e.g., type of the data) and the destination (the type of the model to be fit) to determine how to do the transformation.

I say semi-independently because that’s also necessary to know how to handle, for instance, functions that are not built-into the formula DSL: when different model types want to use the same syntax in different ways, you need to know where the data is going.

Another design decision that’s come up in StatsModels (see #32) is where the transformer lives once you’ve fitted a model. Currently we use a “wrapper” type (DataFrameStatisticalModel) to hold the fitted model, the (tabular) data it was fit to, and the formula which describes how to convert a tabular data source into the response and predictor arrays. The original goal was to make it possible for modeling packages to just work with a formula and data in the form of DataFrame, without having to take DataFrames (now StatsModels) as a dependency. GLM is the prime example of this: in that package, fit methods are defined for numerical array response/predictors. But the disadvantage of this approach is that you lose composability for packages that add features to GLM.jl models, (e.g., fancy covariance matrix estimators) unless the packages that provide the extra features depend on StatsModels, and delegate their new methods to DataFrameStatisticalModel objects.

For that reason, we want to get rid of the wrapper DataFrameStatisticalModel types. But then we’re left with a conundrum: either we need to make users hold onto the formula and transform the data themselves before calling fit(GLM...), or GLM.jl needs to take StatsModels as a dependency (maybe not such a big ask if there’s no heavy dependency on DataFrames there), or we need to change the fitting API to allow some mechanism to store and use a generic sort of data transformer on the fitted model. I’m leaning towards the last solution since it seems like the most generic, but it could be a hard sell to get everyone to change the fitting API and to standardize across stats/ML/etc. worlds.

Instead of building in the dataframe assumption to statsmodels, what about using @quinnj s Tables.jl? That would give queryverse and generic table interop.

I should have been clearer but I was referring to the historical state of things; removing the dataframes dependency is work in progress (something that #71 deals with)

My packages have been using StatsModels since it was part of DataFrames some time ago. I am looking forward to the new API. One feature which I asked for a longtime ago, but the current API was too strict to implement was to allow different mappings other than an AbstractVecOrMatrix. For example, for absorbing fixed effects my package does not store a sparse matrix for a aparse solver, but rather keeps it in a Vector{Vector{Vector{Int}}} (dimension, level, indices) which allows me to apply the within estimator very efficiently prior to fitting the model. Other aspects include, keeping track of variables in a similar manner for cluster robust variance covariance estimators or for panel data estimators (between, first-difference, etc.). Another aspect is how to hold multiple linear predictors for multi-stage estimators such as instrumental variables models.

I’m a total Julia newb here coming from a mostly R background, so please forgive me if this is completely unhelpful. The nested data structure mentioned earlier is a really convenient and natural way to manage data that is flexible for many types of data (in R they’re tibbles). However, there are times when the formula approach falls short in connecting tabular data of any form to a fitted model (some R based conversation on that here).

Also for what it’s worth, @juliohm makes a great point about the value in an interface that accommodates algorithms for different data types. A great deal of time is spent trying to figure out how to force imaging data into common linear algebra models while managing the overhead of going back and forth between original dimensions for other transformations. There are some people that would benefit enough from efficient management of this kind of data that they may be willing to drop Matlab even if it’s provided by their university.

@dave.f.kleinschmidt I’ve not been following your work closely either.
(So busy right now, not actually fully up to date on my own thread.)
I am tracking Terms 2 son of Terms (even if I am behind on reading).

This stuff going on in StatsModels, is currently my hope for this, I think.

I hope that we can converge to one package/solution that works happily for models coming out of ML backgrounds, and models coming out of Statistics backgrounds (and models coming out of bioinformatics, and econometrics etc etc.).

Hopefully, I’ll have time to catch up with you / your threads once my thesis is submitted.

Two relevant things that came up on slack: one is Rstudio’s re-write of the basic R modeling APIs as “tidymodels” (for instance, data preprocessing recipes). The second is scikitlearn’s ColumnTransformers (new in 0.20). I haven’t looked closely at either but clearly there’s demand for some kind of sane model fitting API that’s still sufficiently flexible and handles heterogeneous data well.

Another thing that came up on Slack is https://github.com/ablaom/Koala.jl, which appears to be it’s own little (unregistered) ML ecosystem. It has a few interesting ideas, one being that data splitting (e.g. train/test, or cross-validation) is just a wrapper type around the full dataset, along with a binary mask.

One problem I often bump into is to deal with different parameters types. Most algorithms want vectors (your error function should take as input a vector) but writing your model with a vector is often not so nice. Knet allows for dictionaries which is quite convenient, Optim seems to be taking only vectors, and using structs is also very nice (Parameters.jl has some utilities for that).

Unless something like that already exists, that’s a problem that could be tackled independently of other considerations. Conversions between dictionaries, structs and vectors would be a minimum, but maybe there’s a more general interface.

I hope you don’t find this annoying; I just feel that it would benefit the Julia ecosystem quite a bit and it doesn’t seem to require a lot of work but it does seem to require some general consensus how to go about it.