Taking Fitting Seriously



I’ve been thinking about taking our fitting APIs seriously for a while now.
(But am very busy so have not advanced it too far).

I got to briefly talk to some people at JuliaCon about it,
so I am quickly writing up some thoughts.

So the problem is you have data in some Tableish form.
Where I mean either a Table in the QueryVerse/DataStreams/Tables.jl (i’ve not yet looked closely at what @quinnj and co made yesterday) sense:
So an Iterator of NamedTuples, or a DataFrame or JuliaDB or etc.
or another form; like a Matrix,or a Iterator of tuples.

And you want to feed it to a Fit + Tranform system.
(or a Train + Predict system.)

Now different Fit + Transform systems have from a implementation perspective different expectations of the input.
Some want it to be a matrix with observations in columns; some want it in observations in rows,
Some want iterators of minibatches, etc.

Further they have different expectations from a algorithm perspective.
For example when training an SVM the output need to be in Max Margin form (+1/-1),
when training logistric regession it needs to be in plain form (1/0).
When working with continuous data as input to a neural net wants to be normalized,
but for a decision tree that doesn’t matter.

Or even classical staticical models expect inputs to be in design matrix form, with a column of 1s appended; where as machine learning models traditionally achieve the same thing with an internal bias.

So we have two things going on in this space right now MLLabelUtils.jl (@Evizero) which works for (mostly for) Matrixish data, and JuliaDB/StatsModels schema (@shashi , @dave.f.kleinschmidt ).

Very roughly speaking:
Both of them currently talk as input the data source, from which they attempt to work out the current representation of the data and if it is Ordinal/Categorical/Continous/etc,
and some optional hints as to what the current form, and the desired output form is.
From this they produce an output representation, that they think is what is going to be useful.

However, they do not know what you are going to do with the data, so they can’t do it perfectly – making the user have to do more work to provide hints.
Hints are not too bad when they are to deal with algorithmic details – there will always be scope for applying some expertise driven decisions here; but when it is row-major vs column major type implementation details that bugs me.

I propose an API that take as input the data source,
the hints, and the data sink (e.g. the fitable thing like a classifier being trained).
And we apply traits to data sinks in order to inform about what they want.
These traits are used to help fill in the hints defaults more intelligently.

Here is for example a system that does something along these lines for purposes of solving row major or column major

We need a schema-ish thing, that that take a source, a sink, and some hints,
that knows how to turn matrixish things and tableish things into a form that that sink can use for fitting and transforming.

NB: I am talking in very broad and rough strokes.


For a while now, I’ve been seriously (sorry, I couldn’t resist…) thinking about what “table-ish” data might actually mean. Figuring this out seems to be one of the key aspects of the above.

I will summarize a few thoughts/observations:

  • For “relational-algebra tables” I’ve been taking the standard (Codd/Wikipedia/mathematics) definition of a relation as collection of (named) tuples (or other things that fit a reasonable definition of “row”). So, to me, a table is a nested collection of rows. It’'s actually quite straightforward to implement a complete set of relational algebra with e.g. Vector{NamedTuple{names, ...}} together with map, filter, and so-on (even generators and comprehensions look very nice!). So rather than thinking of a matrix of columns and rows I’d rather think of a nested container (of rows, in the case performing the relational algebra).

  • I see e.g. collections of 3D points represented as a 3xN matrix, especially in MATLAB and Numpy but also in Julia. I always seem to forget if it is 3xN or Nx3 - and maybe I’m just braindead but I doubt this is unique to me. However, I mentally think of a collection of 3D points which in Julia translates roughly to Vector{SVector{3, T}}. It’s a nested container and I can say from experience it’s blaizing fast and a pleasure to work with.

  • In a lot of machine learning I see a matrix structure with features, classes and examples all in the one flat structure. I have a hard time remembering which axis means what, etc. I’d rather mentally model it as a list (Vector) of feature => class, or a list of features and a list of classes, or whatever. In either case the list of features is a nested structure.

I feel there’s an emerging pattern above about “table-ish” data being well represented as a nested container.

For a lot of our mathematical models we have become used to interacting with multidimensional arrays because they are soooo nice in MATLAB/Numpy/Julia/etc. However I would argue that we tend to overuse them slightly in some cases. I think this is partly because (a) most high-level languages aren’t efficient in dealing with nested containers (unlike Julia) and (b) we don’t have as many tools (or experience) for dealing with nested containers as we do with multidimensional arrays.

I’m imagining a situation in the fitting community where

  • It is trivial to wrangle your data into your form which is a collection of examples.
  • The fitting / ML interfaces expect (nested) collections of examples (i.e. features together with answers in the case of training - either together as “rows”, or in two seperate containers), otherwise just features (in the case of evaluation).

Anyway that’s my two cents. I think Jacob’s upcoming Tables.jl interface, tools like SplitApplyCombine.jl, and many other wonderful tools in the ecosystem and Base, will sort out the first point. I’m wondering how the fitting community would react to the second point?


(Just in case it wasn’t clear to anyone, a “flat” 2D container a would be accessed by something like a[i, j] whereas a “nested” 1D container of 1D containers b would be accessed something like b[i][j]. A seemingly small difference in this case but it adds up to a big difference to interfaces and code readability and maintainability)


I like the idea a lot as it would actually more resemble how I think about observations of data. However, it’s not clear to me how this would be as fast or faster than a matrix representation? I know very little of how Julia deals with the two cases you meantion a[1][2] vs a[1,2] but I would love to learn more. Especially in the context of more complex hierarchical models and deep networks.


Great question.

While reaching peak performance is sometimes a challenge (a flat matrix is quite efficient, after all!), we can often get a great amount of “free abstractions” in Julia.

For example, consider SplitApplyCombine.jl's function splitdimsview to see how a matrix would be split up into a vector of vectors with no copies. It’s a bit like view, really.

There’s potentailly a compiler optimization or two that could be done in Julia to make playing with this structure ever-so-slightly-faster (same for view and SubArray), but it should be fast to iterate and access as it is.


That is interessting. I did not know about SplitApplyCombine.jl. Sounds like we have some overlap.

julia> using MLDataPattern

julia> obsview([1. 2 3; 4 5 6])
3-element obsview(::Array{Float64,2}, ObsDim.Last()) with ...
 [1.0, 4.0]
 [2.0, 5.0]
 [3.0, 6.0]

julia> obsview([1. 2 3; 4 5 6], obsdim=:first)
2-element obsview(::Array{Float64,2}, ObsDim.Constant{1}()) with ...
 [1.0, 2.0, 3.0]
 [4.0, 5.0, 6.0]



That’s really great to see!

My mission, if you will, is to make a set of tools like this that can be shared across communities. For the case-in-point, I’m sure you will excuse the fact that coming from another domain I wouldn’t have guessed to look for MLDataPattern.obsview(x, obsdim=...) for this functionality :laughing:

Arguably I’d like to see these kinds of capabilities move to Base or a stdlib so that we can all share the benefits. :wink:


My thoughts are that the iterable of rows (Tableish) is the correct intermediate form for data.
I think that space is close to settled now (vs 2 years ago when it wasn’t)
Feel Fred to keep talking it here too

So what I am really getting at is the next part is to be able to automatically Wrangle it into a form that a Fittable system expects.
So that would mean stacking it back into a (Rows as obs) matrix for LIBSVM etc.

We need to do that wrangling, which I would characterise as fixing for implementation details, as well as things like converting categorical variables to onehot, which I would characterize as fixing for algorithm details.


Who is Fred, and would he be comfortable with me touching him? I want to talk about this here, but I don’t know if I want to feel up a stranger.

I’m not sure I have a great deal to contribute, though I will say that I need interfaces that go back and forth from 2D tables. On the input side, bioinformatics tools are typically pretty dumb, and use various formats that are typically just glorified tsv tables.

On the output side, I typically have to share results in the form of 2d tables, and I’ve often found it a bit confusing or cumbersome to take the results of a fitted model (or more often, many fitted models) and dump them to a spreadsheet. Note - I’ve primarily used GLM.jl and have usually figured it out, though with many posts to slack and lots of trial and error. A more streamlined and well-documented interface would be great (and I’m happy to volunteer writing docs and examples if that’s helpful).


Would this help PCA at all?

One thing about PCA is that it requires each observation to be a vector, rather than features as vectors, and this is a bit of a pain when working with DataFrames. If this interface could have a way of efficiently switching from row to column order depending on the optimizations of the fit, it would be great.


Yes. It will solve that.
Because PCA will have a trait saying

obsdim_input(::Type{PCA}) = ObsDim.First()

Or something to that effect.

Which will cause any input to become a Matrix with observations in Rows and Features in columns, before being feed to the current PCA code
Or the opposite if that is what is needed


I would like to add to the discussion a powerful generalization of ML interfaces that we should seriously consider in Julia on the lines of @andyferris comment.

Because Julia has amazing generics, we should see an observation as an object of any type as opposed to a vector of scalars (a.k.a. features). I don’t have an API in mind at the moment, but to make this statement more concrete, I would like to share an example that I shared on a GitHub issue in a different context.

In Compositional Data Analysis (see CoDa.jl), a composition is a vector of scalars where the entries cannot be manipulated directly as one would do in Euclidean geometry. The problem in manipulating the entries directly in this case is due to the fact that these entries are constrained to add up to one and to be positive (they live in a simplex). Many ML algorithms like PCA or Factor Analysis that depend on correlation calculations would fail or produce incorrect results if they treated a composition as a vector of scalars.

I don’t know if the example was clear, but I tried to make the point that we can be really unique with Julia if we move out from the idea of treating a data set as a set of vectors of features. A data set is really a collection of objects (possibly labeled), and as such we could model it with something more flexible like Vector{T}. As soon as we make the interface generic like this, it is the job of multiple-dispatch to pick the efficient ML implementation depending on the object type T, for example the vector-of-features view T = SVector.


As another example, take Computer Vision. There is always that moment where we have to flatten an image as a vector-of-features, feed that into an ML algorithm, then reshape to the original size. This is just something that we could avoid altogether if we had a more generic interface. In many cases this may not be an issue because Julia supports linear indexing as well, but if you ever wrote a ML code with images as objects, you know what I am talking about. Half of the code is pre- and post-processing the images to pass through the vector-of-features view.


I totally agree with @juliohm.

@oxinabox my point of jumping in here on this thread was more-or-less this: I don’t feel that the full amount of automatic data wrangling you suggest is necessarily where we need to go.

To illustrate why, here are some concrete examples:

  1. A vector of named tuples. Each (inner) named tuple could form an observation.
  2. A named tuple of vectors. Each (inner) vector could form an observation. We have named each observation with a Symbol, but that’s totally valid.
  3. A matrix of channels (i.e. an image of pixels with multiple pieces of data at each pixel). In this case the observations happen to be correlated (we use information from nearby pixels to make a prediction). But if we iterate the matrix we nonetheless get all the channels of a given pixel in each iteration. (Similarly, in the new TypedTables I’m allowing a table to be a matrix of named tuples, not neessarily a vector, as it is a perfectly natural relational operation to take the Cartesian product of two tables and the matrix is the natural Cartesian product of two vectors).

Basically, it’s my opinion that what you feed to the model (for training or for evaluation) should iterate observations. Having the function recieving this data have to guess whether it should turn the structure inside-out or not before proceeding feels like bad software engineering to me (since it’s unpredictable and uncontrollable). To me, it seems perfectly reasonable and expected that we support both 1 and 2 at the same time, for example.

Of course, not every ML model will accept every single type of observation. The automatic data wrangling could happen at the level of the inner structure. That process would look like (at least semantically) iterating over the outer structure and then converting each element into the form of observation that you work with. So you might convert your named tuple into an abstract vector, or whatever. As @juliohm says, your algorithm might expect a particular kind of object (and that’s what’s convert is for, right?). In many cases your algorithm may be agnoistic to the outer structure (images are an exception here).

There are other reasons to expect iterables of observations, such as online processing.

Anyway, sorry for writing so much on this thread, I hope this is more useful than disruptive.


To give a simple example of how this generic interface would be useful. Consider any ML algorithm that can be kernelized. Ex: kPCA, kFA. These algorithms only require the definition of an inner product between the observations and nothing more. In the CoDa example I gave above, I could pass a Vector{Composition} as my data set to this generic kPCA algorithm, and Julia would do its magic by calling the inner product I have defined for my type.

The point here is that we could have very complex data sets, and still perform ML on them directly without all the hassle of first converting them to a tabular or matrix-like format, which is traditional in other more limited programming languages.

I am looking forward to contribute more ideas and code, but I need to catch up with the ML initiatives and organizations in Julia. I know that @Evizero and @oxinabox are doing an amazing job on this front.


Also worth looking at @MikeInnes 's Flux.jl


Love Flux, but it is not relevant here.
Flux, pleasing does almost no data wrangling.
Good separation of concerns.
Also as a deep learning library it is in many ways its own beast – super configurable.
I would say it itself is not a fittable, but you could use it to define many different fitables.

@andyferris I am not sure I understand you. I am keen to hear more
Are you suggesting rather than semiautomaticly data wrangling we should modify every fitable to work with iterables of observations?

That used to be roughly my opinion, but after some discussions with others, (iirc including @MikeInnes and @ChrisRackauckas) I am convinced otherwise.
For performance reasons the current implementation of (say) PCA is optimal.
It takes a Matrix with columns of features.
To make it work with iterables of observations (which I agree is the ideal general purpose form), it needs data wrangling into that matrix.
So why do the data wrangling just for PCA when we can setup a declarative trait-based system to semiautomatically wrangle everything.

When I say semiautomatically,
I basically mean to setup good defaults.
But to allow them to be overwritten.
Thus solves the 2 problems of

  • incorrectly guessing feature type (ordinal vs continuous vs categorical etc) and thus doing wrong transforms
  • specifying own transforms based on expert knowledge/for science


Isn’t this precisely what StatsModels.jl does/aims to do?


No, it is a slight extension on that.


Should ‘Fitting seriously’ take into account CrossValidation? It would be nice to reuse preprocesses. Compute once and share the preprocess to different threads.

In any case I don’t see why the API should change depending on how the model iterates over examples. If the model uses stochastic SGD or full batch can be transparent to the user. Specially if some fancy way of encapsulating the data is developed, for cases where the data does not fit into ram. This would allow us to lazyload(X) and then fit(X,Y). This is what keras does with hdf5 for example.

It’s an interesting discussion that I expect it to be as long as taking transposes seriously :joy:.

At some point it would be hugely beneficial to have an entity such as JuliaML to encapsule all this so we have a coherent project involving all the tipical steps needed to apply machine learning in practise. I fell scikit learn is a great project to look upon.