I’ve been thinking about taking our fitting APIs seriously for a while now.
(But I am very busy, so I have not advanced it very far.)
I got to briefly talk to some people at JuliaCon about it,
so I am quickly writing up some thoughts.
So the problem is: you have data in some tableish form.
By that I mean either a table in the QueryVerse/DataStreams/Tables.jl sense (I've not yet looked closely at what @quinnj and co made yesterday):
so an iterator of NamedTuples, or a DataFrame, or JuliaDB, etc.;
or another form, like a Matrix or an iterator of Tuples.
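For concreteness, here is one tiny dataset in a few of those forms (plain base Julia; the names and values are made up):

```julia
# A vector of NamedTuples (the "iterator of NamedTuples" case).
tbl = [(x1 = 1.0, x2 = 2.0, y = 0),
       (x1 = 3.0, x2 = 4.0, y = 1),
       (x1 = 5.0, x2 = 6.0, y = 1)]

# The same data as a Matrix.
X = [1.0 2.0 0.0;
     3.0 4.0 1.0;
     5.0 6.0 1.0]

# And as an iterator of Tuples.
rowtuples = (Tuple(r) for r in eachrow(X))
```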
And you want to feed it to a Fit + Transform system
(or a Train + Predict system).
Now, different Fit + Transform systems have, from an implementation perspective, different expectations of the input.
Some want a matrix with observations in columns; some want observations in rows;
some want iterators of minibatches; etc.
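To illustrate what I mean by those differing implementation expectations (base Julia, nothing package-specific):

```julia
X_rows = rand(100, 3)          # 100 observations, one per row
X_cols = permutedims(X_rows)   # 3×100: the same data, one observation per column

# An iterator of minibatches of (up to) 10 observations each.
minibatches = Iterators.partition(eachcol(X_cols), 10)
```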
Further, they have different expectations from an algorithmic perspective.
For example, when training an SVM the targets need to be in max-margin form (+1/-1),
while when training logistic regression they need to be in plain form (1/0).
Continuous data as input to a neural net wants to be normalized,
but for a decision tree that doesn't matter.
Or again: classical statistical models expect inputs in design-matrix form, with a column of 1s appended, whereas machine learning models traditionally achieve the same thing with an internal bias.
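For instance (using MLLabelUtils, which I get to below; the exact calls are from my memory of its README, so treat them as approximate):

```julia
using MLLabelUtils

y = [1, 0, 0, 1]
y_margin  = convertlabel(LabelEnc.MarginBased, y)  # +1/-1, as an SVM wants
y_zeroone = convertlabel(LabelEnc.ZeroOne, y)      # 1/0, as logistic regression wants

# Design-matrix form for a classical statistical model: append a column of 1s.
X = rand(4, 2)
X_design = [ones(size(X, 1)) X]
```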
So we have two things going on in this space right now: MLLabelUtils.jl (@Evizero), which works (mostly) for matrixish data, and the JuliaDB/StatsModels schema (@shashi, @dave.f.kleinschmidt).
Very roughly speaking:
Both of them currently take as input the data source, from which they attempt to work out the current representation of the data and whether it is ordinal/categorical/continuous/etc.,
plus some optional hints as to what the current form is and what the desired output form is.
From this they produce an output representation that they think is going to be useful.
However, they do not know what you are going to do with the data, so they can't do this perfectly, which makes the user do more work providing hints.
Hints are not too bad when they deal with algorithmic details; there will always be scope for applying expertise-driven decisions there. But when it is row-major vs column-major implementation details, that bugs me.
I propose an API that takes as input the data source,
the hints, and the data sink (e.g. the fittable thing, like a classifier being trained).
We then apply traits to data sinks to declare what they want.
These traits are used to fill in the hint defaults more intelligently.
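To make that concrete, here is a minimal sketch of what I have in mind, just for the layout question. Every name in it (`ObsLayout`, `obslayout`, `prepare`, `MySVM`) is hypothetical:

```julia
# A sketch only; all names are hypothetical.
abstract type ObsLayout end
struct ObsInRows    <: ObsLayout end
struct ObsInColumns <: ObsLayout end

# The trait: a data sink declares how it wants observations laid out.
obslayout(::Any) = ObsInRows()     # fallback default

struct MySVM end                   # a hypothetical fittable sink
obslayout(::MySVM) = ObsInColumns()

# The schema-ish entry point: source + sink (+ hints) in, fit-ready data out.
function prepare(source::AbstractMatrix, sink; hints...)
    # A real system would also consult traits for label encoding,
    # design-matrix form, minibatching, etc.; here we only fix the layout.
    obslayout(sink) isa ObsInColumns ? permutedims(source) : source
end

X = rand(100, 3)            # observations in rows
Xsvm = prepare(X, MySVM())  # 3×100: what the SVM wants
```

The point being: the user never has to say which layout the SVM wants; the sink's trait answers that, leaving the hints for genuinely algorithmic choices.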
Here, for example, is a system that does something along these lines for the purpose of solving row-major vs column-major.
tl;dr
We need a schema-ish thing that takes a source, a sink, and some hints,
and that knows how to turn matrixish things and tableish things into a form that the sink can use for fitting and transforming.
NB: I am talking in very broad and rough strokes.