Best input/output format for Julia ML packages

Following a recent discussion on Slack, I’d like to raise this point here on Discourse so the arguments don’t disappear. Also, the Slack format of many short messages isn’t the most convenient for longer discussions.

Currently, ML packages tend to require/return opaque matrices of numbers, and even the dimension order is not consistent. See this example by @jling:

A couple alternatives were discussed, each with pros and cons:

  • Arrays with named dimensions, e.g. :observation and :feature, instead of unnamed matrices. Such arrays are well supported in Julia by a lightweight, focused package, e.g. NamedDims.jl.

    This resolves the dimension ambiguity, and does so in the most minimal way possible. However, the matrix format is unlikely to be natural for the end user, who wants to apply predictions to their own objects. (See the NamedDims sketch after this list.)

  • Collection-of-collections, such as an (abstract)vector of observations, with each observation an (abstract)vector of features.

    A less minimal change, but more natural for the user. It’s more likely that the user starts out with a collection of objects, and that is more convenient to transform into a collection of feature collections than into a matrix.

    Many ML algorithms internally require a matrix of inputs, and converting from an arbitrary AbstractVector would need an extra copy. This isn’t a major issue in many use cases, and avoiding the copy also seems pretty straightforward when needed: the conversion can specialize on Base.Slices (the result of eachslice) and just return the parent matrix.

  • Table of objects, with columns representing features.

    Fundamentally similar to the previous approach: many AbstractVectors are tables themselves. Tables are more general in one direction (non-array table types can be used directly), but less general in others: arrays/collections can have a varying number of features for different objects, and they are easier to make lazy/out-of-core.
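
For illustration, here is a minimal sketch of the first alternative using NamedDims.jl (the data and the nfeatures helper are made up for the example):

using NamedDims

# 3 observations × 2 features, with explicitly named dimensions
X = NamedDimsArray{(:observation, :feature)}([1.0 2.0; 3.0 4.0; 5.0 6.0])

dimnames(X)                # (:observation, :feature)
size(X, dim(X, :feature))  # 2, looked up by name rather than by position

# an algorithm can then ask for dimensions by name instead of assuming an order:
nfeatures(A) = size(A, dim(A, :feature))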

I’m only a user of ML libraries myself, and find the second alternative above the most natural, and likely the easiest to make efficient. Basically, for ML algorithms that take a matrix, the adaptation amounts to:

fit(A::Matrix{<:Real}) = ... original method ...
fit(A::AbstractVector) = fit(stack(A))  # generic collection: one copy to assemble the matrix
# potentially swap the next two, depending on the original algorithm's dimension order;
# ColumnSlices/RowSlices are what eachcol/eachrow return, so the parent matrix is reused without copying:
fit(A::ColumnSlices) = fit(parent(A))
fit(A::RowSlices) = fit(permutedims(parent(A)))
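
For example, assuming fit is defined as above and the algorithm expects a features × observations matrix (that order is an assumption here), all of these end up in the same underlying method, with copies avoided where possible:

X = rand(3, 100)               # 3 features × 100 observations

fit(X)                         # matrix directly
fit(eachcol(X))                # ColumnSlices: no copy, just unwraps the parent matrix
fit([rand(3) for _ in 1:100])  # generic vector of feature vectors: one copy via stack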

I wonder what others think about these alternatives, especially ML package developers.

Note that this post focuses on classic ML for now; in DL/NNs the situation may (?) be somewhat more complicated.

I think MLJ has a good system for this, standardizing on tables and using ScientificTypes.jl: https://alan-turing-institute.github.io/MLJ.jl/dev/getting_started/#Data-containers-and-scientific-types
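
For reference, a minimal sketch of what that looks like (the columns here are made up; schema and coerce come from the MLJ/ScientificTypes interface):

using MLJ, DataFrames

# a table is the standard data container in MLJ
X = DataFrame(height = [1.80, 1.65, 1.92], group = ["a", "b", "a"])

schema(X)                            # shows element types and scientific types per column
X = coerce(X, :group => Multiclass)  # declare how the column should be interpreted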

except it’s not composable with the data cleaning pipeline other libraries use.

which means you have to do everything in MLJ, and if some backend has performance problems (with or without GPUs), you then have to re-write the whole data cleaning pipeline in a different way when you switch away from MLJ

and for example MLJ doesn’t have XGBoost (it has EvoTree, but EvoTree is not XGBoost and EvoTree can’t do incremental training)

It definitely does have XGBoost: there’s MLJXGBoostInterface, which is basically all I’ve been using from MLJ. I haven’t run into composability issues myself, so I’m not really sure what you mean.


data cleaning from the Flux ecosystem doesn’t work with MLJ and vice versa, so if, say, MLJFlux has a performance problem (it does), you have to rewrite literally everything in non-MLJ again

Can’t you do something like this?

data = ...get data...
my_clean!(data)
data = rearrange_data_so_MLJ_is_happy(data) # maybe needed?
my_model = ...create MLJ machine with `data`...
MLJ.fit!(my_model)

Or is this what you mean by “you have to rewrite literally everything in non-MLJ again”?

MLJ uses different packages and functions for data splitting and labeling and augmentation; I can show an example later, no promises though :smiley:


Tables probably work fine for simple cases, but they seem more difficult to generalize: varying number of features, multidimensional datasets, online processing…
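
A purely illustrative example of the first point: a collection-of-collections can be ragged, while a matrix (or a fixed-column table) cannot:

# observations with differing numbers of features — fine as a vector of vectors
obs = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# but this cannot be stacked into a matrix:
# stack(obs)  # throws DimensionMismatch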


I think that might be true, but you can also pick and choose what to use from it and what not to. E.g. I am using my own cross-validation code even though MLJ has some (because I want to get the predictions back per fold to do some more evaluation on them). To me it doesn’t seem like there’s a fundamental mismatch, except to the extent that you need to convert things to their API and back to whatever form you want to use.
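
Not MLJ-specific, but as a sketch of the “predictions per fold” part, with fit and predict standing in for whatever model API is actually used (both are placeholders here):

function cv_predictions(X, y; nfolds = 5)
    n = length(y)
    folds = [collect(i:nfolds:n) for i in 1:nfolds]  # simple interleaved folds
    preds = Vector{Union{Missing, eltype(y)}}(missing, n)
    for test in folds
        train = setdiff(1:n, test)
        model = fit(X[train, :], y[train])        # placeholder; assumes observations as rows
        preds[test] = predict(model, X[test, :])
    end
    return preds  # out-of-fold prediction for every observation
end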
