Following a recent discussion on Slack, I’d like to raise this point here on Discourse so that the arguments don’t disappear. Also, the Slack format with its many short messages isn’t the most convenient for longer discussions.
Currently, ML packages tend to require/return opaque matrices of numbers, and even the dimension order is not consistent between packages. See this example by @jling:

A couple of alternatives were discussed, each with its pros and cons:
- Arrays with named dimensions, like `:observation` and `:feature`, instead of unnamed matrices. Such arrays are well supported in Julia with a lightweight, focused package, e.g. NamedDims.jl. This resolves the dimension-order ambiguity, and does so in the most minimal way possible. However, the matrix format is unlikely to be natural for the end user, who wants to apply predictions to their objects. (See the first sketch after this list.)
- Collection-of-collections, such as an (abstract) vector of observations, with each observation an (abstract) vector of features. A less minimal change, but more natural for the user: it’s more likely that one starts with a collection of objects, which is more convenient to transform into a collection of feature collections than into a matrix. Many ML algorithms internally require a matrix of inputs, and converting from an arbitrary `AbstractVector` would need an extra copy. This isn’t a major issue in many use cases, but avoiding copies also seems pretty straightforward when needed: conversion can specialize on `Base.Slices` (the result of `eachslice`) and just return the parent matrix, as in the code at the end of this post.
- Table of objects, with columns representing features. Fundamentally similar to the previous approach: many `AbstractVector`s are tables themselves. Tables are more general in one direction (non-array table types can be used directly), but less general in others: arrays/collections can have a varying number of features for different objects, and they are easier to make lazy/out-of-core. (See the second sketch after this list.)
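To make the first alternative concrete, here is a minimal sketch of named dimensions with NamedDims.jl; the dimension names and data are made up for illustration:

```julia
using NamedDims

# wrap a plain matrix, declaring which dimension is which
X = NamedDimsArray{(:observation, :feature)}(rand(100, 3))

dimnames(X)   # (:observation, :feature)
X[feature=2]  # values of the second feature for all observations
```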
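And a sketch of the table alternative, assuming Tables.jl: a plain vector of NamedTuples already satisfies the row-table interface, and algorithms that need a matrix internally can materialize one with `Tables.matrix`:

```julia
using Tables

# a vector of NamedTuples is itself a Tables.jl-compatible table
objs = [(height = 1.8, weight = 80.0), (height = 1.7, weight = 65.0)]

Tables.istable(objs)  # true
Tables.matrix(objs)   # 2×2 Matrix{Float64}, observations × features
```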
I’m only a user of ML libraries myself, and I find the second alternative above the most natural, and most likely the easiest to make efficient. Basically, for ML algorithms that take a matrix, the adaptation amounts to:
```julia
# stack, ColumnSlices, and RowSlices require Julia 1.9+
fit(A::Matrix{<:Real}) = ... original method ...
fit(A::AbstractVector) = fit(stack(A))
# potentially swap these two, depending on the original algo dimension order:
fit(A::ColumnSlices) = fit(parent(A))
fit(A::RowSlices) = fit(permutedims(parent(A)))
```
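For illustration, here is how this hypothetical `fit` would accept all three input shapes (Julia 1.9+):

```julia
X = rand(3, 100)               # 3 features × 100 observations

fit(X)                         # original matrix method
fit([rand(3) for _ in 1:100])  # generic vector of observations: stack makes one copy
fit(eachcol(X))                # ColumnSlices: no copy, since parent(eachcol(X)) === X
```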
I wonder what others think about these alternatives, especially ML package developers.
Note that this post focuses on classic ML for now; in DL/NNs the situation may (?) be somewhat more complicated.