At the risk of resuscitating an old discussion…
I was exploring some of the code in Clustering and realised that a convention often used in these packages (and in similar ones like GaussianProcesses) is that if `X` is the design matrix then its columns are the observations ("`p x n` convention"). The choice seems to have been made some time back to align with the fact that Julia is column-major.
Originally I thought that’s fine, I’ll just transpose the matrix and be done with it, but there are a few catches (some of them already discussed on Slack, but summarised here, hopefully for a wider discussion):
- assuming the algorithm takes an `AbstractMatrix` (not always the case but that’s easy to fix), passing a transpose (instead of, say, `copy(transpose(X))`) can incur a non-negligible overhead (see for instance this other thread or the small benchmark in this PR; both show a slowdown of roughly 3-4x),
- DataFrames, which I believe we can assume is now a fairly standard way to consume data, follow the `n x p` convention when converted to a matrix, so even ignoring any overhead, all dataframe users potentially need to call algorithms with transposes, which may feel weird, especially for beginners,
- the convention seems inconsistent across packages as far as I’m aware (?) e.g. StatsBase vs Clustering / MultivariateStats
- other well-known packages like Sklearn use `n x p`, though there’s less of a debate there because they’re row-major.
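
To make the first bullet concrete, here is a minimal sketch of the two ways of handing an `n x p` matrix to a `p x n` algorithm. The matrix and variable names are made up for illustration, and no timings are shown since those depend on the algorithm and the sizes involved.

```julia
using LinearAlgebra

# Illustrative n x p design matrix: 5 observations (rows), 3 features (columns).
X = reshape(collect(1.0:15.0), 5, 3)

# Option 1: lazy transpose. No data is copied; this is a thin wrapper around X,
# so a p x n algorithm ends up walking X's memory with a stride.
Xt_lazy = transpose(X)

# Option 2: materialised transpose. This allocates a new 3 x 5 Matrix whose
# columns (the observations) are contiguous in memory.
Xt_copy = copy(transpose(X))

println(typeof(Xt_lazy))
println(typeof(Xt_copy))
println(Xt_lazy == Xt_copy)   # same values, different memory layout
```

Both objects are `AbstractMatrix`es with the same entries; the difference is only in layout, which is where the overhead mentioned above would come from.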
So one way or another, it seems to me that the current situation can lead to less usability and potentially worse performance.
I’m definitely not the first one to bring this up (and I imagine I won’t be the last with the status quo); for instance here’s a nice open issue for Clustering.jl with an acclaimed suggested transition path from 2017; the discussion there seemed to converge towards “transition towards `n x p` while allowing a `vardim` keyword for the previous behaviour” (like `cov`), but it looks like it stayed at the discussion stage.
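
For reference, a rough sketch of what that kind of keyword looks like from the user side. In current versions of the `Statistics` standard library the keyword on `cov` is called `dims` (the observation dimension) rather than `vardim`. The `feature_means` function below is purely hypothetical, not something from Clustering.jl, and only shows how a package function could default to `n x p` while still accepting `p x n` input.

```julia
using Statistics

# n x p data: 3 observations (rows), 2 features (columns).
X = [1.0 2.0;
     3.0 5.0;
     5.0 9.0]

# cov defaults to dims=1, i.e. observations along the rows (n x p).
C1 = cov(X; dims=1)

# The same data stored p x n, with observations along the columns.
Xt = permutedims(X)
C2 = cov(Xt; dims=2)

# Hypothetical package function following the same pattern:
# `dims` names the dimension along which the observations are laid out.
function feature_means(X::AbstractMatrix; dims::Int=1)
    return vec(mean(X; dims=dims))
end

println(C1 ≈ C2)
println(feature_means(X) ≈ feature_means(Xt; dims=2))
```

With a default like this, dataframe users would not need to transpose anything, and existing `p x n` code would only need to pass the keyword.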
What are people’s thoughts? Should we indeed try to transition throughout the ecosystem to `n x p`? Or should we just let package devs decide?