I was exploring some of the code in JuliaStats and Clustering and realised that a convention often used in these packages (and in similar ones like GaussianProcesses) is that if X is the design matrix then its columns are the observations (the "p x n" convention). The choice seems to have been made some time ago to align with the fact that Julia arrays are column-major.
Originally I thought that's fine, I'll just transpose the matrix and be done with it, but there are a few catches (some of them already discussed on Slack, summarised here in the hope of a wider discussion):
- assuming the algorithm takes an AbstractMatrix (not always the case, but that's easy to fix), passing a lazy transpose (instead of, say, copy(transpose(X))) can incur a non-negligible overhead (see for instance this other thread or the small benchmark in this PR, both a factor of ~3-4 slower); see the sketch after this list,
- DataFrames, which I believe we can now assume is a fairly standard way to consume data, uses the n x p convention when converted to a Matrix, so even ignoring any overhead, all DataFrame users potentially need to call algorithms on transposes, which may feel weird, especially for beginners,
- the convention seems inconsistent across packages as far as I'm aware (?), e.g. StatsBase vs Clustering / MultivariateStats,
- other well-known packages like Sklearn use n x p, though there's less of a debate there because NumPy arrays are row-major by default.
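To illustrate the first two points, here is a minimal sketch (exact timings will vary with sizes and versions; the ~3-4x figure above comes from the linked thread/PR, not from this snippet):

```julia
using DataFrames, Distances, BenchmarkTools

df = DataFrame(rand(2_000, 10), :auto)
X  = Matrix(df)              # n x p: DataFrames converts with observations as rows
Xt = transpose(X)            # lazy Transpose wrapper (p x n "view")
Xc = copy(transpose(X))      # materialized p x n matrix

# Same computation, but the lazy transpose forces non-contiguous column accesses:
@btime pairwise(Euclidean(), $Xt, dims=2);
@btime pairwise(Euclidean(), $Xc, dims=2);
```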
So one way or another, it seems to me that the current situation hurts usability and potentially performance.
I'm definitely not the first one to bring this up (and I imagine I won't be the last, given the status quo); for instance, here's a nice open issue for Clustering.jl from 2017 with a well-received suggested transition path; the discussion there seemed to converge towards "transition towards n x p while allowing a vardim keyword for the previous behaviour" (like cov), but it looks like it stayed at the discussion stage.
What are people's thoughts? Should we indeed try to transition the ecosystem to n x p, or should we just let package devs decide?
Re-posting my comment from Clustering.jl#136: I think the best solution to this is to add a kmeans method taking a table (in the sense of Tables.jl) instead of a matrix, which would internally collect the data in the right format. That would work for DataFrame without taking a dependency on DataFrames.jl.
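Something along these lines, perhaps (a minimal sketch, not the actual Clustering.jl API; kmeans_table is a hypothetical name):

```julia
using Tables, Clustering

# Hypothetical wrapper: accept any Tables.jl-compatible source and convert internally.
function kmeans_table(table, k::Integer; kwargs...)
    Tables.istable(table) || throw(ArgumentError("expected a Tables.jl-compatible table"))
    # Tables.matrix returns n x p (observations as rows); permutedims
    # materializes the p x n layout that kmeans expects.
    X = permutedims(Tables.matrix(table))
    return kmeans(X, k; kwargs...)
end
```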
What function in StatsBase do you have in mind in particular? I think one problem is that the most efficient format depends on the operation: it’s more efficient to use variables as columns to compute the covariance matrix (p × p), but the reverse is true for the distance matrix (n × n).
Adding a vardim keyword everywhere doesn’t seem to be too controversial (as you note), and it has the advantage of making it very clear in every docstring that people need to think about the dimensions.
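For illustration, Statistics.cov already follows roughly this pattern with its dims keyword (the older vardim argument played the same role), something like:

```julia
using Statistics

X = rand(100, 3)               # n x p: observations as rows
cov(X; dims=1)                 # 3 x 3 covariance, variables as columns
cov(permutedims(X); dims=2)    # same result from a p x n layout
```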
Apologies, I hadn't seen your comment on the issue before writing this post.
Working with tables definitely seems like a nice and more general approach, though does that mean that, internally, packages would end up having to make a copy of the data in their preferred format? I guess that would be fine in most settings, but what if the data is very large? (I guess people could use OnlineStats… in some cases.)
It's up to packages to do whatever is most appropriate. If the operation is efficient with variables as columns (like for cov), they can use the streaming interface to avoid a copy. But computing distances requires accessing each observation several times, so it's worth making one copy with observations as columns so that later accesses are fast. An argument could allow choosing between these behaviors.
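To make that concrete, here is a rough sketch of the two strategies, assuming a Tables.jl-compatible source tbl with numeric columns (the function names are illustrative, not an existing API):

```julia
using Tables, Statistics, Distances

# Column-streaming: each variable is visited once, so no copy is needed.
function column_means(tbl)
    cols = Tables.columns(tbl)
    return [mean(Tables.getcolumn(cols, nm)) for nm in Tables.columnnames(cols)]
end

# Distances revisit every observation many times, so it pays to materialize
# a p x n matrix once (observations as contiguous columns) and reuse it.
function observation_distances(tbl)
    X = permutedims(Tables.matrix(tbl))   # one copy into the p x n layout
    return pairwise(Euclidean(), X, dims=2)
end
```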