I’d like to see certain enhancements to the tabular data ecosystem, and this post is an appeal to maintainers of table-providing packages (e.g., DataFrames) to say what form they would like these to take, or to give other feedback.
A fundamental operation in machine learning is the extraction of some arbitrary subset of observations from the training set (resampling). Providing efficient random access to observations, wherever possible, is therefore crucial.
The Tables.jl interface is very widely used in Julia data science. Unfortunately, it only provides iteration over the rows (i.e., observations) of a table. So all ML tooling designed to work with arbitrary tables implementing the interface is currently stuck with generally inefficient resampling implementations, even for in-memory data. I am aware of several unfortunate workarounds for this issue, not limited to my own contributions to them!
There have been several requests at Tables.jl to add random access methods to the API (with the obvious slow fallbacks) but these have not met with success, as the maintainers understandably wish to limit the scope of the project.
On the other hand, the older LearnBase.jl project provides a well-thought-out interface for data containers supporting random access to observations (more general than tables, e.g., a collection of image files). MLDataPattern.jl builds on top of that to provide a lot of functionality for resampling in ML (e.g., stratified CV). The very nice package DataLoaders.jl also builds on the LearnBase.jl interface to manage data that does not fit into memory. DataLoaders.jl is widely used by the deep learning community (Flux users, FastAI, etc.).
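To make the data-container idea concrete, here is a minimal sketch in plain Julia. The functions `nobs` and `getobs` are local stand-ins modeled on that two-function style rather than imports from LearnBase.jl, and `FileCollection` and `holdout` are hypothetical names invented for this illustration.

```julia
# Local stand-ins modeled on the container style; nothing is imported here.
struct FileCollection            # hypothetical container, e.g. paths to image files
    paths::Vector{String}
end

# Number of observations in the container.
nobs(data::FileCollection) = length(data.paths)

# A real container would load the file here; this sketch just returns the path.
getobs(data::FileCollection, i::Integer) = data.paths[i]
getobs(data::FileCollection, idx::AbstractVector) = [getobs(data, i) for i in idx]

# Resampling code needs only those two functions: a simple holdout split.
function holdout(data, fraction)
    n = nobs(data)
    ntrain = floor(Int, fraction * n)
    return getobs(data, 1:ntrain), getobs(data, ntrain+1:n)
end

files = FileCollection(["img$i.png" for i in 1:10])
train, test = holdout(files, 0.5)
```

The point of the sketch is that `holdout` never touches the container's internals, so the same resampling code would apply to any container that defines the two functions.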
Efforts by some in the ML community are underway to reorganize and revitalize the LearnBase API. If this interface could play well with tables, it could help to unify disparate efforts in the Julia stats/ML community.
While including tables in the above efforts might be possible, I expect a better option is to extend the Tables.jl interface in a new standalone, lightweight package providing extra methods for row (and other) random access. The idea is that existing tables able to offer better-than-iteration random access would implement the new methods natively. Would table providers be prepared to get behind such an effort? How should the API look to get maximum buy-in? Do people have other ideas for achieving the same goals?
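As one possible starting point for discussion, here is a rough sketch of the "slow fallback plus native method" pattern in plain Julia. The name `getrows` is hypothetical and is not part of Tables.jl or any existing package; a generator of named tuples stands in for an iteration-only table, and a vector of named tuples stands in for a table with native random access.

```julia
# Hypothetical API sketch; `getrows` does not exist in Tables.jl.

# Slow fallback: assumes only that `rows` can be iterated.
# It walks the whole table once, whatever the number of requested indices.
function getrows(rows, indices)
    wanted = Set(indices)
    found = Dict{Int,Any}()
    for (i, row) in enumerate(rows)
        i in wanted && (found[i] = row)
    end
    # Throws a KeyError if an index exceeds the number of rows.
    return [found[i] for i in indices]
end

# Native method: a table that stores its rows in a vector can index directly.
getrows(rows::AbstractVector, indices) = rows[indices]

rowtable = [(x = i, y = 10i) for i in 1:5]   # supports random access
lazyrows = (row for row in rowtable)         # stands in for an iteration-only table

slow = getrows(lazyrows, [4, 2])   # dispatches to the fallback
fast = getrows(rowtable, [4, 2])   # dispatches to the native method
```

Both calls return the same rows in the requested order; only the cost differs. A real design would also need to settle the return type, the handling of out-of-range indices, and whether column access should be covered by the same package.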