Random access to rows of a table

I’d like to see certain enhancements to the tabular data ecosystem, and this post is an appeal to maintainers of table-providing packages (eg, DataFrames) to articulate the form they would like these to take, or to give other feedback.

A fundamental operation in machine learning is the extraction of some arbitrary subset of observations in the training set (resampling). Providing efficient random access to observations, wherever possible, is therefore crucial.

The Tables.jl interface is very widely used in Julia data science. Unfortunately, it only provides iteration over rows (= observations) of a table. So all ML tooling designed to work with arbitrary tables implementing the interface is currently stuck with generally inefficient resampling implementations, even for in-memory data. I am aware of several unfortunate workarounds to this issue, not limited to my own contributions to them!

There have been several requests at Tables.jl to add random access methods to the API (with the obvious slow fallbacks) but these have not met with success, as the maintainers understandably wish to limit the scope of the project.

On the other hand, the older LearnBase.jl project provides a well-thought-out interface for data containers supporting random access to observations (more general than tables, eg, a collection of image files). MLDataPattern.jl built on top of that to provide a lot of functionality for resampling in ML (eg, stratified CV). The very nice package DataLoaders.jl also builds on the LearnBase.jl interface to manage data that does not fit into memory. DataLoaders.jl is widely used by the deep learning community (Flux users, FastAI, etc).

Efforts by some in the ML community are underway to re-organize and re-vitalize the LearnBase API. If this interface could play well with tables, it could help to unify disparate efforts in the Julia stats/ML community.

While including tables in the above efforts might be possible, I expect a better option is to extend the Tables.jl interface in a new standalone, lightweight package providing extra methods for random access to rows (and possibly other objects). The idea is that existing tables with better-than-iteration random access implement the new methods natively. Would table providers be prepared to get behind such an effort? How should the API look to get maximum buy-in? Do people have other ideas for achieving the same goals?

@darsnack @samuel_okon

10 Likes

Sorry, I am not sure if I understand correctly: each column of a table is a Vector, and those support random access. Do you want a convenience function that accesses the same index in all columns and returns the result?

1 Like

I want efficient row-slicing: given a table X and an abstract vector of indices I, chosen from 1:nrows(X), I want a new table X2 such that the ith row of X2 is the I[i]th row of X. I don’t require that X2 have the same type as X, but iterating over the rows of X2 should generate objects of the same type as iterating over those of X (iteration in the sense of Tables.jl). Or something close to that.

Basically, I want an interface for tables which puts them into the “data container” framework described in great detail here, where for tables “observation” = “row”.
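To make the request concrete, here’s a minimal sketch of such an operation for the two common in-memory representations (the name getrows is purely illustrative, not an existing API):

```julia
# Hypothetical sketch -- `getrows` is an illustrative name, not a real API.
# Column table: a named tuple of vectors. Slicing indexes each column.
getrows(X::NamedTuple, I::AbstractVector{<:Integer}) = map(col -> col[I], X)

# Row table: a vector of named tuples. Slicing indexes the vector itself.
getrows(X::AbstractVector{<:NamedTuple}, I::AbstractVector{<:Integer}) = X[I]

coltab = (a = [10, 20, 30, 40], b = ["w", "x", "y", "z"])
rowtab = [(a = 10, b = "w"), (a = 20, b = "x"), (a = 30, b = "y")]

getrows(coltab, [3, 1])  # (a = [30, 10], b = ["y", "w"])
getrows(rowtab, [2, 2])  # [(a = 20, b = "x"), (a = 20, b = "x")]
```

In both cases the result is itself a valid table of the same flavour, and nothing beyond the selected rows gets materialized in the row-table case.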

3 Likes

And I don’t want to have to materialise entire columns for tables for which that is costly, eg, so-called row tables, which are vectors of named tuples.

2 Likes

I’m probably misunderstanding one of the constraints

julia> df = DataFrame(a=repeat(1:300, 1000), b=repeat(1:6, 50000))
julia> selection = rand(size(df,1)) .> 0.2
300000-element BitVector:
...
julia> @time subset(df,  :a=>(a)->selection)
  0.047807 seconds (33.14 k allocations: 7.502 MiB, 90.23% compilation time)
240106×2 DataFrame

julia> @benchmark subset(df,  :a=>(a)->selection; view=true)
BenchmarkTools.Trial: 3950 samples with 1 evaluation.
 Range (min … max):  941.200 μs …   7.278 ms  ┊ GC (min … max):  0.00% … 80.31%
 Time  (median):       1.029 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.259 ms ± 978.274 μs  ┊ GC (mean ± σ):  14.66% ± 15.18%

  █▇▅▃▁                                                     ▁   ▁
  ███████▇▆▇▃▄▅▅▁▁▁▄▃▁▃▄▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▁▅▇███ █
  941 μs        Histogram: log(frequency) by time       6.13 ms <

 Memory estimate: 1.88 MiB, allocs estimate: 158.

subset is a function from DataFrames for DataFrames. Not all Tables-compatible types that support random access are DataFrames, and I think most consumers of tables (such as MLUtils) would prefer not to take a dep on DataFrames.jl.

1 Like

I’m not sure it has what you want, but maybe this could serve as inspiration for a row-based alternative?

I needed to do just this. Simply create a vector of the row indices of the table as rowidx. Before each pass of training just shuffle!() it. Very fast. Then write your training loop for i in rowidx.

using Random  # for shuffle!

X = rand(10_000, 9) # training data--fake because it wouldn't be rand; one observation per row
rowidx = collect(1:size(X, 1))
shuffle!(rowidx)
for i in rowidx
     # do the training
end

This pseudo-code is not a complete representation of training, obviously, but it provides a fast way to reorder or sample the training matrix. Because the entire vector rowidx is shuffled, you can sample with a subset of it: for i in rowidx[1:1_000]. Thanks to the shuffle, this is not simply the first 1000 rows of X, though some rows might reappear across passes because of randomness.

Would this work for you?

I think the package that is closest to what you are looking for is AxisArrays.jl. It is very fast at getting random slices of data. The biggest way it doesn’t fit is that it is not Tables-compatible.

I think @ablaom is looking for a more general interface rather than a specific type

Related work: added TableDataset by manikyabard · Pull Request #26 · FluxML/FastAI.jl · GitHub

2 Likes

Yes, I’m asking about a general interface, not a specific table type.

The Tables.jl API allows one to iterate over the rows of any table type on this list using the same syntax. I’m looking for the same functionality for row slicing. For any table, I want to row slice using a common syntax. For specific table types that buy into the interface, slicing will be as efficient as possible for that type.

Thanks to all for the comments so far.

6 Likes

Reference to previous discussions in Tables.jl: Random access to rows for some in-memory sources · Issue #48 · JuliaData/Tables.jl · GitHub, Is there a Tables.jl equivalent of `DataFrames.jl`'s `view(df, rows, :)` · Issue #123 · JuliaData/Tables.jl · GitHub.

What kind of performance would you expect from such a function? For example, random access of rows in a DataFrame is relatively fast, but if that’s a time-critical part of your algorithm, you’d probably be better off converting the table to a named tuple of columns or to a matrix. So what the API should look like depends on the use case.

IMO this kind of thing would deserve living in Tables.jl even if only a subset of table types can support it. I don’t think @quinnj was totally opposed to that when it was discussed, but nobody has proposed a concrete API for inclusion in Tables.jl so far.

Thanks @nalimilan for your valued input, and for linking in the Tables discussion. My apologies for misrepresenting the position of Tables.jl. I am remembering (or misremembering :flushed: ) a different discussion. I agree an extension within Tables.jl would be ideal.

I want the best performance the particular table I am using has to offer, assuming the provider of the table type has implemented the new interface, and otherwise, it should just work.

I don’t understand “So what the API should look like depends on the use case.” Could you elaborate?

I’m not sure if this is what you are getting at, but the reasonable observation that models should just convert tables to the optimal representation for their purposes does not preclude the usefulness of the proposal. For one thing, an operation like resampling might occur external to the model, where changing the representation of the data may be premature. The operation of splitting a table into test observations and training observations is a simple example. I don’t want to be forced at this point of the work-flow to make a decision about the best representation of the data. Indeed, I may be comparing multiple models which have contradictory optimal requirements (KNN wants a matrix where observations are columns, while RandomForest is better if observations are rows). The ultimate choice of representation, and when to change it, could be a complex one, but one useful option is “give me a slice of observations in my table, but keep the representation unchanged if possible” and do it as fast as you can, please.

It’s true that models can be made to take more responsibility for resampling. There is an attempt to do this in MLJ. But I haven’t seen this more generally.

One final related thought: I don’t have benchmarks to prove it, but I’m guessing there are cases (eg, ensemble bagging) where the overhead of poor resampling gets close to the cost of training the model, because that model is very simple (eg, a shallow decision tree).

2 Likes

maybe MLJ can use the Tables.jl interface?

and hope the data source (I have a lazy data source now, actually…) implements efficient partitioning?

Notice that some sources simply prohibit efficient random access, for example when columns are stored in small chunks on disk, individually compressed (like Apache Parquet).

For example, for a DataFrame, if you just want to split the data in a few subsets, this can be done efficiently by indexing the DataFrame object with a vector of indices; so DataFrames could implement that generic API by simply calling its getindex method. Even for other table types which are not as suited to random access as DataFrame (like SQL tables), indexing with a vector could probably be implemented with an acceptable performance if you only take a few subsets.

By contrast, if what you want is to index single rows repeatedly in a performance-critical part of the code, then maybe you’d be better off working on a named tuple of columns (or a matrix) than on a DataFrame; the API would have to include e.g. a fastindexingtable function that you would call to get that named tuple of columns before performing repeated indexing. Likewise, for types such as SQL tables, the only solution would be to copy everything into memory first.

From your description, it seems you care about the first case? Indeed it seems the simplest to support.
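For the repeated single-row case, the idea might look something like this (getrow is an illustrative helper, not an existing API; the real fast path would depend on the table type):

```julia
# Sketch: repeated single-row access against a named tuple of columns.
# `getrow` is an illustrative name, not part of any existing API.
cols = (a = collect(1:1_000), b = collect(0.5:1.0:999.5))

# Extract row i as a named tuple -- cheap, since no columns are copied.
getrow(cols, i::Integer) = map(col -> col[i], cols)

getrow(cols, 42)  # (a = 42, b = 41.5)
```

The point is that once the data is in this columnar form, each row lookup is just one scalar index per column, which is hard to beat in a hot loop.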

I think that the request makes sense, but I would like to better understand the problem.

My current understanding is:

  • Tables.jl provides the Tables.rows method, which guarantees a row iterator;
  • Base Julia defines AbstractVector, which provides a random access API.

Therefore my current workflow is the following if I need random access to rows:

  1. get Tables.rows object from a table;
  2. check if it is an AbstractVector; if it is then we are done;
  3. if it is not, it means that the source table explicitly opts out of providing random access (for whatever reason); then I need to collect what Tables.rows returns to get an object supporting random access.

For example if your table is a data frame then by using Tables.rows you get an AbstractVector.
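The three steps above can be sketched generically (no Tables.jl dependency here; ensure_random_access is an illustrative name, and a generator stands in for a row iterator that opts out of random access):

```julia
# Steps 1-2: if the row iterator is already an AbstractVector, use it as is.
ensure_random_access(rows::AbstractVector) = rows
# Step 3: otherwise collect it into a Vector, which supports random access.
ensure_random_access(rows) = collect(rows)

# A lazy generator stands in for Tables.rows of a non-random-access source.
lazyrows = ((a = i, b = i^2) for i in 1:5)
rows = ensure_random_access(lazyrows)
rows[3]  # (a = 3, b = 9)
```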

I assume you would want to improve over this workflow, but could you please specify concretely how?
(in particular note that we have to respect the fact that some table types might explicitly opt out of providing random access, and then collect or a similar option would be required anyway)

Having said that, we might review current implementations of the Tables.rows method to make sure it returns an AbstractVector in cases where it makes sense to do so.

3 Likes

That is a super helpful comment @bkamins, thanks!

I sympathize with the issues that @ablaom raised here, especially the one where consumers of the Tables.jl API need to be very explicit about their intentions and convert between different table types too early in the middle of pipelines. It would be nice if we had a set of traits to determine when and how to do things, perhaps even higher-level functions that rely on the current low-level API.

On the other hand, I also sympathize with the idea that some table types aren’t suited for random access. Perhaps we could make this explicit with a trait function? Something that could help consumers do the best they can with a given table not implemented by them. Documentation is one part of the story, but I guess we need something more programmatic to more easily get top performance.
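A hedged sketch of what such a trait could look like, using the usual Holy-trait pattern (all names here are illustrative, not an existing Tables.jl API):

```julia
# Holy-trait pattern: table types opt in to fast random access.
abstract type RowAccess end
struct FastRowAccess <: RowAccess end
struct SlowRowAccess <: RowAccess end

rowaccess(::Type) = SlowRowAccess()                    # conservative default
rowaccess(::Type{<:AbstractVector}) = FastRowAccess()  # eg, row tables

# Consumers dispatch on the trait instead of on concrete table types.
sliced(X, I) = sliced(rowaccess(typeof(X)), X, I)
sliced(::FastRowAccess, X, I) = X[I]           # direct indexing
sliced(::SlowRowAccess, X, I) = collect(X)[I]  # fall back to collecting

rowtab = [(a = 1,), (a = 2,), (a = 3,)]
sliced(rowtab, [3, 1])  # [(a = 3,), (a = 1,)]
```

The advantage over a parent type is that any table type can opt in with a one-line method definition, regardless of what it subtypes.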

@juliohm - the point is that I believe we already have this trait and it is AbstractVector. If you want fast random access you essentially need to support this interface (with getting index, making views etc.).

The only challenge might be that some providers of Tables.jl tables might subtype something other than AbstractVector, and Julia does not currently support multiple inheritance.

Therefore I have written my comment to kind of “challenge” maintainers of the packages and learn if anyone has a problem with subtyping from AbstractVector. If there is no such problem then we are done and the only thing package maintainers need to do is to make sure that Tables.rows returns a subtype of AbstractVector if it can be supported (I believe it should not be problematic - this is what we did in DataFrames.jl).

However, if people comment that Tables.rows must return values that cannot be subtypes of AbstractVector (I doubt it but I might not know something) then we indeed need some trait like Tables.isabstractvector which would return true even for non-AbstractVector types that support AbstractVector interface.

1 Like

Tables are basically a collection of Vectors; I have not yet understood why people think they are not suited for random access.

I certainly do with the table types I maintain in Meshes.jl/GeoStats.jl. There we already have a different parent type that is not an AbstractVector. I think AbstractVector is extremely overused in Julia.

So, if we could have a trait function instead of a parent type, that could help maybe.

That is the column view of tables @Henrique_Becker. Some tables have “infinite” rows, and so we can’t randomly access index 100004320 efficiently. We need to iterate until we get there; the data source produces the rows on the fly.