Random access to rows of a table

Truly infinite tables seems like a niche case, even more niche if their value at some index cannot be computed by a closed formula. I may be wrong about how rare they are but my personal experience is that I have never seen such tables. It would not be the case of having a getindex that is linear in computational effort (for these niche tables, not in general), and note you should design your algorithms with this in mind?

Another more common niche case was thinking about are tables that are finite but too big to fit into memory, but in this case, you yet have something akin to random access, it needs to load a chunk from disk and can be unbearably slow, but if you just need that position it is yet better than starting from the beginning (and loading all the table up to that point.)

1 Like

It is often useful to select a single row or a subset of rows, and a general Tables method would be great!

Currently, even when Tables.rows is a vector, indexing it is not the same: this can require copies even if the table type can provide efficient indexing. And generally it’s more convenient to keep the same table type when selecting a subset of rows.

There are two options when subsetting rows, by the way: a copy or a view. Note that many tables already support random indexing with exactly the same interface, both copy and view:

# subset of rows: returns the same kind of table
julia> tbl[10:20]  # table with rows 10-20, copy of the data
julia> @view tbl[10:20]  # table with rows 10-20, sharing data

# single row:
julia> tbl[10]  # row #10

For example, this works with Vector{NamedTuple}, StructArray, TypedTables, and others.

But while convenient, I think getindex/view is unlikely to become the Tables.jl approach: potentially, a table can do something else when indexing with a number.

Isn’t it opt-in rather than opt-out? For example, the most basic column table, namedtuple of vectors, clearly can be indexed efficiently. However, rows(tbl) isn’t a vector and not indexable:

julia> tbl = (a=[1,2], b=[3,4])

julia> Tables.rows(tbl)
Tables.RowIterator{NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Int64}}}}((a = [1, 2], b = [3, 4]), 2)
2 Likes

This is definitely one use-case we think about for ML stuff. Faster disks are not hard to come by now, and even without them there’s a benefit to not materializing everything in memory first. If theoretically Tables.rows guaranteed no materialization for types that support random access, then that could be the way to go. However that seems way too restrictive to be realistic.

I think there is one crucial issue here. I will relate it to the following comment:

The point is that Tables.rows can return any object type. E.g. in DataFrames.jl when you call Tables.rows(df) where df is a data frame then you get DataFrameRows object that has different properties than AbstractDataFrame. For example:

  • DataFrameRows is a subtype of AbstractVector;
  • DataFrameRows uses a single dimensional index (indicating a row) while AbstractDataFrame uses two dimensional indexing.

For example, the most basic column table, namedtuple of vectors, clearly can be indexed efficiently. However, rows(tbl) isn’t a vector and not indexable:

This is true, and if we agree on the interface I mentioned above the return value of Tables.rows for this argument should be changed.

That is why I think the decision is: either we want to rely on AbstractVector subtyping (this is easier as it does not require introducing new concepts in the ecosystem) or we decide that it is not enough and we need a trait that is not anchored in type system.

Sorry for the slow response on all this; I had a chance to catch up on some of the discussion and started playing around with an idea of what this could potentially look like in Tables.jl; take a look and let me know what you think: Start an idea of what an "in memory" requirement would look like by quinnj · Pull Request #278 · JuliaData/Tables.jl · GitHub.

1 Like

Thanks @nalimilan for your earlier clarification and @bkamins’s follow up comments.

Yes, you are right, we can separate the kinds of row-indexing we want
into to cases:

  1. One in which we create an object for which row-indexing is
    guaranteed and fast (this would be Tables.rows(df)[row_idicees] for
    a dataframe).

  2. One in which we attempt to return an object of similar type (if
    possible) with the selected rows, even if doing so is not
    particularly fast (this would be df[row_indices, :] or @view df[row_indices, :] for a dataframe).

Discussion of solutions above has focused on 1, which is helpful. But
I am also interested in 2. I don’t have a fixed view of what “similar
type” should mean, apart from the following basic requirement (because
it is not guaranteed in case 1): Indexing a column-based table
(Tables.columnaccess(X) == true) returns a column-based table. I
want this because some ML models are more efficient if data is
presented using a column-based table, so I’m not always happy to
destroy that property when I resample (e.g., for
cross-validation).

If only Case 1 indexing is implemented, then I would need to call Table.rows, then index, then convert back to a column table, which would be less efficient, but admittedly might still be better than training the model using a row-based table.

Out of the three options you have listed @view df[row_indices, :] is fastest and Tables.rows(df)[row_idicees] is currently slowest (we have discussed with @nalimilan that we can make it faster if Tables.rows gets a more widespread use).

To my understanding the discussion above concentrated on Tables.rows(df)[row_idicees] because:

  1. you asked about efficient random access to observations in the original post (not to randomly subsetted features)
  2. in this case we can easily agree on an uniform syntax (relying on getindex).

The problem with df[row_indices, :] or @view df[row_indices, :] in data frames is that if instead you have e.g. a NamedTuple of vectors it cannot support such indexing. And the converse - the indexing that NamedTuple supports cannot be supported by DataFrame.

An additional difficulty, which is seen in the df[row_indices, :] or @view df[row_indices, :] example is that it then should be established if the returned object should be a copy or a view of the source.

Given these, to my understanding, the current state of access to features in a table is that you call Tables.columns on a table and the returned object guarantees you have access to columns that are 1-based indexable and have known length. Then it is up to the user to decide how user wants to work with the features (make a copy, make a view, or just use them as they are).

Given these considerations I propose we start with a discussion what you want. I understand that the core of your request is:

Indexing a column-based table ( Tables.columnaccess(X) == true ) returns a column-based table

Such a requirement cannot be currently met because tables do not have to support indexing (and I think it is unlikely that we can add a requirement that they could support indexing). To see this consider the following tables:

julia> t1 = [(a=1, b=2), (a=3, b=4)]
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 1, b = 2)
 (a = 3, b = 4)

julia> t2 = (a=[1, 3], b=[2, 4])
(a = [1, 3], b = [2, 4])

They are both tables, but have a different indexing scheme and this cannot be fixed (and some other table types might not support indexing at all).

I propose that in order to move forward could you propose the exact API you would imagine would be most useful for you. In particular it should take into account the fact that (maybe the list should be longer but these are two aspects that I currently see as important):

  • some tables are row-oriented and other are column-oriented;
  • if you want a copy or a view.

Then the question will be if the API you ask can be already built using the low-level primitives that Tables.jl supports (so that for example MLUtils.jl can define it) or we need to make an addition to Tables.jl (or maybe even DataAPI.jl) API to make what you want convenient and efficient.

3 Likes

I have opened https://github.com/JuliaML/MLUtils.jl/issues/67 so that we can discuss all the issues from the whole ML workflow perspective. I hope in this in a way that later we will be able to turn this into PRs to DataAPI.jl or Tables.jl so that ML community gets all what is needed.

2 Likes