Of the three options you have listed, @view df[row_indices, :] is the fastest and Tables.rows(df)[row_indices] is currently the slowest (we have discussed with @nalimilan that we can make it faster if Tables.rows gets more widespread use).
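For reference, the three options can be written side by side as in the minimal sketch below. The data frame df and the row_indices vector are made up for illustration; only the three indexing expressions come from this thread, and the sketch makes no claim about timings.

```julia
using DataFrames, Tables

# Hypothetical toy data, not taken from the original post.
df = DataFrame(a = [1, 2, 3, 4], b = [10.0, 20.0, 30.0, 40.0])
row_indices = [3, 1]

sub_copy = df[row_indices, :]            # new DataFrame, selected rows are copied
sub_view = @view df[row_indices, :]      # view into df, no data copied
sub_rows = Tables.rows(df)[row_indices]  # collection of the selected rows

# All three expose the same observations, in the requested order.
a_copy = sub_copy.a
a_view = sub_view.a
a_rows = [row.a for row in sub_rows]
```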
To my understanding, the discussion above concentrated on Tables.rows(df)[row_indices] because:
- you asked about efficient random access to observations in the original post (not to randomly subsetted features);
- in this case we can easily agree on a uniform syntax (relying on getindex).
The problem with df[row_indices, :] or @view df[row_indices, :] in data frames is that if you instead have e.g. a NamedTuple of vectors, it cannot support such indexing. And the converse also holds: the indexing that NamedTuple supports cannot be supported by DataFrame.
An additional difficulty, which is seen in the df[row_indices, :] or @view df[row_indices, :] example, is that it then has to be established whether the returned object should be a copy or a view of the source.
Given these, to my understanding, the current state of access to features in a table is that you call Tables.columns on a table, and the returned object guarantees you access to columns that are 1-based indexable and have a known length. Then it is up to the user to decide how they want to work with the features (make a copy, make a view, or just use them as they are).
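As a minimal sketch of what this looks like in practice, assuming a made-up column-oriented table t and index vector idx (neither is from the original post), a column can be fetched through Tables.columns and then copied, viewed, or used directly:

```julia
using Tables

# Hypothetical column-oriented table (a NamedTuple of vectors).
t = (a = [1, 3, 5], b = [2, 4, 6])
idx = [3, 1]

cols = Tables.columns(t)               # object exposing the table's columns
colnames = Tables.columnnames(cols)    # names of the available columns
col_a = Tables.getcolumn(cols, :a)     # the column itself, used as it is

a_copy = col_a[idx]                    # copy of the selected observations
a_view = view(col_a, idx)              # view into the source column

col_a[3] = 50                          # mutate the source column
# a_copy still holds the old values, while a_view reflects the change.
```

Which of the three is appropriate depends on whether the caller may mutate the result and whether the extra allocation matters.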
Given these considerations, I propose we start with a discussion of what you want. I understand that the core of your request is:
Indexing a column-based table (Tables.columnaccess(X) == true) returns a column-based table
Such a requirement cannot currently be met because tables do not have to support indexing (and I think it is unlikely that we can add a requirement that they support indexing). To see this, consider the following tables:
julia> t1 = [(a=1, b=2), (a=3, b=4)]
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 1, b = 2)
(a = 3, b = 4)
julia> t2 = (a=[1, 3], b=[2, 4])
(a = [1, 3], b = [2, 4])
They are both tables, but they have different indexing schemes, and this cannot be fixed (and some other table types might not support indexing at all).
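The mismatch can be checked directly. The sketch below reuses t1 and t2 from above and only relies on Tables.istable, Tables.rowtable and Tables.columntable; the variable names for the results are chosen here for illustration.

```julia
using Tables

t1 = [(a=1, b=2), (a=3, b=4)]   # row-oriented: a vector of NamedTuples
t2 = (a=[1, 3], b=[2, 4])       # column-oriented: a NamedTuple of vectors

both_tables = Tables.istable(t1) && Tables.istable(t2)

# The same getindex call means different things for the two tables.
first_of_t1 = t1[1]   # first row (an observation)
first_of_t2 = t2[1]   # first column (a feature)

# Going through the Tables.jl interface removes the ambiguity.
rows_of_t2 = Tables.rowtable(t2)     # t2 seen as a vector of rows
cols_of_t1 = Tables.columntable(t1)  # t1 seen as a NamedTuple of columns
```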
To move forward, could you propose the exact API that you imagine would be most useful for you? In particular, it should take into account the following (maybe the list should be longer, but these are the two aspects that I currently see as important):
- some tables are row-oriented and others are column-oriented;
- whether you want a copy or a view.
Then the question will be whether the API you ask for can already be built using the low-level primitives that Tables.jl supports (so that, for example, MLUtils.jl can define it), or whether we need to make an addition to the Tables.jl (or maybe even DataAPI.jl) API to make what you want convenient and efficient.
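To make the question concrete, here is one possible shape such an API could take when built only on existing Tables.jl primitives. The function select_observations and its as_view keyword are hypothetical names invented for this sketch; they are not part of Tables.jl, DataAPI.jl or MLUtils.jl, and this is not a proposal for the final design.

```julia
using Tables

# Hypothetical helper: select observations from any Tables.jl table and
# return a column-oriented table (a NamedTuple of vectors or views).
function select_observations(table, idx; as_view::Bool = false)
    cols = Tables.columns(table)
    colnames = Tables.columnnames(cols)
    pick = as_view ? (col -> view(col, idx)) : (col -> col[idx])
    selected = [pick(Tables.getcolumn(cols, name)) for name in colnames]
    return NamedTuple{Tuple(colnames)}(Tuple(selected))
end

t1 = [(a=1, b=2), (a=3, b=4)]   # row-oriented source
t2 = (a=[1, 3], b=[2, 4])       # column-oriented source

from_rows = select_observations(t1, [2])
from_cols = select_observations(t2, [2]; as_view = true)
```

Note that for a row-oriented source such as t1, Tables.columns first has to materialize the columns, so as_view = true gives a view of that materialized copy rather than of the original rows. This is one of the points an agreed API would need to settle.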