There are various efforts in the tables ecosystem to create a common language for table operations.
I couldn’t find in any of these efforts an operation to select rows of a table based on indices in a lazy fashion. Is there something conceptual that I am missing about taking views of tables in the vertical dimension? Why does none of these efforts provide a viewtable(table, rows, cols) as a generalization of view for arrays?
I mean, DataFrames.jl already works. If anything, it would belong in Tables.jl.
```julia
using DataFrames

df = DataFrame(A = 1:3, B = [2, 1, 2])
@view df[[1, 2], :]  # this already works: a lazy SubDataFrame over selected rows
```
If you are returning columns with Tables.getcolumn you don’t have to return a materialized column, just any object that can be indexed. Is that what you mean?
At least in the case of Tables.jl/TableOperations.jl (and Query.jl actually), their interfaces are “more generic” than allowing row sub-setting. All you have to provide are row iterators, which means the table doesn’t (and can’t in some cases, e.g. w/ forward-only row streaming tables) have to implement arbitrary random-access indexing/view capabilities. I’ve said elsewhere that I think it’d be useful to have an IndexableTables.jl interface that added this requirement (i.e. tables must support random-access) and then you could define a generic row-indexing/view operation on any IndexableTable. Just takes someone to sit down and write out all the details, and help do some implementations to get something working.
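A hypothetical sketch of what such an IndexableTables-style contract could look like, in plain Base Julia (the names indexablerows and rowview are invented for illustration, and a NamedTuple of vectors stands in for a column table — this is not an actual Tables.jl API):

```julia
# Hypothetical trait: does this table support random access to its rows?
indexablerows(::Any) = false
indexablerows(::NamedTuple) = true  # columns stored as indexable vectors

# Generic lazy row subsetting for any table that opts in.
function rowview(table, rows)
    indexablerows(table) || throw(ArgumentError("table does not support random row access"))
    map(col -> view(col, rows), table)  # named tuple of views, no copies
end

t = (A = 1:5, B = [10, 20, 30, 40, 50])
v = rowview(t, 2:3)  # lazy: v.B aliases t.B, so mutations to t show through v
```

A forward-only streaming table would simply never opt in to the trait, so the generic operation fails loudly instead of silently materializing.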
Personally, I find I can do the work I’m interested in without needing views over rows, so I haven’t been highly motivated to work on such an interface. I find I can always change my data processing operations to either work on a stream of rows (i.e. Tables.rows), or materialize entire columns, which do support indexing in the Tables.jl interface (i.e. Tables.columns returns a set of indexable columns).
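The distinction between the two access patterns can be illustrated in plain Base Julia (no Tables.jl loaded here; a NamedTuple of vectors stands in for a column table, and a generator stands in for a forward-only row stream):

```julia
coltable = (A = [1, 2, 3], B = [2, 1, 2])

# Row-streaming style: enough for a single ordered pass, no indexing required.
rowstream = (NamedTuple{keys(coltable)}(getindex.(values(coltable), i))
             for i in 1:3)
total = sum(r.A for r in rowstream)  # consumes the stream once

# Column style: columns are indexable, so random access and views come for free.
lastrows = view(coltable.B, 2:3)
```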
Thank you all. I think the issue was pointed out perfectly by @quinnj: we lack an indexable table interface for which views over rows are permitted.
In my applications the tables are always finite and we know a priori the number of rows. They are not necessarily loaded into memory, but we would like to query the i-th element in the database as needed for some local computation.
For now, I think I will have to materialize the table as columns with Tables.columns and then slice the columns vertically.
This is what I’ve done with DimensionalData.jl for similar reasons. The columns returned by Tables.getcolumn are lazy AbstractVectors, so I don’t have to allocate full columns for each dimension index. The array data is just reshaped with vec as it’s in-memory.
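A minimal illustration of that point in Base Julia (the small matrix here is just a stand-in for in-memory dimension data): vec reshapes without copying, so the resulting "column" aliases the array's storage.

```julia
A = [1 2; 3 4]   # in-memory grid data
col = vec(A)     # column-major flattening; shares storage with A
A[1, 2] = 99     # mutate the array...
col[3]           # ...and the change is visible through the flat column
```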
What will be interesting is doing this with disk-based data in GeoData.jl. It would be good to hook into the chunking system in DiskArrays.jl to read out the table to match the disk chunks.
What are the data sources you want to load lazily? I’m assuming this is for GeoStats.jl.
The requirements I have in mind have to do with the fact that many partitioning algorithms should return lazy views of the table rows as opposed to copying chunks of the table in different spatial regions. Later on, this “lazy loading from disk” issue will also be important, but I am just trying to get the internal algorithms working without copies of the properties.
I will try to wrap the internal table of the spatial data in GeoStats.jl in a struct, and then trigger the slice on Tables.columns when necessary.
This is the function I introduced temporarily for the views:
```julia
# helper function for a lazy view of table rows and columns
function viewtable(table, rows, cols)
  t = Tables.columns(table)
  v = map(cols) do c
    col = Tables.getcolumn(t, c)
    c => view(col, rows)
  end
  (; v...)  # named tuple of column views, i.e. a Tables.jl column table
end
```
It returns a “column table” according to the Tables.jl docs. It is lazy in the sense that I am taking views of columns and returning these views in a named tuple. The function starts by assuming that Tables.columns is available, i.e. Tables.columnaccess(table) == true. It can be made more robust by treating the other case, Tables.rowaccess(table) == true, where one could collect the rows and then return a view.
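A sketch of that more robust version in plain Base Julia, with stand-ins instead of the actual Tables.jl traits (a NamedTuple of vectors plays the column-access table, a vector of NamedTuples plays the row-access table whose columns must be collected before taking views; the dispatch-based design is illustrative only):

```julia
# Column-access path: take views of the columns directly (lazy, no copies).
viewtable(t::NamedTuple, rows, cols) =
    NamedTuple{Tuple(cols)}(Tuple(view(getfield(t, c), rows) for c in cols))

# Row-access fallback: collect the rows into columns once, then take views.
function viewtable(t::AbstractVector{<:NamedTuple}, rows, cols)
    columns = NamedTuple{Tuple(cols)}(Tuple([r[c] for r in t] for c in cols))
    viewtable(columns, rows, cols)
end

ct = (A = 1:3, B = [2, 1, 2])
v = viewtable(ct, 1:2, (:A, :B))           # views over a column table

rt = [(A = 1, B = 2), (A = 2, B = 1), (A = 3, B = 2)]
v2 = viewtable(rt, 2:3, (:A,))             # fallback for a row table
```

In a real implementation the two methods would be selected by checking Tables.columnaccess and Tables.rowaccess rather than by dispatching on concrete types.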