There are various efforts in the tables ecosystem to create a common language for table operations.
I couldn’t find in any of these efforts an operation to select rows of a table based on indices in a lazy fashion. Is there something conceptual that I am missing about taking views of tables in the vertical dimension? Why does none of these efforts provide a viewtable(table, rows, cols) as a generalization of view for arrays?
I mean, DataFrames.jl already works. If anything, it would belong in Tables.jl.
```julia
using DataFrames

df = DataFrame(A = 1:3, B = [2, 1, 2])
@view df[[1, 2], :]  # this already works: a lazy SubDataFrame over selected rows
```
If you are returning columns with Tables.getcolumn you don’t have to return a materialized column, just any object that can be indexed. Is that what you mean?
At least in the case of Tables.jl/TableOperations.jl (and Query.jl actually), their interfaces are “more generic” than allowing row sub-setting. All you have to provide are row iterators, which means the table doesn’t (and can’t in some cases, e.g. w/ forward-only row streaming tables) have to implement arbitrary random-access indexing/view capabilities. I’ve said elsewhere that I think it’d be useful to have an IndexableTables.jl interface that added this requirement (i.e. tables must support random-access) and then you could define a generic row-indexing/view operation on any IndexableTable. Just takes someone to sit down and write out all the details, and help do some implementations to get something working.
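A hypothetical sketch of what such an IndexableTables-style contract could look like, in plain Base Julia (the names indexablerows and rowview are invented for illustration, and a NamedTuple of vectors stands in for a column table — this is not an actual Tables.jl API):

```julia
# Hypothetical trait: does this table support random access to its rows?
indexablerows(::Any) = false
indexablerows(::NamedTuple) = true  # columns stored as indexable vectors

# Generic lazy row subsetting for any table that opts in.
function rowview(table, rows)
    indexablerows(table) || throw(ArgumentError("table does not support random row access"))
    map(col -> view(col, rows), table)  # named tuple of views, no copies
end

t = (A = 1:5, B = [10, 20, 30, 40, 50])
v = rowview(t, 2:3)  # lazy: v.B aliases t.B, so mutations to t show through v
```

A forward-only streaming table would simply never opt in to the trait, so the generic operation fails loudly instead of silently materializing.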
Personally, I find I can do the work I’m interested in without needing views over rows, so I haven’t been highly motivated to work on such an interface. I find I can always change my data processing operations to either work on a stream of rows (i.e. Tables.rows), or materialize entire columns, which do support indexing in the Tables.jl interface (i.e. Tables.columns returns a set of indexable columns).
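The distinction between the two access patterns can be illustrated in plain Base Julia (no Tables.jl loaded here; a NamedTuple of vectors stands in for a column table, and a generator stands in for a forward-only row stream):

```julia
coltable = (A = [1, 2, 3], B = [2, 1, 2])

# Row-streaming style: enough for a single ordered pass, no indexing required.
rowstream = (NamedTuple{keys(coltable)}(getindex.(values(coltable), i))
             for i in 1:3)
total = sum(r.A for r in rowstream)  # consumes the stream once

# Column style: columns are indexable, so random access and views come for free.
lastrows = view(coltable.B, 2:3)
```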
Thank you all. I think the issue was pointed out perfectly by @quinnj: we lack an indexable table interface for which views over rows are permitted.
In my applications the tables are always finite and we know a priori the number of rows. They are not necessarily loaded into memory, but we would like to query the i-th element in the database as needed for some local computation.
For now, I think I will have to materialize the table as columns with Tables.columns and then slice the columns vertically.
This is what I’ve done with DimensionalData.jl for similar reasons. The columns returned by Tables.getcolumn are lazy AbstractVectors, so I don’t have to allocate full columns for each dimension index. The array data is just reshaped with vec as it’s in-memory.
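A minimal illustration of that point in Base Julia (the small matrix here is just a stand-in for in-memory dimension data): vec reshapes without copying, so the resulting "column" aliases the array's storage.

```julia
A = [1 2; 3 4]   # in-memory grid data
col = vec(A)     # column-major flattening; shares storage with A
A[1, 2] = 99     # mutate the array...
col[3]           # ...and the change is visible through the flat column
```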
What will be interesting is doing this with disk-based data in GeoData.jl. It would be good to hook into the chunking system in DiskArrays.jl to read out the table to match the disk chunks.
What are the data sources you want to load lazily? I’m assuming this is for GeoStats.jl.
The requirements I have in mind have to do with the fact that many partitioning algorithms should return lazy views of the table rows as opposed to copying chunks of the table in different spatial regions. Later on, this “lazy loading from disk” issue will also be important, but I am just trying to get the internal algorithms working without copies of the properties.
I will try to wrap the internal table of the spatial data in GeoStats.jl in a struct, and then trigger the slice on Tables.columns when necessary.
This is the function I introduced temporarily for the views:
```julia
# helper function for a lazy view of table rows and columns
function viewtable(table, rows, cols)
  t = Tables.columns(table)
  v = map(cols) do c
    col = Tables.getcolumn(t, c)
    c => view(col, rows)
  end
  (; v...)  # named tuple of column views, i.e. a Tables.jl column table
end
```
It returns a “column table” according to the Tables.jl docs. It is lazy in the sense that I am taking views of columns and returning these views in a named tuple. The function starts by assuming that Tables.columns is available, i.e. Tables.columnaccess(table) == true. It can be made more robust by treating the other case, Tables.rowaccess(table) == true, where one could collect the rows and then return a view.
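A sketch of that more robust version in plain Base Julia, with stand-ins instead of the actual Tables.jl traits (a NamedTuple of vectors plays the column-access table, a vector of NamedTuples plays the row-access table whose columns must be collected before taking views; the dispatch-based design is illustrative only):

```julia
# Column-access path: take views of the columns directly (lazy, no copies).
viewtable(t::NamedTuple, rows, cols) =
    NamedTuple{Tuple(cols)}(Tuple(view(getfield(t, c), rows) for c in cols))

# Row-access fallback: collect the rows into columns once, then take views.
function viewtable(t::AbstractVector{<:NamedTuple}, rows, cols)
    columns = NamedTuple{Tuple(cols)}(Tuple([r[c] for r in t] for c in cols))
    viewtable(columns, rows, cols)
end

ct = (A = 1:3, B = [2, 1, 2])
v = viewtable(ct, 1:2, (:A, :B))           # views over a column table

rt = [(A = 1, B = 2), (A = 2, B = 1), (A = 3, B = 2)]
v2 = viewtable(rt, 2:3, (:A,))             # fallback for a row table
```

In a real implementation the two methods would be selected by checking Tables.columnaccess and Tables.rowaccess rather than by dispatching on concrete types.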