Tables.jl: columntable to rowtable

I’m trying to implement an interface to a column-based table storage backend, but I’m struggling to make it work efficiently. The minimal implementation looks like this:

import Tables

struct MyTable <: Tables.AbstractColumns
end

Tables.istable(::MyTable) = true
Tables.columnaccess(::MyTable) = true
Tables.columns(table::MyTable) = table

Tables.columnnames(table::MyTable) = [:A, :B, :C]
function Tables.getcolumn(table::MyTable, colname::Symbol)
    sleep(1)  # simulate loading the column from somewhere
    return collect(1:10)
end

It works, but is much slower than it should be: for example, taking the first row takes 4 seconds whereas it could be done in 3:

@time display(MyTable() |> Tables.rows |> first)

Tables.ColumnsRow{MyTable}:
 :A  1
 :B  1
 :C  1
  4.148632 seconds (246.80 k allocations: 12.727 MiB)

It’s even worse if I try to convert a MyTable to a rowtable - a Vector of NamedTuples:

@time MyTable() |> Tables.rowtable


 31.176221 seconds (338.99 k allocations: 17.856 MiB)
10-element Array{NamedTuple{(:A, :B, :C),Tuple{Int64,Int64,Int64}},1}:
 (A = 1, B = 1, C = 1)   
 (A = 2, B = 2, C = 2)   
 (A = 3, B = 3, C = 3)   
 (A = 4, B = 4, C = 4)   
 (A = 5, B = 5, C = 5)   
 (A = 6, B = 6, C = 6)   
 (A = 7, B = 7, C = 7)   
 (A = 8, B = 8, C = 8)   
 (A = 9, B = 9, C = 9)   
 (A = 10, B = 10, C = 10)

That is, Tables.jl loads each column in full every time it needs to access a row.

The only time-efficient way I found to convert MyTable to a rowtable is going through a DataFrame:

using DataFrames
@time MyTable() |> DataFrame |> Tables.rowtable


  3.052622 seconds (129.00 k allocations: 6.975 MiB)
10-element Array{NamedTuple{(:A, :B, :C),Tuple{Int64,Int64,Int64}},1}:
 (A = 1, B = 1, C = 1)   
 (A = 2, B = 2, C = 2)   
 (A = 3, B = 3, C = 3)   
 (A = 4, B = 4, C = 4)   
 (A = 5, B = 5, C = 5)   
 (A = 6, B = 6, C = 6)   
 (A = 7, B = 7, C = 7)   
 (A = 8, B = 8, C = 8)   
 (A = 9, B = 9, C = 9)   
 (A = 10, B = 10, C = 10)

Even faster than getting just the first row using the Tables API!

So, my question is: am I missing something here? Any ideas of what is wrong in my usage of the Tables API, and how is it supposed to work efficiently for such a use case?

You need to implement Tables.rows yourself to make this work.

function Tables.rows(m::MyTable)
    # materialize every column once into a NamedTuple, then use the default row iteration over that
    Tables.rows(NamedTuple(m))
end

This will bring it down to 3 seconds, just the time it takes to call getcolumn 3 times.

I’m not sure what your ultimate goals are here, but I wonder if you should do all your collecting when you initialize your MyTable type, paying a large upfront cost on creating the table rather than paying it over and over again whenever getcolumn gets called.
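Something like this rough sketch is what I have in mind; MyEagerTable and the load helper are just made-up names for illustration, not part of your original code:

import Tables

# Hypothetical eager variant: every column is loaded once, at construction time.
struct MyEagerTable <: Tables.AbstractColumns
    cols::NamedTuple   # column name => fully materialized vector
end

function MyEagerTable()
    load(name) = (sleep(1); collect(1:10))   # simulate loading a column from the backend
    return MyEagerTable((A = load(:A), B = load(:B), C = load(:C)))
end

Tables.istable(::MyEagerTable) = true
Tables.columnaccess(::MyEagerTable) = true
Tables.columns(t::MyEagerTable) = t
Tables.columnnames(t::MyEagerTable) = propertynames(getfield(t, :cols))
Tables.getcolumn(t::MyEagerTable, name::Symbol) = getfield(t, :cols)[name]   # O(1) now
Tables.getcolumn(t::MyEagerTable, i::Int) = getfield(t, :cols)[i]

Constructing the table then costs 3 seconds once, and any subsequent row or column access is cheap.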

I thought about the possibility of loading all the data upfront, but in that case I lose the “lazy” access to columns: they will always be loaded, even if I only want to access one specific column this time.

Hmmm… yeah, I’m not sure how to approach this. In theory Tables.jl should work well for lazy tables, but I don’t know exactly how to implement it. I think the “laziness” is more about lazy iteration of rows than of columns.

My first thought is to have an addcolumn! function that materializes a new column as needed, and then an update! method that searches for and updates all the columns from your source as needed, roughly as sketched below. But I don’t have a lot of experience with this.
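Very roughly, something like this is what I’m imagining; addcolumn!, update!, and MyCachedTable are made-up names for a sketch, not an existing API:

# Hypothetical sketch: a mutable cache of materialized columns.
struct MyCachedTable
    cache::Dict{Symbol, Vector{Int}}
end
MyCachedTable() = MyCachedTable(Dict{Symbol, Vector{Int}}())

# Materialize a single column on demand; later calls reuse the cached copy.
function addcolumn!(t::MyCachedTable, name::Symbol)
    return get!(t.cache, name) do
        sleep(1)               # simulate loading the column from the backend
        collect(1:10)
    end
end

# Re-fetch every column that has already been materialized.
function update!(t::MyCachedTable)
    for name in collect(keys(t.cache))
        sleep(1)               # simulate re-loading from the backend
        t.cache[name] = collect(1:10)
    end
    return t
end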

The 4 seconds for fetching the first row is because the first column is accessed once when Tables.rows is called, to determine the length of all the columns; then each column is accessed sequentially. So not really a surprise here? If you really wanted this to be just 3 seconds, you could overload Tables.rowcount(::MyTable) directly to avoid the generic definition being called.
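A minimal sketch of that overload, assuming the backend can report the number of rows without fetching a column (hard-coded here to match the toy example):

# Assumes the row count is known cheaply; avoids the generic rowcount
# definition, which loads the first column just to take its length.
Tables.rowcount(::MyTable) = 10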

The slow rowtable conversion is again the expected behavior: every row access calls getcolumn for each column. It’s certainly expected that Tables.getcolumn(table, colname) is O(1), and we should enhance the documentation to say so.

Again, this is not that surprising: you pay the 3-second cost to access each column once to put it inside a DataFrame, and then the DataFrame does the normal O(1) iteration through rows with views, which is fast.

I’m not totally clear on where the disconnect is here, but I have a guess. If I were making a Tables.jl interface implementation to a columnar-storage format, I would ensure that Tables.getcolumn(table, colname) returned some kind of lazy AbstractArray, thus ensuring getcolumn is very fast, and then I just have to ensure getindex(::MyLazyColumn, i::Int) actually materializes the data. That should allow you to use laziness and still rely on the default Tables.rows definitions provided by Tables.jl. Otherwise, as was suggested, you can just implement your own Tables.rows definition to ensure columns are only materialized once and “row views” are iterated. Happy to chat about strategies here if you can provide additional information or have other questions.
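For concreteness, here is a rough sketch of that lazy-column idea; MyLazyColumn matches the name used above, while the MyLazyTable wrapper and the caching details are just assumptions for illustration:

import Tables

# Hypothetical lazy column: cheap to construct, loads and caches its data
# only when an element is first indexed.
mutable struct MyLazyColumn <: AbstractVector{Int}
    name::Symbol
    len::Int
    data::Union{Nothing, Vector{Int}}
end

Base.size(c::MyLazyColumn) = (c.len,)

function Base.getindex(c::MyLazyColumn, i::Int)
    if c.data === nothing
        sleep(1)                       # simulate fetching the column from the backend
        c.data = collect(1:c.len)
    end
    return c.data[i]
end

# The table hands out the *same* lazy column object on every getcolumn call,
# so getcolumn stays O(1) and each column is materialized at most once.
struct MyLazyTable <: Tables.AbstractColumns
    cols::Dict{Symbol, MyLazyColumn}
end
MyLazyTable() = MyLazyTable(Dict(n => MyLazyColumn(n, 10, nothing) for n in (:A, :B, :C)))

Tables.istable(::MyLazyTable) = true
Tables.columnaccess(::MyLazyTable) = true
Tables.columns(t::MyLazyTable) = t
Tables.columnnames(t::MyLazyTable) = [:A, :B, :C]
Tables.getcolumn(t::MyLazyTable, name::Symbol) = getfield(t, :cols)[name]
Tables.getcolumn(t::MyLazyTable, i::Int) = Tables.getcolumn(t, Tables.columnnames(t)[i])

With something like this, MyLazyTable() |> Tables.rowtable should take about 3 seconds: each column is fetched exactly once, on its first access, and every later row access hits the cached vector.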


Thanks a lot for this thorough explanation! The major part I missed before was basically this:

If I were making a Tables.jl interface implementation to a columnar-storage format, I would ensure that Tables.getcolumn(table, colname) returned some kind of lazy AbstractArray, thus ensuring getcolumn is very fast, and then I just have to ensure getindex(::MyLazyColumn, i::Int) actually materializes the data. That should allow you to use laziness and still rely on the default Tables.rows definitions provided by Tables.jl.