I’m trying to implement an interface to a column-based table storage backend, but I’m struggling to make it work efficiently. The minimal implementation looks like this:
import Tables
struct MyTable <: Tables.AbstractColumns
end
Tables.istable(::MyTable) = true
Tables.columnaccess(::MyTable) = true
Tables.columns(table::MyTable) = table
Tables.columnnames(table::MyTable) = [:A, :B, :C]
function Tables.getcolumn(table::MyTable, colname::Symbol)
    sleep(1) # simulate loading the column from somewhere
    return collect(1:10)
end
It works, but it is much slower than it should be. For example, taking the first row takes 4 seconds, whereas it could be done in 3 (one second for each of the three columns):
@time display(MyTable() |> Tables.rows |> first)
Tables.ColumnsRow{MyTable}:
:A 1
:B 1
:C 1
4.148632 seconds (246.80 k allocations: 12.727 MiB)
It’s even worse if I try to convert a MyTable to a rowtable, i.e. a Vector of NamedTuples:
@time MyTable() |> Tables.rowtable
31.176221 seconds (338.99 k allocations: 17.856 MiB)
10-element Array{NamedTuple{(:A, :B, :C),Tuple{Int64,Int64,Int64}},1}:
(A = 1, B = 1, C = 1)
(A = 2, B = 2, C = 2)
(A = 3, B = 3, C = 3)
(A = 4, B = 4, C = 4)
(A = 5, B = 5, C = 5)
(A = 6, B = 6, C = 6)
(A = 7, B = 7, C = 7)
(A = 8, B = 8, C = 8)
(A = 9, B = 9, C = 9)
(A = 10, B = 10, C = 10)
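That is, Tables seems to load each entire column again every time it needs to access a row. This is consistent with the 31 seconds above: 10 rows × 3 columns, plus one more load.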
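To make the repeated loading visible without waiting on sleep, here is a minimal sketch that counts column loads instead. CountingTable and LOAD_COUNT are hypothetical names introduced only for this illustration, and the index-based getcolumn method is an addition that simply forwards to the name-based one. The exact count may depend on the Tables.jl version, so no expected number is given.

```julia
import Tables

# Hypothetical counter (not part of the original code): number of column loads so far.
const LOAD_COUNT = Ref(0)

# Same shape as MyTable, but counts loads instead of sleeping.
struct CountingTable <: Tables.AbstractColumns end

Tables.istable(::CountingTable) = true
Tables.columnaccess(::CountingTable) = true
Tables.columns(table::CountingTable) = table
Tables.columnnames(::CountingTable) = [:A, :B, :C]

function Tables.getcolumn(::CountingTable, colname::Symbol)
    LOAD_COUNT[] += 1      # record that a whole column was "loaded"
    return collect(1:10)
end

# Index-based access forwards to the name-based method above.
Tables.getcolumn(table::CountingTable, i::Int) =
    Tables.getcolumn(table, Tables.columnnames(table)[i])

rows = Tables.rowtable(CountingTable())
println("rows: ", length(rows), ", column loads: ", LOAD_COUNT[])
```

If the count printed is well above the number of columns, the columns are being reloaded per row rather than once per table.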
The only time-efficient way I found to convert MyTable to a rowtable is going through a DataFrame:
@time MyTable() |> DataFrame |> Tables.rowtable
3.052622 seconds (129.00 k allocations: 6.975 MiB)
10-element Array{NamedTuple{(:A, :B, :C),Tuple{Int64,Int64,Int64}},1}:
(A = 1, B = 1, C = 1)
(A = 2, B = 2, C = 2)
(A = 3, B = 3, C = 3)
(A = 4, B = 4, C = 4)
(A = 5, B = 5, C = 5)
(A = 6, B = 6, C = 6)
(A = 7, B = 7, C = 7)
(A = 8, B = 8, C = 8)
(A = 9, B = 9, C = 9)
(A = 10, B = 10, C = 10)
That is even faster than getting just the first row using the Tables API!
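A possible variant of the same idea without DataFrames, offered only as a sketch: materialize the columns once with Tables.columntable and build the rows from the in-memory result. This assumes that Tables.columntable calls getcolumn once per column for a source like this one; I have not confirmed that across Tables.jl versions. LazyTable and LOADS are hypothetical names used for illustration, with a counter standing in for the sleep call.

```julia
import Tables

# Hypothetical counter (illustration only): number of column loads so far.
const LOADS = Ref(0)

struct LazyTable <: Tables.AbstractColumns end

Tables.istable(::LazyTable) = true
Tables.columnaccess(::LazyTable) = true
Tables.columns(table::LazyTable) = table
Tables.columnnames(::LazyTable) = [:A, :B, :C]

function Tables.getcolumn(::LazyTable, colname::Symbol)
    LOADS[] += 1           # stands in for the expensive load
    return collect(1:10)
end

# Index-based access forwards to the name-based method above.
Tables.getcolumn(table::LazyTable, i::Int) =
    Tables.getcolumn(table, Tables.columnnames(table)[i])

# Step 1: pull every column into memory as a NamedTuple of vectors.
cols = Tables.columntable(LazyTable())
# Step 2: build the rows from the in-memory columns; no further loads are needed here.
rows = Tables.rowtable(cols)

println("rows: ", length(rows), ", column loads: ", LOADS[])
```

The design choice is the same as in the DataFrame route: pay the loading cost once up front, then iterate rows over data that is already in memory.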
So, my question is: am I missing something here? How is this supposed to work efficiently?