Struggling to implement Tables.jl interface for Vector{MyStruct}

Thanks for posting! Hopefully I can help clarify what’s (not) needed here. I can see why this isn’t super clear, so I’ll try to point out the relevant parts of the docs along the way.

One of the key design principles of the Tables.jl interface is that providers/sources only implement what is natural, and consumers only call what’s natural. So right from the get-go, your “table” is row-oriented, so you don’t need to think about providing column-access, only row-access.

The relevant interface functions for row-access are:

Tables.istable
Tables.rowaccess
Tables.rows

Fallback definitions for Tables.istable and Tables.rowaccess say that any iterable can be assumed to be a table and provide row-access, so check: your Vector{MyStruct} is iterable, so will be assumed to be a table and provide row-access. Relatedly, the default Tables.rows definition will basically return the input, though a check will be made that the iterator does actually iterate “rows”.

So boom, your Vector{MyStruct} already satisfies the first three automatically via fallback definitions (if you didn’t want the validation check, you could define Tables.rows(x::Vector{MyStruct}) = x).
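If you'd rather spell things out than lean on the fallbacks, here's a minimal sketch of what the explicit definitions could look like. It assumes Tables.jl 1.x, where the `istable` and `rowaccess` traits are defined on the type; none of this is required for your case.

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

# Optional: declare the table traits explicitly instead of relying on fallbacks.
Tables.istable(::Type{Vector{MyStruct}}) = true
Tables.rowaccess(::Type{Vector{MyStruct}}) = true

# Optional: return the vector itself, skipping the row validation wrapper.
Tables.rows(x::Vector{MyStruct}) = x

t = [MyStruct(1, 2), MyStruct(3, 4)]

Tables.istable(t)     # true, via the definition above
Tables.rowaccess(t)   # true, via the definition above
Tables.rows(t) === t  # true: the input is returned as-is
```

Defining `Tables.rows` this way only makes sense when you know every element really does behave like a row, since you're opting out of the check.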

The 2nd part we need to satisfy is that our “row table” actually iterates “rows”; the relevant docs for this are here. Essentially, we need to define:

Tables.getcolumn(row, i::Int)
Tables.getcolumn(row, nm::Symbol)
Tables.columnnames(row)

But hey, let’s take a look at the default definitions:

Tables.getcolumn(row, i::Int) = getfield(row, i)
Tables.getcolumn(row, nm::Symbol) = getproperty(row, nm)
Tables.columnnames(row) = propertynames(row)

which happen to be exactly what you want for MyStruct! That is, MyStruct already satisfies the AbstractRow interface via the default definitions!
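To make that concrete, here's a small sketch that calls the three row functions directly on a single MyStruct, with no extra definitions. The struct is the one from the question; the variable name `row` is just for illustration.

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

row = MyStruct(1, 2)

# All three calls hit the default definitions shown above.
Tables.getcolumn(row, 1)   # 1.0, positional access via getfield
Tables.getcolumn(row, :b)  # 2.0, access by name via getproperty
Tables.columnnames(row)    # (:a, :b), via propertynames
```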

Wait, so Vector{MyStruct} is already a table?? By default?! Yes! Let’s see it in action:

julia> using DataFrames, Tables, Parquet
[ Info: Precompiling Parquet [626c502c-15b0-58ad-a749-f091afb673ae]

julia> struct MyStruct
           a::Float64
           b::Float64
       end

julia> t = [MyStruct(1, 2), MyStruct(3, 4)]
2-element Array{MyStruct,1}:
 MyStruct(1.0, 2.0)
 MyStruct(3.0, 4.0)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a       │ b       │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 1.0     │ 2.0     │
│ 2   │ 3.0     │ 4.0     │

Boom! We can automatically transform Vector{MyStruct} into a DataFrame without defining anything “column”-related for Vector{MyStruct} (in fact, without defining anything at all, which is by design). This works because, as noted above, Tables.jl wants providers to implement only what is natural for them, without jumping through weird hoops or writing boilerplate code just so columns and rows can talk to each other. Tables.jl itself provides the most efficient “fallback” definitions for transforming rows => columns and vice versa. In this case specifically, Tables.jl defines a Tables.buildcolumns routine that will iterate row tables and “build up” column vectors that column consumers can use. What that means is that DataFrames.jl, as a consumer, doesn’t need to do anything different for row table inputs vs. column table inputs. All it does is call Tables.columns(x) and it will get columns back, regardless of whether the input is row or column oriented.
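You can see the same thing without DataFrames.jl in the picture by calling Tables.columns yourself. This is a rough sketch assuming Tables.jl 1.x; the exact type of the returned column object is an implementation detail, so the example only goes through the generic accessors.

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

t = [MyStruct(1, 2), MyStruct(3, 4)]

# Row-oriented input goes in, a column-accessible object comes out.
cols = Tables.columns(t)

# Read the built-up columns through the generic accessor.
a = Tables.getcolumn(cols, :a)
b = Tables.getcolumn(cols, :b)
```

Here `a` and `b` are ordinary vectors holding the values of each field across all rows.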

Now, we still have the original question of how to get this all to work with Parquet.jl. Given our Vector{MyStruct} is already a table, it should just work, right?

julia> write_parquet("/home/myuser/lala.parquet", t)
ERROR: AssertionError: Tables.columnaccess(tbl)
Stacktrace:
 [1] write_parquet(::String, ::Array{MyStruct,1}; compression_codec::String) at /home/chronos/user/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [2] write_parquet(::String, ::Array{MyStruct,1}) at /home/chronos/user/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [3] top-level scope at REPL[76]:1

Oh shoot! It seems that Parquet.jl is being a bit too opinionated about the table inputs it accepts (code here). What it should do instead is just call tbl = Tables.columns(x), which will “just work” for any column- or row-oriented input; I’ve proposed that change here.
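For what it's worth, here's a sketch of the consumer-side pattern I mean. `column_lengths` is a made-up function for illustration, not anything from Parquet.jl; the point is only that it calls Tables.columns up front instead of asserting Tables.columnaccess.

```julia
using Tables

# Hypothetical consumer (not Parquet.jl's writer): accepts any table input.
function column_lengths(x)
    cols = Tables.columns(x)  # columns come back for row or column tables alike
    return Dict(name => length(Tables.getcolumn(cols, name))
                for name in Tables.columnnames(cols))
end

struct MyStruct
    a::Float64
    b::Float64
end

row_table = [MyStruct(1, 2), MyStruct(3, 4)]  # row-oriented input
col_table = (a = [1.0, 3.0], b = [2.0, 4.0])  # column-oriented input

from_rows = column_lengths(row_table)
from_cols = column_lengths(col_table)
```

Both calls go through the same code path in the consumer, which is the whole idea.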

Hope all that helps!
