Struggling to implement Tables.jl interface for Vector{MyStruct}

I’m trying to make Vector{MyStruct} a Tables.jl table. One thing I have in mind is being able to save a Vector{MyStruct} to a Parquet file by just calling write_parquet(file, tbl), which should be possible according to the last section of README.md in [1]: “You can write any Tables.jl column-accessible table that contains columns of these types”. My problem is that I haven’t been able to figure out how to make Vector{MyStruct} a Tables.jl column-compatible table, even though I’ve looked at numerous examples.

Now, according to the README.md of Tables.jl, [2] the only thing I should need is to define three functions:

Tables.istable(table) - Declare that your table type implements the interface
Tables.columnaccess(table) - Declare that your table type defines a Tables.columns(table) method
Tables.columns(table) - Return an Tables.AbstractColumns-compatible object from your table

(Optionally there’s a Tables.schema function which I’m also happy to implement)

The first two are straightforward enough:

using DataFrames, Tables, Parquet
struct MyStruct                                                                 
        a::Float64                                                              
        b::Float64                                                              
end
Tables.istable(::Type{<:Vector{MyStruct}}) = true                               
Tables.columnaccess(::Type{<:Vector{MyStruct}}) = true

Now regarding Tables.columns, I’m not sure what the phrase “Return an Tables.AbstractColumns-compatible object from your table” means. So I looked to see what DataFrames does:

df = DataFrame(Dict(:a=>[1.0,2.0], :b=>[2.0,3.0]))
Tables.columns(df)

This returns the DataFrame itself. I guess that makes sense: a DataFrame is already a columnar table, in the sense that I can call getproperty(df, :a) and get a vector.

So, do I need to implement Tables.columns to return something that has getproperty defined on it? If so, I expected that this would work:

Tables.columns(x::Vector{MyStruct}) = Dict(:a=>[getproperty(x[i], :a) for i in 1:length(x)], :b=>[getproperty(x[i], :b) for i in 1:length(x)])

But in fact it doesn’t:

write_parquet("/home/myuser/lala.parquet", x)
ERROR: type Nothing has no field types
Stacktrace:
 [1] getproperty(::Nothing, ::Symbol) at ./Base.jl:33
 [2] write_parquet(::String, ::Array{MyStruct,1}; compression_codec::String) at /home/myuser/.julia/packages/Parquet/g6mqp/src/writer.jl:470
 [3] write_parquet(::String, ::Array{MyStruct,1}) at /home/myuser/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [4] top-level scope at REPL[20]:1

So now I suspect that the key bit is that the thing returned must be “AbstractColumns”-compatible. So an alternative is to define a new type, MyStructTable which contains the same data as Vector{MyStruct} but as columns, and then define Tables.getcolumn and Tables.columnnames. So following [2] I’ve tried the following:

struct MyStructTable <: Tables.AbstractColumns                                  
        names::Vector{Symbol}                                                   
        lookup::Dict{Symbol, Int}                                               
        data::Vector{Vector{Float64}}                                     
end

Now I need a constructor to build MyStructTable from Vector{MyStruct}. However, the natural thing throws an error:

MyStructTable(x::Vector{MyStruct}) = MyStructTable([:a,:b], Dict(:a=>1,:b=>2), [[getproperty(x[i], :a) for i in 1:length(x)],[getproperty(x[i], :b) for i in 1:length(x)]])

So I can’t even get the constructor working:

julia> MyStructTable(x)
MyStructTable: Error showing value of type MyStructTable:
ERROR: StackOverflowError:
Stacktrace:
 [1] columnnames(::MyStructTable) at /home/myuser/.julia/packages/Tables/okt7x/src/Tables.jl:105
 [2] propertynames(::MyStructTable) at /home/myuser/.julia/packages/Tables/okt7x/src/Tables.jl:165
 ... (the last 2 lines are repeated 39990 more times)
 [79983] columnnames(::MyStructTable) at /home/myuser/.julia/packages/Tables/okt7x/src/Tables.jl:105

Help?

[1] https://github.com/JuliaIO/Parquet.jl
[2] https://github.com/JuliaData/Tables.jl/blob/master/docs/src/index.md

The constructor doesn’t work, but the idea was to then implement AbstractColumns by defining something like:

Tables.getcolumn(m::MyStructTable, ::Type{T}, col::Int, nm::Symbol) where {T} = m.data[col]
Tables.getcolumn(m::MyStructTable, nm::Symbol) = m.data[m.lookup[nm]]
Tables.getcolumn(m::MyStructTable, i::Int) = m.data[i]
Tables.columnnames(m::MyStructTable) = m.names
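
For what it's worth, the StackOverflowError above looks consistent with the two frames in the trace calling each other: propertynames on an AbstractColumns forwards to Tables.columnnames, and the default Tables.columnnames forwards back to propertynames, so showing the value before Tables.columnnames is defined recurses. AbstractColumns also forwards getproperty to Tables.getcolumn, so writing m.data inside these methods would look up a column named :data. A minimal sketch under that reading, where the accessor helpers colnames, lookup and coldata are hypothetical names introduced here and the fields are read with getfield:

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

struct MyStructTable <: Tables.AbstractColumns
    names::Vector{Symbol}
    lookup::Dict{Symbol,Int}
    data::Vector{Vector{Float64}}
end

# Hypothetical accessors: getfield bypasses the getproperty overload
# that AbstractColumns uses for column lookup.
colnames(m::MyStructTable) = getfield(m, :names)
lookup(m::MyStructTable) = getfield(m, :lookup)
coldata(m::MyStructTable) = getfield(m, :data)

# Define the interface methods before any value is shown at the REPL.
Tables.istable(::Type{MyStructTable}) = true
Tables.columnaccess(::Type{MyStructTable}) = true
Tables.columns(m::MyStructTable) = m
Tables.columnnames(m::MyStructTable) = colnames(m)
Tables.getcolumn(m::MyStructTable, i::Int) = coldata(m)[i]
Tables.getcolumn(m::MyStructTable, nm::Symbol) = coldata(m)[lookup(m)[nm]]
Tables.getcolumn(m::MyStructTable, ::Type{T}, col::Int, nm::Symbol) where {T} = coldata(m)[col]
Tables.schema(m::MyStructTable) = Tables.Schema(colnames(m), fill(Float64, length(colnames(m))))

# Outer constructor that copies the row-oriented data into columns.
MyStructTable(x::Vector{MyStruct}) =
    MyStructTable([:a, :b], Dict(:a => 1, :b => 2), [[r.a for r in x], [r.b for r in x]])

m = MyStructTable([MyStruct(1, 2), MyStruct(3, 4)])
Tables.getcolumn(m, :a)   # column :a as a Vector{Float64}
m.b                       # property access is forwarded to Tables.getcolumn
```

Note that this still copies the data once in the constructor; it only moves the allocation to an explicit conversion step.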

This is a great question and I hope it gets answered by someone more knowledgeable on implementing the Tables.jl interface than me, but your ultimate problem can be solved using StructArrays.jl.

julia> t = [MyStruct(rand(), rand()) for i in 1:100];

julia> d = StructArray(t);

julia> Tables.istable(d)
true

julia> Tables.getcolumn(d, :a);

Actually, I think I got your original attempt to work. You definitely don’t need to implement your own type just for row access. You just have to define the right methods.

julia> VM = AbstractVector{MyStruct}

julia> Tables.istable(t::VM) = true

julia> Tables.columnaccess(t::VM) = true

julia> Tables.getcolumn(t::VM, i::Int) = getproperty.(t, (:a, :b)[i])

julia> Tables.getcolumn(t::VM, nm::Symbol) = getproperty.(t, nm)

julia> Tables.columnnames(t::VM) = (:a, :b)  # restricted to VM so it does not override the fallback for every type

julia> DataFrame(t);

Hey, thanks for responding.

I knew of StructArrays. The issue I have with that is that it’s really just a hack: You’re using the StructArray constructor to do the work for you. You could equally well have used the DataFrame constructor.

The second solution looks good, but throws the same error as one of my own attempts: ERROR: type Nothing has no field types. I’m pasting below the full code in case someone wants to reproduce:

using DataFrames, Tables, Parquet                                               
                                                                                
struct MyStruct                                                                 
        a::Float64                                                              
        b::Float64                                                              
end                                                                             
                                                                                
x = [MyStruct(rand(), rand()) for i in 1:1000];                                 
                                                                                
VM = AbstractVector{MyStruct}                                                   
Tables.getcolumn(t::VM, i::Int) = getproperty.(t, (:a, :b)[i])                  
Tables.istable(t::VM) = true                                                    
Tables.columnaccess(t::VM) = true                                               
Tables.getcolumn(t::VM, nm::Symbol) = getproperty.(t, nm)                       
Tables.columnnames(t) = (:a, :b)                                                
                                                                                
write_parquet("lala.parquet", x)
ERROR: type Nothing has no field types
Stacktrace:
 [1] getproperty(::Nothing, ::Symbol) at ./Base.jl:33
 [2] write_parquet(::String, ::Array{MyStruct,1}; compression_codec::String) at /home/myuser/.julia/packages/Parquet/g6mqp/src/writer.jl:470
 [3] write_parquet(::String, ::Array{MyStruct,1}) at /home/myuser/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [4] top-level scope at REPL[33]:1

Compare with

df = DataFrame(x)
write_parquet("lala.parquet", df)

which works.

Your first solution also works:

sa = StructArray(x)
write_parquet("lala.parquet", sa)

it’s just not very pretty :smiley:

The problem is that you can’t really define a non-allocating column-access on your Vector{MyStruct}. That is to say, you shouldn’t do

Tables.columnaccess(::Type{<:Vector{MyStruct}}) = true

From the docs of Tables.columnaccess:

help?> Tables.columnaccess
  Tables.columnaccess(x) => Bool

  Check whether an object has specifically defined that it implements the
  Tables.columns function that does not copy table data [...]

But in your case the columns are not available (at least not for free, your definition allocates).

OTOH it looks like Parquet is a columnar format, so you need to convert to a column-based table somewhere. Replacing Vector{MyStruct} with StructVector{MyStruct} is a simple solution. You could probably even start working with StructVector{MyStruct} directly (it is still an AbstractVector{MyStruct} after all) and avoid the conversions; chances are your code will remain pretty much the same.
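
As a rough illustration of that suggestion, here is a small sketch assuming StructArrays.jl is installed. It only uses the StructArray constructor, indexing, property access and push!; the variable names are made up for the example:

```julia
using StructArrays

struct MyStruct
    a::Float64
    b::Float64
end

x = [MyStruct(1, 2), MyStruct(3, 4)]
sa = StructArray(x)               # one-time copy into one vector per field

sa isa AbstractVector{MyStruct}   # still behaves like a vector of MyStruct
sa[2]                             # a MyStruct rebuilt from the stored columns
sa.a                              # the stored column for field a
push!(sa, MyStruct(5, 6))         # appends to every column
```

Code that only indexes, iterates or pushes elements should not need to change, while column consumers can take sa.a and sa.b directly.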

If you are happy with converting back and forth, it might be worth discussing over at Parquet.jl whether write_parquet should convert a table without column access into one that has it, instead of throwing an error.


This is a valid point, and one I intend to explore. For my use case it’s not clear which layout is better, but yes, I’ll definitely explore that possibility.

This I don’t understand. I understand that the columns aren’t available for free. But if I’m ok with allocating for the purpose of writing to a parquet file, then why doesn’t pdeffebach’s second solution work? He defined an allocating way of accessing columns, as can be seen by the fact that Tables.columns(x) returns a named tuple of columns and Tables.getcolumn(x, :a) returns column :a.

The way I understand this, the conversion isn’t done by write_parquet, but by Tables.columns, which is defined. I feel like I’m missing something :slight_smile:
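
One possible reading of the stack trace, which I have not checked against the Parquet.jl source: the top frame is getproperty(::Nothing, ::Symbol) with the message about a field named types, which would fit the writer asking for Tables.schema on the input itself rather than on the result of Tables.columns. A small sketch of the difference, using only Tables.jl and no custom method definitions:

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

x = [MyStruct(1, 2), MyStruct(3, 4)]

# The generic fallback knows nothing about the schema of a plain Vector.
Tables.schema(x) === nothing

# After conversion, the schema is available on the resulting columns.
cols = Tables.columns(x)
sch = Tables.schema(cols)
sch.names
sch.types
```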

Thanks for posting! Hopefully I can help clarify what’s (not) needed here. I can see why this isn’t super clear, but I’ll try to point out the relevant parts of the docs along the way:

One of the key design principles of the Tables.jl interface is that providers/sources only implement what is natural, and consumers only call what’s natural. So right from the get-go, your “table” is row-oriented, so you don’t need to think about providing column-access, only row-access.

The relevant interface functions for row-access are:

Tables.istable
Tables.rowaccess
Tables.rows

Fallback definitions for Tables.istable and Tables.rowaccess say that any iterable can be assumed to be a table and provide row-access, so check: your Vector{MyStruct} is iterable, so will be assumed to be a table and provide row-access. Relatedly, the default Tables.rows definition will basically return the input, though a check will be made that the iterator does actually iterate “rows”.

So boom, your Vector{MyStruct} already satisfies the first three automatically via fallback definitions (if you didn’t want the validation check, you could define Tables.rows(x::Vector{MyStruct}) = x).

The 2nd part we need to satisfy is that our “row table” actually iterates “rows”; the relevant docs for this are here. Essentially, we need to define:

Tables.getcolumn(row, i::Int)
Tables.getcolumn(row, nm::Symbol)
Tables.columnnames(row)

But hey, let’s take a look at the default definitions:

Tables.getcolumn(row, i::Int) = getfield(row, i)
Tables.getcolumn(row, nm::Symbol) = getproperty(row, nm)
Tables.columnnames(row) = propertynames(row)

which happen to be exactly what you want for MyStruct! That is, MyStruct already satisfies the AbstractRow interface via the default definitions!
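
To make that concrete, here is a short sketch that calls the three row functions on a single MyStruct without defining any methods, so only the default definitions quoted above are in play:

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

row = MyStruct(1, 2)

Tables.columnnames(row)     # falls back to propertynames(row)
Tables.getcolumn(row, 1)    # falls back to getfield(row, 1)
Tables.getcolumn(row, :b)   # falls back to getproperty(row, :b)
```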

Wait, so Vector{MyStruct} is already a table?? By default?! Yes! Let’s see it in action:

julia> using DataFrames, Tables, Parquet
[ Info: Precompiling Parquet [626c502c-15b0-58ad-a749-f091afb673ae]

julia> struct MyStruct                                                                 
               a::Float64                                                              
               b::Float64                                                              
       end

julia> t = [MyStruct(1, 2), MyStruct(3, 4)]
2-element Array{MyStruct,1}:
 MyStruct(1.0, 2.0)
 MyStruct(3.0, 4.0)

julia> DataFrame(t)
2×2 DataFrame
│ Row │ a       │ b       │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 1.0     │ 2.0     │
│ 2   │ 3.0     │ 4.0     │

Boom! We can automatically transform Vector{MyStruct} into a DataFrame, specifically without needing to define anything “column” related for Vector{MyStruct} (and, in fact, without defining anything at all, which is by design :slight_smile: ). This works because, as noted, Tables.jl wants providers to implement only what is natural for them, and not have to jump through weird hoops or write boilerplate code just so columns and rows can talk to each other.

Tables.jl itself provides the most efficient “fallback” definitions for transforming rows => columns and vice versa. In this case specifically, Tables.jl defines a Tables.buildcolumns routine that iterates row tables and “builds up” column vectors that column consumers can use. What that means is that DataFrames.jl, as a consumer, doesn’t need to do anything different for row table inputs vs. column table inputs. All it does is call Tables.columns(x) and it will get columns back, regardless of whether the input is row or column oriented.
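
The conversion in both directions can be tried directly. This is only a sketch using Tables.columntable and Tables.rowtable, the materializing helpers that Tables.jl exports for named tuples; it does not show what DataFrames.jl does internally:

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

t = [MyStruct(1, 2), MyStruct(3, 4)]

# rows => columns: a NamedTuple with one vector per field
ct = Tables.columntable(t)

# columns => rows: a Vector of NamedTuples
rt = Tables.rowtable(ct)
```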

Now, we still have the original question of how to get this all to work with Parquet.jl. Given our Vector{MyStruct} is already a table, it should just work, right?

julia> write_parquet("/home/myuser/lala.parquet", t)
ERROR: AssertionError: Tables.columnaccess(tbl)
Stacktrace:
 [1] write_parquet(::String, ::Array{MyStruct,1}; compression_codec::String) at /home/chronos/user/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [2] write_parquet(::String, ::Array{MyStruct,1}) at /home/chronos/user/.julia/packages/Parquet/g6mqp/src/writer.jl:465
 [3] top-level scope at REPL[76]:1

Oh shoot! It seems that Parquet.jl is being a bit too opinionated about the table inputs it accepts (code here). What they should define is just tbl = Tables.columns(x) and it will “just work” for any column or row oriented input. Which I’ve proposed they do here.
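
As an illustration of that consumer pattern, here is a sketch of a made-up consumer (column_lengths is a hypothetical function, not part of Parquet.jl or Tables.jl) that starts with Tables.columns(x) and therefore accepts both row-oriented and column-oriented inputs:

```julia
using Tables

struct MyStruct
    a::Float64
    b::Float64
end

# Hypothetical column consumer: convert first, then use only the column API.
function column_lengths(x)
    cols = Tables.columns(x)
    return [nm => length(Tables.getcolumn(cols, nm)) for nm in Tables.columnnames(cols)]
end

row_input = [MyStruct(1, 2), MyStruct(3, 4)]      # row-oriented
col_input = (a = [1.0, 3.0], b = [2.0, 4.0])      # column-oriented

column_lengths(row_input)
column_lengths(col_input)
```

The trade-off raised in this thread still applies: for row-oriented input the call to Tables.columns allocates new column vectors, it just does so inside the consumer.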

Hope all that helps!


This is amazing. Thank you for taking the interest and the time :slight_smile:

I guess it says something about the quality of design of Tables.jl that (if it hadn’t been for Parquet.jl doing something weird) the user had to literally do nothing for it to work :wink: Thank you!

When I wrote the Parquet writer I did not fully understand the design of Tables.jl.

I thought the column-access check was there to warn us about tables that are not in columnar format.

I am still a little apprehensive about Tables.columns. So the user could have just done write_parquet(path, Tables.columns(mystruc))?

To me, the assertion not only emphasised that Parquet is a columnar format, it also forced the user to think about whether they have an efficient algorithm for converting from rows to columns, rather than having that conversion “hidden”.