Difference between StructArrays and TypedTables

Curious whether anyone sees major differences between these packages in practical use.

StructArrays stores data as a struct of arrays while presenting it as an array of structs.
TypedTables stores a NamedTuple of arrays and constructs the array of NamedTuples on the fly.
Under the hood, both packages use NamedTuples.
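
To make the "rows on the fly" idea concrete, here is a minimal sketch in plain Julia with no package involved. The helper name getrow is hypothetical and only illustrates the idea; neither package necessarily implements it this way.

```julia
# Columns stored as a NamedTuple of vectors (struct-of-arrays layout).
cols = (a = [1, 2, 3], b = [1.0, 2.0, 3.0])

# Hypothetical helper: assemble row i on demand as a NamedTuple.
# `map` over a NamedTuple keeps the field names.
getrow(cols, i) = map(c -> c[i], cols)

r = getrow(cols, 2)   # (a = 2, b = 2.0)
```

The row is a fresh immutable value built from the columns, so nothing row-shaped is ever stored.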

TypedTables provides nicer display overrides for prettier output, plus some additional convenience macros.

Interestingly, both packages create almost the same type for the data structure. This example is a simple table with one Int column and one Float64 column:

TypedTables:

Table{NamedTuple{(:a, :b), Tuple{Int64, Float64}}, 1, NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}}

StructArrays using a constructor of arrays:

StructVector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}, NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, Int64} (alias for StructArray{NamedTuple{(:a, :b), Tuple{Int64, Float64}}, 1, NamedTuple{(:a, :b), Tuple{Array{Int64, 1}, Array{Float64, 1}}}, Int64})

StructArrays using a constructor with a predefined struct:

StructVector{v, NamedTuple{(:a, :b), Tuple{Vector{Int64}, Vector{Float64}}}, Int64} (alias for StructArray{v, 1, NamedTuple{(:a, :b), Tuple{Array{Int64, 1}, Array{Float64, 1}}}, Int64})

That predefined struct:

struct v
       a::Int
       b::Float64
end
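
For anyone who wants to reproduce the three types above, this is roughly how I would build them. It is a sketch, not necessarily what was typed originally: the struct is renamed Row here (a stand-in for v), and the column values are made up.

```julia
using StructArrays, TypedTables

# Stand-in for the struct `v` above (renamed; the name is an assumption).
struct Row
    a::Int
    b::Float64
end

a = [1, 2]
b = [1.0, 2.0]

t   = Table(a = a, b = b)                 # TypedTables
sa1 = StructArray((a = a, b = b))         # StructArrays, NamedTuple elements
sa2 = StructArray{Row}((a = a, b = b))    # StructArrays, Row elements

eltype(t)     # NamedTuple with fields a::Int64 and b::Float64
eltype(sa1)   # the same NamedTuple type
eltype(sa2)   # Row
```

Calling typeof() on t, sa1 and sa2 should give the three types listed above, with Row in place of v.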

Basically, there is very little difference, except that StructArrays prints a more verbose type because it shows the alias as well. Starting from a struct, StructArrays can use the struct name as the shorthand for the element type. Also, the ‘1’ that appears in both the Table and the StructArray type is the number of array dimensions (both are vectors here); StructArrays additionally carries a trailing Int64 parameter that records the index type, which TypedTables does not have.

Performance for getting and setting seems to be identical. Both packages favor column semantics over row semantics, with the same caveat for updating values in a row: you must replace the whole row, NOT update an element of a row in place.
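
A small sketch of what that means in practice, assuming NamedTuple rows and made-up values. Writing into the column directly is the other route I would expect to work, since property access returns the underlying vector.

```julia
using StructArrays, TypedTables

sa = StructArray((a = [1, 2], b = [1.0, 2.0]))
t  = Table(a = [1, 2], b = [1.0, 2.0])

# Rows are immutable NamedTuples, so `sa[1].b = 9.0` would throw an error.
# Option 1: replace the whole row with a modified copy.
sa[1] = merge(sa[1], (b = 9.0,))
t[1]  = merge(t[1], (b = 9.0,))

# Option 2: write straight into the column vector.
sa.a[2] = 20
t.a[2]  = 20
```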

Both support the Tables.jl interfaces.

The only significant difference seems to be that TypedTables adds a bit more functionality to work as an “application”, while StructArrays might do a bit more to work as infrastructure, but that’s not clear.

Seems like the packages could be merged into TypedTables, which has a bit more functionality, sharing the maintenance work and offering the union of capabilities in a single package. This is a philosophical point about saving some work and reducing duplication of effort, but YMMV, and it’s no big deal to have both.

Indeed, they are pretty similar, and there’s also TupleVectors.jl (see “Differences with StructArrays.jl?”, Issue #21 in cscherrer/TupleVectors.jl on GitHub; I don’t really understand these differences though).

As I understand it, the main difference between StructArrays and TypedTables is that StructArrays present themselves as plain arrays, with the SoA storage being an implementation detail. They can contain elements of any type, not just NamedTuples, and their show() methods produce the same output as plain arrays.
Meanwhile, TypedTables are flat tables. They only support NamedTuples as element types, their show() methods behave very differently from those of arrays, and they also overload functions like map() to return TypedTables instead of arrays.

So it’s unlikely that the two can be merged. What seems entirely possible is to use SA.jl as the backend for TT.jl. Maybe the wrapper implementation would be significantly simpler? I don’t know.


But it looks like SA defaults to NamedTuples, even when the input type is defined as a struct.

Not sure what you mean by TT being only flat tables. The result types are essentially identical for SA and TT, so they must both be flat tables or neither is.

I have had column types (if that is what you mean by element types, but I think you mean row types?) in TT as arrays, and there is no reason a Dict would not work.

It would be interesting to try a Dict of columns, but this would defeat the purpose: it would have to be Dict{?, Any}, which would be slow as heck and break type stability.
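
A quick plain-Julia check of why I expect that. The variable names are made up; the point is only what the compiler can know about a column's type in each container.

```julia
dict_cols = Dict{Symbol,Any}(:a => [1, 2], :b => [1.0, 2.0])
nt_cols   = (a = [1, 2], b = [1.0, 2.0])

# With the Dict, every lookup is typed as Any, so the column type
# is only discovered at run time.
dict_valtype = valtype(dict_cols)               # Any

# With the NamedTuple, each column's concrete type is part of the
# container's type and is visible to the compiler.
nt_coltype = fieldtype(typeof(nt_cols), :a)     # Vector{Int64}
```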

Having the row types come out as mutable Dicts would be convenient, but both SA and TT don’t “transmit” row element changes to the columns, so there’d be no advantage. In both cases you have to replace a row (and the replaced row can obviously contain different values as long as they are of the appropriate types).

I’ll look at TupleVectors to see if it is in any way different or has performance benefits. For my use, the only benefits might be a simpler type definition and more flexible row operations. But, I certainly get by with TT as it is.

What do you mean “defaults to namedtuples”? One can make a StructArray of arbitrary structs:

julia> struct S
       a::Int
       end

julia> SA = StructArray([S(1), S(2)])
2-element StructArray(::Vector{Int64}) with eltype S:
 S(1)
 S(2)

julia> SA[1]
S(1)

julia> SA.a
2-element Vector{Int64}:
 1
 2

By “flat tables” I mean 1-d arrays only, and with elements being namedtuples only. Meanwhile, StructArrays can be n-dim arrays of arbitrary structs:

julia> StructArray([S(1) S(2); S(3) S(4)])
2×2 StructArray(::Matrix{Int64}) with eltype S:
 S(1)  S(2)
 S(3)  S(4)

This isn’t possible with TT.jl.

By defaulting to NamedTuples I mean that if you call typeof(), the results look the same for both packages.

I’ll look more carefully, but my use case is discrete simulation in which I update 4-6 columns while looping over the row index. Most columns are Int or enum, and a few are vectors of Symbols or vectors of Ints. I never actually extract an entire row, so the way these packages lazily assemble or cache rows doesn’t matter.

Best performance comes from passing the column vectors directly to the code that updates the data, because dereferencing the outer container is actually quite slow.

In fact, a vector of vectors might be the best option, using an enum as the index of the outer vector to provide an intuitive column name, and overriding the Base indexing with Int(col_enum). While this seems like it would be faster than dereferencing a NamedTuple or a struct, it would not be faster than a function barrier that passes the columns.
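
A rough sketch of what I have in mind, in plain Julia. All names here (Col, STATUS, TALLY, Columns, step!) are made up for illustration, and I wrap the outer vector in a small struct so that the getindex overload does not touch Base's methods for Vector.

```julia
# Enum values double as readable column names and as integer positions.
@enum Col STATUS=1 TALLY=2

# Thin wrapper around a vector of column vectors.
struct Columns
    data::Vector{Vector{Int}}
end

# Index the outer vector by enum value.
Base.getindex(c::Columns, col::Col) = c.data[Int(col)]

# Function barrier: the kernel receives concretely typed vectors and
# never touches the outer container inside the loop.
function step!(status::Vector{Int}, tally::Vector{Int})
    for i in eachindex(status, tally)
        if status[i] == 0
            tally[i] += 1
        end
    end
    return nothing
end

cols = Columns([[0, 1, 0], [10, 20, 30]])
step!(cols[STATUS], cols[TALLY])
cols[TALLY]   # [11, 20, 31]
```

The container lookup happens once per call rather than once per row, which is why the barrier should dominate whatever the outer container is.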