I’ve had several people direct message me over the last couple of days asking for details about a new package I’ve been working on, Tables.jl, so I thought I’d write up a quick post on what it is, where it came from, and where it’s going.
There’s been talk for several years now of having an “AbstractTables.jl” package that would define the definitive common table interface for Julia. Most efforts have died out pretty quickly or lost traction and were abandoned (I count at least 5 of these efforts when googling “AbstractTables.jl”). As the primary author of several data-table-format related packages myself, I’d been interested in these efforts, but not necessarily pushing for them. If they happened and worked out, great, I’d use them, but it wasn’t a top priority.
After JuliaCon 2015, however, I became motivated to create a kind of “table interface”, to help manage the various integrations of data packages that were cropping up (SQLite.jl, CSV.jl, DataFrames.jl, ODBC.jl, MySQL.jl, Feather.jl, etc.). To allow them to all talk to each other (i.e. read in any format, read out to any format), there had to be some kind of common interface these packages could speak, to avoid having to define 1-to-N IO operations for each package. Thus DataStreams.jl was born; it focused on low-level interfaces for `Data.Source` and `Data.Sink` types that could optimize for performance, no matter the data format. While not the simplest interface to implement, it was a powerful set of abstractions that allowed a number of packages to “hook in” to talking with each other w/ a focus on performance.
About a year later, @davidanthoff first released Query.jl, which aimed to provide LINQ-like iterator transforms and a custom “data DSL” for powerful data manipulation in Julia. While the Query.jl framework generically operates on any iterator, a prime focus is on performing transforms on table-like structures (e.g. grouping a DataFrame). So as a sort of sub-interface to Query.jl, IterableTables.jl was born to help table types have a common interface for interacting w/ the Query framework. The IterableTables interface (now technically TableTraits.jl) is extremely simple: define a way to iterate NamedTuples for your table and that’s it for getting into the Query framework. To act as a sink was a tad more involved, but again, it was really just a matter of defining a way to construct your table type from an iterator of NamedTuples.
Fast forward to 2018: David and I have had several discussions on how to perhaps join forces in the “table interfaces” world to make things easier on the data ecosystem. Instead of needing to implement DataStreams and/or TableTraits, it’d be ideal for there to be just one, common table interface that allowed the benefits of any number of packages. Enter Tables.jl.
Tables.jl provides a common table interface, very similar in spirit to TableTraits, but with a few twists and tweaks to allow better low-level optimizations, in particular by utilizing some brand new features in Julia 1.0. The interface itself is simple:
A table type implements:
- `Tables.schema(x::MyTableType)` that returns a `NamedTuple{names, types}` that describes the column names and element types of the table
- `Tables.rows(x)` and/or `Tables.columns(x)` to return the table’s data as a `Row`-iterator, or collection of columns
- If the table type is columnar (i.e. is more naturally accessed column-by-column than row-by-row), then it can also define `Tables.AccessStyle(x::MyTableType) = Tables.ColumnAccess()`
And that’s it on the source side. Now, it’s worth clarifying the expected return types of `Tables.rows(x)` and `Tables.columns(x)`. For `Tables.rows(x)`, I said the return type is a `Row`-iterator. What is a `Row`? It’s any type where values are dot-accessible via `getproperty(obj, name::Symbol)`. So…literally any `struct` or `mutable struct` in Julia. Also NamedTuples implicitly satisfy this. The reason for generalizing the iterator type from NamedTuple (like TableTraits) to a more abstract `Row` is that it actually allows pretty significant optimizations by not requiring a full row to be materialized. Two examples of this:
- DataFrames.jl already defines a `DataFrameRow` type which simply wraps the original DataFrame, plus an `Int` which describes which row this `DataFrameRow` represents. This is extremely efficient because it essentially acts as a 1-row “view” into the original DataFrame and while iterating, there are essentially no extra allocations. If a DataFrame was required to fully materialize a new NamedTuple w/ each row iteration, it would be much more costly for not much more benefit. For `Tables.rows(x)` users, in conjunction with the output of `Tables.schema(x)`, they fully know the available properties that can be called directly on the `Row` object itself.
- CSV.jl: a long-standing desire of mine was to figure out a way to very efficiently allow reading only a subset of columns from a CSV file. There are several nuances that make this tricky in general, but with the new `Tables.rows(x)` `Row`-iteration, and a modified file parsing initialization, it can now be possible! The optimization comes into play in that a `CSV.File` object can iterate a similar kind of “row view”, which doesn’t require any text parsing or materialization of values until an explicit `getproperty(row, name)` is called. So for a 500-column wide dataset that is used like `CSV.File("very_wide_dataset.csv") |> @map({_.column1, _.column2}) |> DataFrame`, columns `column1` and `column2` are literally the only columns where values are parsed from the entire file. This can lead to some very substantial performance gains.
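To make the “row view” idea concrete, here is a minimal sketch in plain Julia (no packages). The names `RowView` and `eachrowview` are hypothetical and made up for illustration; this is not how DataFrames.jl or CSV.jl actually implement their row types, just the general pattern of deferring value lookup until `getproperty` is called:

```julia
# Hypothetical row view: holds a reference to the column storage plus a row index.
# No values are copied or materialized when the view is created.
struct RowView{T<:NamedTuple}
    columns::T   # a NamedTuple of vectors, e.g. (id = [...], name = [...])
    row::Int
end

# getproperty is overloaded below, so the struct's own fields must be read with getfield.
Base.getproperty(r::RowView, name::Symbol) =
    getfield(getfield(r, :columns), name)[getfield(r, :row)]

# Advertise the column names as the available properties.
Base.propertynames(r::RowView) = keys(getfield(r, :columns))

# Lazily iterate one RowView per row (assumes all columns have equal length).
eachrowview(cols::NamedTuple) = (RowView(cols, i) for i in 1:length(first(cols)))

cols = (id = [1, 2, 3], name = ["a", "b", "c"])

# Only the `id` column is ever indexed here; `name` is never touched.
ids = [r.id for r in eachrowview(cols)]
```

A consumer that only asks for `r.id` never pays for the other columns, which is the same property the CSV.jl example above relies on.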
(It’s worth noting that relying on this `Row`-iteration concept would not have been possible pre-1.0; two notable features that are paramount to making this work are `getproperty` overloading itself, and the power of constant propagation through function calls, not to mention builtin NamedTuples, temporary struct elimination, Union codegen and storage optimizations, and many more. Many kudos to @jameson, @Keno, @stevengj, @jeff.bezanson and others for making features like these possible to further the data ecosystem in Julia.)
For `Tables.columns(x::MyTableType)`, I said the return type is a collection of columns. More specifically, it’s a type where columns, like values for `Row`, are dot-accessible via `getproperty(columns, columnname::Symbol)`. A “column” can be any iterator, AbstractVector, etc. One example of a valid return type is a NamedTuple of AbstractVectors, like `(column1=[1,2,3], column2=[4,5,6])`. Another example is a DataFrame itself, since it defines `getproperty(dataframe, columnname)` as a part of its explicit API, so it could literally implement the interface by doing `Tables.columns(x::DataFrame) = x`.
On the “sink” side of things (i.e. how can I make my own table type from any other table type), we again follow the spirit of IterableTables, while relying on the power of the explicit interface to help. You simply define a function or constructor that takes any table input (like `MyTableType(source)`), and then use the explicit interface functions to create your table type. The powerful thing to note here is that one need not really worry about how the input source table defines `Tables.AccessStyle`, because optimized generic fallbacks are provided for `Tables.rows` and `Tables.columns`. That means that if my table type is columnar, I can just call `Tables.columns` to get back a valid collection of columns, even if the input table is `Tables.RowAccess()`, because a generic `Tables.columns(row_iterator)` is defined to give me columns. Similarly, if my table type is row-based, I can call `Tables.rows(x)`, and rows will be efficiently iterated, even for `Tables.ColumnAccess()` tables. How simple is it, you say? Well, the DataFrames “sink” constructor is literally defined like `DataFrame(x::Any) = DataFrame([collect(u) for u in Tables.columns(x)], collect(Tables.names(Tables.schema(x))))`.
Due to the simplicity and genericity of the interfaces, some users may note that there are two “builtin” table types to Julia itself in the forms of a `Vector{<:NamedTuple}` and a `NamedTuple{names, Tuple{Vararg{<:AbstractVector}}}`, i.e. a Vector of NamedTuples or a NamedTuple of Vectors. In fact, the Tables.jl interface is defined for these two types as a sort of reference implementation here, and these types can actually be handy sometimes for interactive work. Tables.jl exports `rowtable` and `columntable` functions that will take any Tables.jl-source type and construct a Vector of NamedTuples or NamedTuple of Vectors, respectively. Do note that these types are explicitly materializing everything, so may not be suitable for optimization-sensitive needs, but should be useful for packages wishing to test interface implementations.
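As a rough illustration of how these two “builtin” shapes relate, the sketch below converts between them using only Base Julia. The helpers `to_rowtable` and `to_columntable` are hypothetical stand-ins written for this example; they are not the Tables.jl `rowtable`/`columntable` implementations, and they assume a non-empty table whose columns all have the same length:

```julia
# NamedTuple of Vectors -> Vector of NamedTuples (materializes every row).
function to_rowtable(cols::NamedTuple)
    names = keys(cols)
    nrows = length(first(cols))
    return [NamedTuple{names}(map(c -> c[i], values(cols))) for i in 1:nrows]
end

# Vector of NamedTuples -> NamedTuple of Vectors (materializes every column).
# Column names are taken from the first row, so `rows` must be non-empty.
function to_columntable(rows::AbstractVector{<:NamedTuple})
    names = keys(first(rows))
    return NamedTuple{names}(map(n -> [getproperty(r, n) for r in rows], names))
end

ct = (a = [1, 2, 3], b = ["x", "y", "z"])
rt = to_rowtable(ct)        # rt[2] is the NamedTuple (a = 2, b = "y")
roundtrip = to_columntable(rt)

# Both shapes support the dot-access described above:
second_b = rt[2].b          # value access on a row
col_a = roundtrip.a         # column access on a collection of columns
```

Both directions copy all of the data, which matches the caveat above about these types materializing everything.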
One technical issue a common interface brings up, in particular with regards to Query.jl, is how missing values are handled. Certain transforms in Query.jl require explicit type information so that transforms are inferable over iterations. Traditionally, this has been handled by requiring, via the IterableTables interface, that NamedTuple-iterators that may have missing values actually need to iterate NamedTuples of DataValues. In Base and much of the rest of the data ecosystem, however, a more natural representation for missing values is the `Missing` type in conjunction with Julia’s builtin Union types (like `Union{Int, Missing}`). Luckily, the design of Tables.jl makes it such that it’s extremely cheap and efficient to wrap any `Row` iteration in DataValues on demand (i.e. when needed by Query.jl), and even unwrap those DataValues when collected into sinks. This is all builtin to the Tables.jl package through optional require dependencies and “just works” without users needing to do anything explicit.
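For readers less familiar with the `Missing` representation mentioned above, here is a small Base-only sketch of what a column with a `Union{Int, Missing}` element type looks like. The variable names are made up for illustration, and the DataValues wrapping done for Query.jl is not shown:

```julia
# A column with a missing entry: Julia infers a Union element type automatically.
ages = [31, missing, 47]
T = eltype(ages)                 # Union{Missing, Int}

# Common ways to work around the missing entries:
total = sum(skipmissing(ages))   # skip them
filled = coalesce.(ages, 0)      # or replace them with a default value

# Operations on `missing` propagate rather than erroring.
propagated = ages[2] + 1
```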
So what’s the status of Tables.jl? It’s currently in the process of being registered. Check out the repository here, which includes full docs and tests w/ examples. There are several packages w/ WIP branches to use it (CSV.jl, DataFrames.jl, JuliaDB via IndexedTables.jl, SQLite.jl, ODBC.jl, TypedTables.jl, to name a few).
So go give it a spin! If you do `Pkg.develop("https://github.com/JuliaData/Tables.jl")`, it will clone the raw repo, and if you `cd` to the cloned directory and do `julia --project`, `Pkg.instantiate()`, it will activate and instantiate a custom Tables.jl environment that includes most of the WIP implementations mentioned above. Feel free to report issues or questions/ideas on the GitHub repository (or this thread), and happy tabling everyone!