Tables.jl: a table interface for everyone


#1

I’ve had several people direct message me over the last couple of days asking for details about a new package I’ve been working on: Tables.jl, so I thought I’d write up a quick post on what it is, where it came from, and where it’s going.

There’s been talk for several years now of having an “AbstractTables.jl” package that would define the definitive common table interface for julia. Most efforts have died out pretty quickly or lost traction and were abandoned (I count at least 5 of these efforts when googling “AbstractTables.jl”). As the primary author of several data-table-format related packages myself, I’d been interested, but not necessarily pushing for these efforts. If they happened and worked out, great, I’d use them, but it wasn’t a top priority.

After JuliaCon 2015, however, I became motivated to create a kind of “table interface”, to help manage the various integrations of data packages that were cropping up (SQLite.jl, CSV.jl, DataFrames.jl, ODBC.jl, MySQL.jl, Feather.jl, etc.). To allow them to all talk to each other (i.e. read in any format, read out to any format), there had to be some kind of common interface these packages could speak, to avoid having to define 1-to-N IO operations for each package. Thus DataStreams.jl was born; it focused on low-level interfaces for Data.Source and Data.Sink types that could optimize for performance, no matter the data format. While not the simplest interface to implement, it was a powerful set of abstractions that allowed a number of packages to “hook in” to talking with each other w/ a focus on performance.

About a year later, @davidanthoff first released Query.jl, which aimed to provide LINQ-like iterator transforms and a custom “data DSL” for powerful data manipulation in Julia. While the Query.jl framework generically operates on any iterator, a prime focus is on performing transforms on table-like structures (e.g. grouping a DataFrame). So as a sort of sub-interface to Query.jl, IterableTables.jl was born to help table types have a common interface for interacting w/ the Query framework. The IterableTables interface (now technically TableTraits.jl), is extremely simple: define a way to iterate NamedTuples for your table and that’s it for getting in to the Query framework. To act as a sink was a tad more involved, but again, it was really just a matter of defining a way to construct your table type from an iterator of NamedTuples.

Fast forward to 2018: David and I have had several discussions on how to perhaps join forces in the “table interfaces” world to make things easier on the data ecosystem. Instead of needing to implement DataStreams and/or TableTraits, it’d be ideal for there to be just one, common table interface that allowed the benefits of any number of packages. Enter Tables.jl.

Tables.jl provides a common table interface, very similar in spirit to TableTraits, but with a few twists and tweaks to allow better low-level optimizations, in particular by utilizing some brand new features in Julia 1.0. The interface itself is simple:

A table type implements:

  • Tables.schema(x::MyTableType) that returns a NamedTuple{names, types} that describes the column names and element types of the table
  • Tables.rows(x) and/or Tables.columns(x) to return the table’s data as a Row-iterator, or collection of columns
  • If the table type is columnar (i.e. is more naturally accessed column-by-column than row-by-row), then it can also define Tables.AccessStyle(x::MyTableType) = Tables.ColumnAccess()
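Put together, the source side for a hypothetical columnar type might look like the sketch below. To keep the snippet runnable standalone (the package isn’t registered yet), the three interface functions are stubbed in a local module; with the real Tables.jl you would extend Tables.schema, Tables.columns, and Tables.AccessStyle directly, and MyTable is purely illustrative:

```julia
# Local stand-in for the interface functions described above.
module TablesSketch
    struct RowAccess end
    struct ColumnAccess end
    AccessStyle(x) = RowAccess()   # default assumption: row access
    function schema end
    function columns end
end

# A hypothetical columnar table type
struct MyTable
    col1::Vector{Int}
    col2::Vector{Float64}
end

# The three declarations a columnar source would make:
TablesSketch.schema(t::MyTable) = NamedTuple{(:col1, :col2), Tuple{Int, Float64}}
TablesSketch.AccessStyle(::MyTable) = TablesSketch.ColumnAccess()
TablesSketch.columns(t::MyTable) = (col1 = t.col1, col2 = t.col2)

t = MyTable([1, 2, 3], [1.5, 2.5, 3.5])
TablesSketch.columns(t).col2   # [1.5, 2.5, 3.5]
```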

And that’s it on the source side. Now, it’s worth clarifying the expected return types of Tables.rows(x) and Tables.columns(x). For Tables.rows(x), I said the return type is a Row-iterator. What is a Row? It’s any type where values are dot-accessible via getproperty(obj, name::Symbol). So…literally any struct or mutable struct in julia. Also NamedTuples implicitly satisfy this. The reason for generalizing the iterator type from NamedTuple (like TableTraits) to a more abstract Row is that it actually allows pretty significant optimizations by not requiring a full row to be materialized. Two examples of this:

  • DataFrames.jl already defines a DataFrameRow type which simply wraps the original DataFrame, plus an Int which describes which row this DataFrameRow represents. This is extremely efficient because it essentially acts as a 1-row “view” into the original DataFrame, and while iterating, there are essentially no extra allocations. If a DataFrame were required to fully materialize a new NamedTuple w/ each row iteration, it would be much more costly for not much more benefit. Users of Tables.rows(x), in conjunction with the output of Tables.schema(x), fully know the available properties that can be called directly on the Row object itself.
  • CSV.jl: a long-standing desire of mine was to figure out a way to very efficiently allow reading only a subset of columns from a CSV file. There are several nuances that make this tricky in general, but with the new Tables.rows(x) Row-iteration, and a modified file parsing initialization, it’s now possible! The optimization comes into play in that a CSV.File object can iterate a similar kind of “row view”, which doesn’t require any text parsing or materialization of values until an explicit getproperty(row, name) is called. So for a 500-column-wide dataset that is used like CSV.File("very_wide_dataset.csv") |> @map({_.column1, _.column2}) |> DataFrame, columns column1 and column2 are literally the only columns whose values are parsed from the entire file. This can lead to some very substantial performance gains.
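To make the “row view” idea concrete, here is a minimal standalone sketch in the spirit of DataFrameRow — illustrative only, not the actual DataFrames.jl or CSV.jl implementation:

```julia
# A table stored as columns.
struct ColumnStore
    data::NamedTuple
end

# A "row" that materializes nothing: just a reference plus an index.
struct RowView
    store::ColumnStore
    i::Int
end

# Dot access pulls a single cell out of the parent columns on demand.
# (getfield is used internally because getproperty itself is overloaded.)
Base.getproperty(r::RowView, name::Symbol) =
    getfield(r, :store).data[name][getfield(r, :i)]
Base.propertynames(r::RowView) = propertynames(getfield(r, :store).data)

store = ColumnStore((a = [1, 2, 3], b = [10.0, 20.0, 30.0]))
row = RowView(store, 2)
row.a   # 2
row.b   # 20.0
```

Iterating RowViews over the store allocates essentially nothing per row, which is exactly the optimization the two bullets above rely on.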

(It’s worth noting that relying on this Row-iteration concept would not have been possible pre-1.0; two notable features that are paramount to making this work are getproperty overloading itself and the power of constant propagation through function calls, not to mention builtin NamedTuples, temporary struct elimination, Union codegen and storage optimizations, and many more. Many kudos to @jameson, @Keno, @stevengj, @jeff.bezanson and others for making features like these possible to further the data ecosystem in Julia.)

For Tables.columns(x::MyTableType), I said the return type is a collection of columns. More specifically, it’s a type where columns, like values for Row, are dot-accessible via getproperty(columns, columnname::Symbol). A “column” can be any iterator, AbstractVector, etc. One example of a valid return type is a NamedTuple of AbstractVectors, like (column1=[1,2,3], column2=[4,5,6]). Another example is a DataFrame itself, since it defines getproperty(dataframe, columnname) as part of its explicit API, so it could literally implement the interface by doing Tables.columns(x::DataFrame) = x.

On the “sink” side of things (i.e. how can I make my own table type from any other table type), we again follow the spirit of IterableTables, while relying on the power of the explicit interface to help. You simply define a function or constructor that takes any table input (like MyTableType(source)), and then use the explicit interface functions to create your table type. The powerful thing to note here is that one need not really worry about how the input source table defines Tables.AccessStyle, because optimized generic fallbacks are provided for Tables.rows and Tables.columns. That means that if my table type is columnar, I can just call Tables.columns to get back a valid collection of columns, even if the input table is Tables.RowAccess(), because a generic Tables.columns(row_iterator) is defined to give me columns. Similarly, if my table type is row-based, I can call Tables.rows(x), and rows will be efficiently iterated, even for Tables.ColumnAccess() tables. How simple is it, you say? Well, the DataFrames “sink” constructor is literally defined like DataFrame(x::Any) = DataFrame([collect(u) for u in Tables.columns(x)], collect(Tables.names(Tables.schema(x)))).
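The sink pattern can be sketched standalone against the “dot-accessible collection of columns” shape described above; in a real package the cols argument would come from Tables.columns(source), and MyColumnTable here is purely hypothetical:

```julia
# A hypothetical sink type: column names plus materialized column vectors.
struct MyColumnTable
    names::Vector{Symbol}
    columns::Vector{AbstractVector}
end

# Construct from anything whose columns are dot-accessible -- the shape
# Tables.columns is described as returning above.
function MyColumnTable(cols)
    names = collect(propertynames(cols))
    MyColumnTable(names, AbstractVector[collect(getproperty(cols, n)) for n in names])
end

# A NamedTuple of vectors stands in for the output of Tables.columns(source):
src = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])
t = MyColumnTable(src)
t.names        # [:a, :b]
t.columns[1]   # [1, 2, 3]
```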

Due to the simplicity and genericity of the interfaces, some users may note that there are two “builtin” table types in Julia itself in the forms of a Vector{<:NamedTuple} and a NamedTuple{names, Tuple{Vararg{<:AbstractVector}}}, i.e. a Vector of NamedTuples or a NamedTuple of Vectors. In fact, the Tables.jl interface is defined for these two types as a sort of reference implementation here, as well as the fact that these types can actually be handy sometimes for interactive work. Tables.jl exports rowtable and columntable functions that will take any Tables.jl-source type and construct a Vector of NamedTuples or NamedTuple of Vectors, respectively. Do note that these types explicitly materialize everything, so they may not be suitable for optimization-sensitive needs, but should be useful for packages wishing to test interface implementations.
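Conceptually (the real rowtable/columntable functions go through the interface and are more careful about types), the two reference shapes convert like this:

```julia
# Vector of NamedTuples -- the "rowtable" shape
rows = [(a = 1, b = "x"), (a = 2, b = "y")]

# NamedTuple of Vectors -- the "columntable" shape: gather each property
cols = (a = [r.a for r in rows], b = [r.b for r in rows])

# ...and back again, one NamedTuple per index
rows2 = [(a = cols.a[i], b = cols.b[i]) for i in eachindex(cols.a)]
rows2 == rows   # true
```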

One technical issue a common interface brings up, in particular with regards to Query.jl, is how missing values are handled. Certain transforms in Query.jl require explicit type information so that transforms are inferable over iterations. Traditionally, this has been handled by requiring, via the IterableTables interface, that NamedTuple-iterators that may have missing values actually need to iterate NamedTuples of DataValues. In Base and much of the rest of the data ecosystem, however, a more natural representation for missing values is the Missing type in conjunction with Julia’s builtin Union types (like Union{Int, Missing}). Luckily, the design of Tables.jl makes it such that it’s extremely cheap and efficient to wrap any Row iteration in DataValues on demand (i.e. when needed by Query.jl), and even unwrap those DataValues when collected into sinks. This is all builtin to the Tables.jl package through optional require dependencies and “just works” without users needing to do anything explicit.
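Why the on-demand wrapping is so cheap can be shown with a toy standalone sketch: a wrapper row whose getproperty converts missing only at access time. Here nothing is used as a stand-in for an empty DataValue; the real machinery uses DataValues.jl types and is more careful about element types:

```julia
# Wrap any Row lazily; conversion happens only when a field is accessed.
struct WrappedRow{R}
    row::R
end

convertvalue(x) = x
convertvalue(::Missing) = nothing   # stand-in for DataValues' empty DataValue

Base.getproperty(w::WrappedRow, n::Symbol) =
    convertvalue(getproperty(getfield(w, :row), n))

row = WrappedRow((a = 1, b = missing))
row.a             # 1
row.b === nothing # true
```

No per-row allocation is needed: the wrapper is a zero-cost shell around whatever Row the source iterates.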

So what’s the status of Tables.jl? It’s currently in the process of being registered. Check out the repository here, which includes full docs and tests w/ examples. There are several packages w/ WIP branches to use it (CSV.jl, DataFrames.jl, JuliaDB via IndexedTables.jl, SQLite.jl, ODBC.jl, TypedTables.jl, to name a few).

So go give it a spin! If you do Pkg.develop("https://github.com/JuliaData/Tables.jl"), it will clone the raw repo; if you then cd to the cloned directory, start julia --project, and run Pkg.instantiate(), it will activate and instantiate a custom Tables.jl environment that includes most of the WIP implementations mentioned above. Feel free to report issues or questions/ideas on the github repository (or this thread), and happy tabling everyone!


#2

Is the long-term goal that this obsoletes DataStreams.jl? Functionality seems like a superset of that currently provided by DataStreams, but I wanted to make sure DataStreams didn’t have utility that I was missing.


#3

What is the status of the design here? Are you asking for feedback on the design, or is this done at this stage?

There are a lot of things that I would do differently with a common table interface (mostly based on my experience of integrating pretty much every table like thing that exists as part of Queryverse.jl over the last two years). I think if this is meant to unify the table story in this space it would be good to give this more time, collect more feedback, iterate on the design etc. before this is pushed out into the wild and into packages. I have very little time in the next two weeks or so, unfortunately, so more time is from my point of view probably the main thing here…


#4

It would certainly be more helpful to expound on the “things that I would do differently”; particularly since Tables.jl takes largely its entire design from TableTraits itself (indeed, our conversations together were the inspiration for the package!).

At this point, I feel pretty confident in the overall design as a foundation; I don’t expect this core design/interface to change at all, ever. Are further interface options possible? Certainly. I’m sure we’ll encounter use-cases or performance opportunities that lead to new ideas. But the raw simplicity of the interface (Tables.rows and Tables.columns) seems fundamental enough that anything else can be additions to these core methods.

But always open to feedback and refining of definitions and such as a few open issues at the Tables.jl repo have already started doing (https://github.com/JuliaData/Tables.jl/issues).


#5

This reads super exciting. But, given that this is an interface among different table types, could you help me understand: what is the utility of having different table types in the first place? Is this just intended to be a way of creating coherence in a period where different implementations are tried out and compete for dominance, to end with a single table type - or does Julia actually need different types of data table in the long run?


#6

It certainly needs different table types! “table” data exists in so many different forms in the world: csv, databases, custom software formats (spss, sas), parquet, feather, arrow, avro, etc. in addition to the various in-memory structures across languages (data.frame in R, pandas’ DataFrame, etc.). It’s ideal if all of those formats “speak the same language” so that many-to-many integrations are easy & painless.

Even apart from the “IO” story, I think it’s healthy for the ecosystem to foster the development of different in-memory approaches to tabular data. Some table formats will be optimized for different use-cases.

A unified Tables.jl interface helps all of that by letting all these formats & table structures speak the same language so that things like Query.jl “just work”, no matter the input/output type.


#7

Thanks! :smiley: Yes I see the usefulness of having database connections able to coexist with DataFrames; or, more generally, persistent, distributed data alongside more dynamic in-memory data. But aren’t CSV, SPSS, SAS, feather, parquet etc. rather storage formats that could all be represented in a Julia session as e.g. a DataFrame?

Oh, I guess if you read feather line-by-line from a file you’d want to be able to process it without copying it into a DataFrame. I get it now, thanks!


#8

And many interesting operations can be applied without ever materializing the dataset: for example, all techniques from OnlineStats can in principle be applied while the data is being read a line at a time (and there are a lot of options: taking some summary statistics, fitting a distribution, fitting a GLM…).


#9

We also don’t want to get stuck like R is with a single data representation that we cannot move away from. The built-in dataframe type has served R very well but today it is increasingly common to need to work on data sets that don’t fit in memory on a single machine. The Tables design allows generic table code to work just as well on a distributed out-of-core JuliaDB table as it does on an in-memory DataFrame instance, which is extremely powerful. If someone else has some other clever way to represent tabular data, they can implement the Table abstraction and focus on making the representation as good as possible without having to recreate all the operations people want out of tabular things in general.


#10

I would take it even further. We don’t want to stick with a single data representation at all, for data in general. For tables, storing by columns is faster for ML, which wants to operate on columns, while storing by rows can be more useful for online statistics. Parallelization optimizes for large datasets, but can get in the way of simpler serial designs.

It’s not just tables. For arrays, we don’t want to stick to a single data representation. The standard of course is contiguous arrays in CPU memory, but using GPU or TPU memory is helpful, or using non-contiguous arrays via types in RecursiveArrayTools.jl is better when you need to add/remove portions, but worse in other cases. For arrays, there’s the struct of array format and the array of structs. There’s also numbers. Float32, Float64, BigFloat, etc.: we don’t want to be stuck to a single representation of a number.

But the point of type-dispatch designs is that you only care about actions, not about representation. DiffEq and LightGraphs are really making use of this and letting the user choose the types (the data representation) since it really is a choice and every representation comes with its own engineering tradeoffs. But by having a common interface, a common high level API, you can write algorithms which are independent of representation. With traits, you can query about details of the representation (is it column-major or row major?) and optimize your algorithm to the cases without having to commit to a specific one.

I for one am happy to keep following JuliaDB to see if it diverges from the standard DataFrame route to really take advantage of parallelism to the max, and see what kinds of engineering tradeoffs it has to make to do so.


#11

Thanks all three for these points - I see the design landscape and future of tables in Julia more clearly now. It’s particularly important to me, when I’m teaching students and colleagues, to know that I should prepare them for a world with multiple table implementations, rather than talking about the table-format-to-end-them-all to come in the future. It’ll steepen the learning curve, but of course one could just pretend DataFrames is all there is until the need for something else arises, and I see that it comes with advantages in the longer run.


#12

Alright, here is my current thinking about the Tables.jl design and what I would do differently:

  1. One issue is that right now there is an assumption baked into it that a client can ask for the schema of a source before it starts to fetch data. That works for some parts of the table ecosystem, but not for others (Query.jl in particular). I’m just repeating this point here for those that haven’t followed the discussion in other places, I know that @quinnj is aware of this and thinking about a solution. I also have one potential solution for this, see below [actually, I’m running out of time today, so I’m not elaborating on this point in this post. In any case, I have an idea :wink: ].
  2. The following is a broad point: I think there are many more useful tabular interfaces than just one row and column interface. For example, TableTraits.jl currently has two column interfaces that are different from the column interface in Tables.jl. I think all three of these column interfaces are useful, in their own way and for specific scenarios. I have also been thinking about a whole bunch of additional interfaces for Queryverse that would be useful in certain other contexts. So I think we should shoot for a design that can accommodate a large number of interfaces, where different folks can add new interfaces, where we can experiment with new interfaces, phase them in easily, phase them out if they turn out to be a bad idea etc., without EVER breaking the basic interop story in a TableTraits.jl/IterableTables.jl style. So this is a bit of a meta-point, and I’ll come back to what that implies in my mind below. But one thing that makes this kind of world much, much easier is if these interfaces are completely independent from each other and have no dependency on each other. If I want to use interface A, and there are also ten other interfaces around, I should be able to only deal with A, and nothing else.
  3. So if we adopt such a multi-interface model, then I think instead of having something like AccessStyle, we should have one trait function per interface that allows a client to query a source whether it supports a specific interface. So say for each interface we have a function called supports_interface_A etc. One could use these in a pretty similar way as AccessStyle, but the benefit would be that they are all independent from each other. So we would have supports_row_interface, supports_column_interface, supports_davids_other_column_interface etc. functions. A source would return true for whatever interfaces it implements. I think this makes it much easier to add new interfaces: you just add another one of these trait functions, which breaks nothing. On the other hand, if there is an AccessStyle function that is shared between the row and column interface, then if you want to add a new third interface, things get a bit tricky. You can’t add that to AccessStyle without breaking existing clients, so you’ll then need to go with a new function in any case. I also think that in general this one-trait-per-interface approach makes it more natural for a source to support multiple interfaces: what if my source can implement both the row and column interface equally well? What should I return from AccessStyle in that case?
  4. I think we would have a cleaner design if we separated the interface functions from default implementations. Right now, the default implementations are the fallback methods of the functions that make up the interface. I’m broadly not a fan of that design :slight_smile: I think one issue is that one now has encoded the behavior of a useful helper/default implementation as part of the interface. I think that is too restrictive. I think in general we could follow the design here that we have in base for iterators and collect: there is one set of functions that make up the iteration protocol. And then there is a helper function (collect) that takes an object, and understands all the subtleties of the interface and makes it super easy for a client to handle a really common case. So say we have a helper function get_me_this_table_as_a_namedtuple_of_columns, and inside that function it will do all the things that are right now in the fallback implementation: check which interfaces the source implements, decide which one is the most efficient to use, materialize things etc. If I look at the current constructor proposal here, then it seems to me that there is a lot of code that essentially any sink would also want that should really live in a helper function, and not inside each sink. Another benefit of this design is that the code that actually trades off different interfaces would no longer live in functions that are part of the interface. That would make it easier to change that code going forward, which I think would make it easier for us to try out new ideas without breaking existing code.
  5. I’m really unsure right now whether I like the idea that a row iteration interface can use arbitrary types for its rows. On one hand that seems very appealing and I can see scenarios where that is great (for example when iterating stuff from in-memory tables, or even a feather file). On the other hand I get quite nervous with the approach that the new CSV.jl takes with this. If I understand that correctly, in that case every row that is iterated is actually a lazy view into the CSV file, and when I access a specific field in this row via row.a, it will only then parse that field in the file. I can see that this is useful in specific instances, but it imposes a lot of restrictions on clients of the row interface: now if I want to write a performant client of the row interface, I need to essentially follow the following rules: a) only access each field once (otherwise I trigger a reparse), b) only access the fields in their order in the file (i.e. something like a=row.a; b=row.b would have different perf than b=row.b; a=row.a). That seems very restrictive to me. In my mind that should maybe even be a second row interface: if a client wants that behavior and knows how to deal with it, it can ask for a row iterator with that behavior, but if any row client has to anticipate this at any time, that seems difficult to me. In my mind, this whole design is too untested at this point to make it the long term, stable basic interop story. At the same time, I think we SHOULD experiment with it and see how it works. I think with my proposal of supporting multiple different interfaces, this could be done pretty easily, i.e. we could have a more conservative interface, and one that tries out a new idea like that.
  6. I think my main suggestion at this point, though, would be that we take more time to sort these things out and iterate on the design. I sense a bit of a desire to very quickly move ahead with the current design, merge PRs, maybe even tag releases. I think we stand a much better chance of creating something that we don’t have to break soon again if we slow this down, take a couple of weeks to discuss the various options, and move slowly. This request is partly from my very own perspective: our semester just started, and I just don’t have the time to participate in extended discussion in the next 1-2 weeks. I would like to contribute to all of this, but that would require that we take more time for all of this.
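The one-trait-per-interface idea from point 3 could be sketched like this (all names purely illustrative, not an actual API):

```julia
# One query function per interface, each with a conservative default;
# adding interface N+1 later is just adding another such function.
supports_row_interface(x) = false
supports_column_interface(x) = false

struct MyColumnarSource end
supports_column_interface(::MyColumnarSource) = true

# A source can opt in to several interfaces at once -- something a single
# AccessStyle return value cannot express:
struct MyFlexibleSource end
supports_row_interface(::MyFlexibleSource) = true
supports_column_interface(::MyFlexibleSource) = true

supports_column_interface(MyColumnarSource())   # true
supports_row_interface(MyColumnarSource())      # false
```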

#13

Interesting points @davidanthoff. Glad to see that there’s at least a general agreement on most of the interface. I can’t comment on all points, but here are a few questions/remarks.

Could you describe what other interfaces you have in mind?

I thought AccessStyle returned the preferred (i.e. native and most efficient) interface? Fallback methods can always be offered to support any interface, right? Or maybe you’d like one function/trait per interface which indicates whether it can be used efficiently (for example, without allocating a full copy of the data)? I imagine we could add a series of traits indicating this later (once more interfaces have been added), while AccessStyle would continue to indicate which interface is the preferred one.

So basically the question is whether we want all Table types to automatically support all interfaces (via fallbacks), or whether users should call say RowTable(df) or ColumnTable(df) before using these? I’m not sure I see the advantage of this approach. This is really equivalent to providing fallbacks automatically, except that you call them explicitly, which is more verbose. And if some code has been written e.g. with only the column-oriented interface in mind, it won’t work for row-oriented table types without some adaptation. This won’t help interoperability of table types, especially for interactive use and scripting, where you don’t design the code as carefully as when writing a package.

@quinnj can tell us more about this, but hopefully it should be possible to keep a buffer with the values which have already been parsed for the current row so that one can access them repeatedly without reparsing them. I agree a specialized interface could make sense if the performance impact of not using fields in order is large.
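The buffering idea could look something like this hypothetical sketch (not CSV.jl’s actual code): cache each field the first time it is produced, so repeated access within the current row returns the cached value instead of re-parsing.

```julia
# A row that parses fields lazily but remembers what it has already parsed.
struct BufferedRow
    parse::Function      # parse(colindex) -> value; potentially expensive
    cache::Vector{Any}
    filled::BitVector
end
BufferedRow(parse, ncols::Int) = BufferedRow(parse, Vector{Any}(undef, ncols), falses(ncols))

function fetch_col(r::BufferedRow, i::Int)
    if !r.filled[i]
        r.cache[i] = r.parse(i)   # parse only on first access
        r.filled[i] = true
    end
    return r.cache[i]
end

calls = Ref(0)
row = BufferedRow(i -> (calls[] += 1; i * 10), 3)
fetch_col(row, 2)   # 20, parses once
fetch_col(row, 2)   # 20, served from the cache
calls[]             # 1
```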


#14

I agree with @nalimilan here; I’ve heard you mention these “other column interfaces” a few times in various discussions, but I’ve yet to hear exactly what they are or how they work. Indeed, as far as I can tell, TableTraits includes two empty function definitions, but there don’t exist any sources/sinks that actually utilize these?

This is an interesting idea because it’s essentially the approach taken in DataStreams. i.e. Tables would indicate whether they support a given AccessStyle by overloading Tables.AccessStyle(::Type{MyTable}, ::Type{RowAccess}) = true. And as you mentioned, that allows additional interfaces to be defined in the future w/o breaking existing implementations (since the default would be that a table doesn’t support an interface unless it has overloaded the method).

I also agree w/ @nalimilan here in that it only makes things more verbose in making table types overload more methods. If a generic fallback has a bug in it, then the same “helper” function would have the same bug and all the table types that use the helper function to implement their own generic fallback have the same issue. I like the generic fallbacks because it ensures that, at least given the Tables.rows and Tables.columns interfaces, each table type can support them. I also think this somewhat comes back to personal preference/style in code organization: I really appreciate and strive to keep related code close to each other and find hyper-splitting things into separate packages very hard to approach as a developer looking to contribute.

Please be careful what you assume! If you took more time to review the current CSV.jl implementation, you’d realize that great care has been taken to ensure none of the restrictions you mention come into play w/ providing a row-view into a csv file. Indeed, I mean that when iterating rows, you can access individual values in whatever order you want and it will “just work”. Now, the caveat is that it’s been tuned to provide a fast-case when accessing values in a row sequentially, so that should be “preferred” by users, but even accessing values in reverse order isn’t too bad in my benchmarks. As is stated in the Tables.jl documentation, a Row-interface-satisfying type only has to implement propertynames and getproperty, so a true Row obviously needs to account for cases where users may only select a few fields from a row or in whatever order they desire.

I’m fine w/ letting things bake for a little and iterating a bit on the design peculiarities, but I’m also eager to push forward because of how much better all the code is/will be; for example, CSV.jl master is cleaner, simpler, faster, and more fully featured than ever, thanks in large part to the new Tables.jl design. My plan is to push hard on as many packages as I can to get Tables.jl integration merged on master, or at least ready to merge on a branch and once we have a decent wave of packages ready, I say we flip the switch and tag everywhere. I don’t want to lose some of the wonderful momentum we’ve had rolling since JuliaCon :slight_smile:


#15

So as I wrote above I just don’t have the time in the next 1-2 weeks to engage in the kind of debate that I think we need on this. I wish I had, but I might have half an hour max per day for julia stuff right now, and I want to use that to finish the Queryverse port and get VS Code up and running. Things should calm down in about two weeks, and then I’d love to hash this out.

Just a quick response to one thing:

Please be careful what you assume! If you took more time to review the current CSV.jl implementation, you’d realize that great care has been taken to ensure none of the restrictions you mention come in to play w/ providing a row-view into a csv file.

I took another look, and it seems to me what I wrote is exactly right, and pretty much the same as what you wrote a few sentences later? I didn’t write that the restriction is that you can’t access things out of order [*], I wrote that if you want to write a performant client you need to access the fields in a row in a sequential order. That strikes me as exactly the same thing you wrote about the fast-case implementation.

[*] Although, now that I did take another look at the implementation, I think it is also just faulty in its current form. Try to access a field in a given row, after you have accessed a field in the next row. You can easily get an incorrect value in that case. The problem is that the current implementation with lastparsedcol assumes that once you accessed a field in row n, you will never again access any field in any row before n. Probably easily fixed, though.


#16

It’s not a faulty case because there’s no API to do such a thing. The CSV.File is an iterable, so as you iterate it, you can’t get into a bad state. But we can argue implementation details another time. Good luck w/ the start of the semester!

We’ll try to keep you in the loop as we push forward on stuff in the mean time.


#17

I’ve posted a repro and description of this bug here. I don’t think it is difficult to fix, but it does require a slightly different internal model.


#18

Fair enough; fix is up here.


#19

Where can I see examples of that?
(For example fitting a mixed-effects model to a dataset larger than memory).
Will the user be able to do it, or do we need to wait till MixedModels.jl is adapted by its developer?


#20

Please do not ask the same question on many threads as the discussion becomes scattered.

I am not sure whether the mixed models implementation in MixedModels requires constant memory or if the memory requirements scale with the size of the datasets, but I suspect it is the latter.

OnlineStats has many interesting statistical tools that can be used with a constant memory requirement (GLM for example) and I invite you to read its documentation. I am not sure if mixed effects models are part of the OnlineStats toolkit. If there are examples in the literature of algorithms that can fit (maybe approximately) a mixed effect model without having to load the whole dataset, then probably adding them to OnlineStats is a reasonable feature request.