Tables.jl vs TableTraits.jl (was TextParse.jl is fast again)

Tables.jl was started when @davidanthoff and I had a chance to hack together at JuliaCon 2018 in London for several hours on how we could join forces with TableTraits.jl/IterableTables.jl & DataStreams.jl (it’s better for everyone if there can be a single, standard tables interface). Tables.jl was the result of that & a few subsequent discussions, in addition to a group of data people working together at the hackathon.

Tables.jl fully supports a wide range of general table features, and I view it as a superset of both DataStreams & IterableTables. It provides a simplified interface from what DataStreams required, while not compromising on performance. It also provides more functionality than IterableTables in that it provides a first-class column-access interface. It also provides the option for tables to define their own custom row types as opposed to requiring NamedTuples as rows (this can lead to big performance boosts in some scenarios).

In order to ease transition for packages that were using DataStreams.jl or TableTraits.jl previously, several “helper” methods have been included in the Tables.jl package to provide full integration. E.g. even though parts of some queryverse packages still rely on DataValues to represent missing data, Tables.jl has helper methods to unwrap/wrap tables coming from those packages to allow integration with the rest of the ecosystem using missing and Union{T, Missing}.

At this point, the Tables.jl interface has been stable since its initial release, so packages should feel free to implement/require/depend on it without fear of things breaking. It is still relatively new and bug reports are being fixed as quickly as possible (including finally getting to the bottom of a particularly tricky Query.jl bug), but there are now some ~10 packages that use Tables.jl in their current release for data/table format interop, with some exciting new things happening/planned with FormattedTables.jl, StatsModels.jl, and JuliaDB.jl.

Thank you for this clarification. Tables.jl sounds very promising, and it does make sense for there to be one common interface. So will the queryverse be transitioning to Tables.jl?

Also, how does Tables.jl handle non-random-access or even merely iterable structures like distributed tables? What about nested and irregularly shaped structures like JSON (which I believe is a Query.jl use case)?

Ultimately that’s up to @davidanthoff; he’s made excellent contributions to how the Tables.jl interface has taken shape, but also mentioned elsewhere that he wants to give it time to bake and shake out any issues.

The official interface details are found on the repo README.md. In short, a structure can define Tables.rows(x::MyTable) = ... which must return an iterable of “rows”, where “row” is any property-accessible object (supports propertynames and getproperty). This makes it very flexible for cases like a distributed table.
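To make the “row” requirement concrete, here is a minimal sketch in plain Julia with no package dependencies. The names `MyRow` and `myrows` are hypothetical and exist only for illustration; `myrows` stands in for what a real implementation would define as `Tables.rows(x::MyTable)`. The point is only that a row can be any object answering `propertynames` and `getproperty`, and that rows can be produced lazily.

```julia
# Hypothetical table type: column names plus one inner vector per row.
struct MyTable
    names::Vector{Symbol}
    data::Vector{Vector{Any}}
end

# Hypothetical custom row type: a thin view onto one row of the table.
struct MyRow
    table::MyTable
    index::Int
end

# A "row" only has to support propertynames and getproperty.
# getfield is used internally so we do not recurse into our own getproperty.
Base.propertynames(r::MyRow) = getfield(r, :table).names

function Base.getproperty(r::MyRow, name::Symbol)
    t = getfield(r, :table)
    col = findfirst(==(name), t.names)
    col === nothing && throw(ArgumentError("no column named $name"))
    return t.data[getfield(r, :index)][col]
end

# Stand-in for `Tables.rows(x::MyTable)`: returns a lazy iterable of rows.
myrows(t::MyTable) = (MyRow(t, i) for i in 1:length(t.data))

t = MyTable([:a, :b], [[1, "x"], [2, "y"]])
for row in myrows(t)
    println(row.a, " ", row.b)
end
```

Because the rows are generated on demand, nothing here requires the whole table to be in memory or randomly accessible, which is the property that matters for something like a distributed table.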

In terms of nested/irregularly shaped data, Tables.jl is focused very much on two-dimensional formats that support rows & columns. I have some ideas around integrating with a JSON package to specify how to “query” a JSON object to produce rows/columns, but there won’t be official support for arbitrary JSON objects. Keeping the Tables.jl interface restricted to rows & columns helps simplify the concepts and implementations for sources & sinks. Perhaps at some point, we’ll come up with an interface even more general.

I do not plan to migrate the Queryverse.jl to Tables.jl. I’m not on board with a number of important design decisions in Tables.jl (*), and more generally don’t see a reason to start with yet another table interface for Queryverse.jl at this point. TableTraits.jl has been around for a long time, with a large number of integrated sources and sinks (more than any other previous or current effort in this space), and the philosophy in general for Queryverse.jl is to do incremental improvements, stay backwards compatible and not spend time on big redesigns if at all possible. I’m more than open to suggestions on how one can improve and add features to TableTraits.jl, but in my mind that does not require starting from scratch.

Just to be clear, I disagree on the latter part, i.e. there is no consensus here that Tables.jl will eventually replace TableTraits.jl / IterableTables.jl, at least not from my point of view.

That is an incorrect characterization of TableTraits.jl. It has had a column interface for almost a year now.

(*) Just to clarify how this is compatible with what @quinnj wrote about JuliaCon: we did spend an afternoon trying to find a way so that TableTraits.jl and (I thought) DataStreams.jl could gracefully co-exist and found a number of good solutions for that. That made a lot of sense to me: there was a fair amount of existing code in both ecosystems, and making sure this all worked together without breaking changes in either ecosystem seemed (and still does) like a good idea to me. Within the constraints of not breaking things, I thought we found good solutions.

Could one of the moderators split the table interface discussion out into its own thread? It seems not really related to the topic here, namely the performance of TextParse.jl.

Understood.

Can you elaborate on the interop story?

Well, this is disappointing for the data community. The entire purpose of Tables.jl is as a community-driven approach to a unified table interface to make data formats & table types easier to integrate across the ecosystem without needing to support multiple interfaces.

Please file issues! I’m surprised by this comment because you’ve basically been involved in every major design decision since the repo began, providing very valuable input, but there don’t seem to be any outstanding comments or issues. I know you’ve mentioned elsewhere being concerned about how new the interface is, and that’s fair, but that doesn’t seem to suggest over-arching design issues.

In my mind, it’s not really starting from scratch. We’ve taken the best parts of TableTraits.jl & DataStreams.jl and created a unified interface that is very flexible and performant. In the process of switching a few of my own packages that were using DataStreams.jl to use Tables.jl, it was a joy: the new interfaces are simple to implement and encourage nice abstractions/workflows for sources & sinks.

But for me, the biggest win isn’t necessarily the improved interfaces, but the possibility of actually having a unified interface the data ecosystem can rely on. I definitely understand the need to experiment and try out new ideas, but I also think it’s an even more powerful moment when we as a community can then take a step back, and unite various efforts. It’s a huge win for users because they don’t have to be confused by multiple, very similar options. It’s a huge win for package developers as well, because they can code against a single interface and get countless integrations “for free”, like we’re seeing with StatsModels.jl. It’s non-ideal, for example, to expect packages that want to do table stuff to have to do gymnastics like those currently implemented in DataFrames, just to support Tables & TableTraits.

I understand collaboration is hard; it takes extra time and effort to coordinate, but the Julia community has always been special IMO in its level of coordination and collaboration. My hope is that we can collaborate and come together for the benefit of the data community.

I don’t understand this. Why should Queryverse.jl migrate to Tables.jl? Isn’t the goal of Queryverse to provide an API similar to the R Tidyverse? How could it change the API?

In R everything supports the ‘data.frame interface’. With Julia I’m unsure; is ‘Tables.jl’ supposed to define such a common interface? If yes, then why isn’t it sufficient to add a ‘translator’ between Queryverse.jl and Tables.jl? And a bit of an ignorant question: why isn’t the DataFrames.jl package ‘the boss’ to define such a common interface? (Or is it, and Tables.jl is exactly this?)

Maybe offtopic here: I didn’t see JuliaDB.jl on Tables’ currently integrated package list, is this possible/planned to be added (later)?

Query.jl will keep its front-end, dplyr-like API; the discussion is about what the backend should look like.

I also completely concur with @quinnj about having a unified interface, though I didn’t feel comfortable saying so until he did.

Aside from the benefits for community code and ease, having multiple popular table interfaces only contributes to the perceived package fragmentation problem for new users.

So as a user, sometimes developer and an advocate who’d like to see Julia succeed, I really hope that @quinnj and @davidanthoff can work out some sort of unification here :slight_smile:

As @datnamer mentioned, this wouldn’t affect the user-facing API of Query.jl at all.

Roughly speaking, yes. In R, I believe it’s more a case where alternative data.frame subtypes exist and the “interface” relies on inheritance more than the actual data.frame protocol (i.e. methods making up data.frame behavior). For Tables.jl, it’s strictly a behavioral interface; there is a set of methods to implement, and if those are satisfied, another set of methods is expected to Just Work. But the end goal is the same: provide an API to two-dimensional rows/columns-like structures that “downstream” packages (stats, machine learning, plotting) can code against without having to hardcode against any specific implementation.
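As a rough illustration of “behavioral interface” versus inheritance, the sketch below uses only Base Julia. The function `column_sum` and the type `Point` are hypothetical names made up for this example and are not part of Tables.jl; the idea is just that a downstream function written against behavior (iteration plus `getproperty`) works on unrelated row types that share no common supertype.

```julia
# Hypothetical downstream function: it relies only on iteration and
# getproperty, never on a concrete table or row type.
function column_sum(rows, name::Symbol)
    total = 0.0
    for row in rows
        total += getproperty(row, name)
    end
    return total
end

# Source 1: a vector of NamedTuples.
nt_rows = [(x = 1.0, y = 2.0), (x = 3.0, y = 4.0)]

# Source 2: a vector of a custom struct (hypothetical type) that has no
# relationship to NamedTuple other than exposing the same properties.
struct Point
    x::Float64
    y::Float64
end
struct_rows = [Point(1.0, 2.0), Point(3.0, 4.0)]

println(column_sum(nt_rows, :x))      # prints 4.0
println(column_sum(struct_rows, :y))  # prints 6.0
```

Neither source had to subtype anything for `column_sum` to accept it, which is the contrast with the inheritance-based approach described for R above.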

There already are “translator” helper methods in Tables.jl to ease integration with Queryverse.jl, designed and implemented specifically to help packages currently using TableTraits to switch to Tables.jl and allow Query.jl manipulation macros to be consumable by Tables.jl sinks. While convenient, it doesn’t really simplify things if packages still have to account for both TableTraits and Tables.jl. Ideally for users & developers, they could code against a single interface and get interop everywhere.

While very popular, DataFrames.jl is just one package with opinions about how to work with table-like data. It provides a ton of functionality over thousands of lines of code, so it’s not ideal to be considered as an “interface” package. Ideally, an interface is a lightweight package with minimal dependencies that focuses on being extremely stable to enable interface implementations to take a dependency and not have to worry about things going forward.

Yes, support is planned and being actively worked on. It’s a bit involved, because JuliaDB.jl itself has several layers and JuliaDB itself isn’t quite 1.0 compatible yet, but hopefully soon.

Thank you quinnj and datnamer for the detailed answers!

If I understand right (please correct me if I’m wrong), the reason for not sticking with TableTraits.jl but instead introducing a new interface through Tables.jl is that the latter is “more flexible” and handles cases not covered by TableTraits.jl (and that don’t really matter for Queryverse.jl, hence the hesitation to switch). Couldn’t it then be a compromise to make Tables.jl a strict superset of TableTraits.jl? That way, the Queryverse could automatically work with all data structures supporting the newer Tables.jl interface. Authors of table data structure packages should then be encouraged to implement the full Tables.jl interface for maximum compatibility with the ecosystem. Still not ideal, but maybe a step forward.

I really want to subscribe to this. I would love to add table support for any table type in things I develop, but the if/else gymnastics that are needed really are a bit weird and IMO defeat the purpose of a common interface. My concern is that some package developers may not be easily convinced to do this and will just support one arbitrarily chosen chunk of the table ecosystem.

This is maybe where abstract inheritance as traits would come in handy.

And then there is TypedTables.jl. The table universe is immense.

TypedTables is an implementation, just like DataFrames, not an additional abstraction. So it’s not in competition with Tables.jl and TableTraits.jl.

TypedTables REQUIRES Tables and thus (iiuc) is an example of the attempt to provide a “unified table interface to make data formats & table types easier to integrate across the ecosystem” (from quinnj further above).

With limited time available to look at code, I had the impression that Tables.jl is a proper table abstraction, but that TableTraits.jl is more specific/limited. Certainly it would be a large undertaking to migrate the many Queryverse.jl packages (I don’t know if it is technically possible).

So here is the current situation around TableTraits.jl, as far as I can tell: there are currently 21 packages that can do table interop via that ecosystem on Julia 1.0. Of these, 8 are primarily maintained by folks other than me. On Julia 0.6 there were an additional 8 packages that were integrated into that ecosystem. Those integrations were somehow lost in the transition to Julia 1.0, but at least for some of these it would take very little effort to get them back into the fold (in some of these cases the packages themselves had not been ported to Julia 1.0 when I did my big update push).

I’ve counted Query.jl as one package in this, but really each of the 11 query operators in that package is a separate sink and source. In fact, the (“internal”) interface that holds the different query operators together is essentially TableTraits.jl (so in theory each of these 11 query operators could live in a separate package). So really, if we count types that can interop via TableTraits.jl, we should count Query.jl as 11 individual sinks and sources.

I’m all for collaboration and a unified approach to table interop. But if there is an existing, large ecosystem around (like the one I described above), then I believe such collaboration needs a very different process than what we saw. In particular, I think it needs to start with a discussion of some very basic questions: 1) do we need a new interface, or is the existing one with a very large number of sinks and sources fine? 2) what are the limitations of the existing interface? 3) can we, without breaking anything, evolve the existing interface to gain new abilities? I think a first step for one unified approach would be to gain a mutual understanding on these questions, and if that emerges, go from there.

I believe I wrote pretty early on in a Slack conversation that I didn’t agree with the need for an entirely new interface. I also believe I wrote somewhere here on Discourse that I would design a table interop interface in a different way than Tables.jl. I also suggested at that point that we slow this process down and not rush a new design out before we have a mutual understanding of the landscape.

As an outside observer to this discussion and a user “in the market” for an ecosystem of tabular data structures, my understanding of the difference between the camps could be summarized as:

  • Queryverse.jl/TableTraits.jl: Focus on interop first (support many packages with a simple interface), worry about performance in special cases later.
  • Tables.jl: Focus on performance first (define a richer interface for different use cases: row-based/col-based), support fewer packages (for now, because it is starting from scratch).

Not sure if this description is accurate or helpful, but I guess this discussion would be most productive if all the contributors agree on the strengths of their respective approaches.

Also, I guess it would be possible to implement interface bridges between the two, but of course this work would be redundant if one could agree on a single interface in the first place.

I am also an outsider to both projects, but intrinsically, I don’t think there is anything that prevents either design from being fully optimized.

I originally thought that Tables.jl was a simpler framework than TableTraits.jl, but I am no longer sure that using Julia’s iteration protocol directly for access is able to handle everything well (e.g. resource management, cf. #22466).

I think that ultimately an ideal interface will require a departure from iterate.

2 posts were split to a new topic: Package for tabular data