Getting our act together in the data ecosystem


#1

Lately I have been guilty of running off and doing my own thing while waiting for the data ecosystem to sort itself out, when I could be making more valuable contributions that would both achieve what I am trying to do and benefit the community. The thing is, I’m not too clear on where things are going or what the current plan is. This thread indicates that other people feel the same way. I propose that an immediate plan be agreed upon and stated in more public and less ambiguous terms (i.e. near the top of all the READMEs) so that would-be contributors and developers of new packages have a much better idea of where to direct their effort.

I apologize if none of this commentary is actually new. I just have the feeling that, where there is consensus, that fact isn’t particularly well known. Here’s what I propose:

  1. We get a real handle on how feasible it is to replace Nullable with small Union types, i.e. using Nulls.jl. Some performance tests need to be carried out (if they haven’t been already) to see whether it is reasonable to make this transition now, as opposed to on release of v1.0. My initial experiments seem to indicate that this might be ok. This will be the trickiest part of the whole process. Feedback from the core Julia devs on how confident they are that Unions really can be made as efficient as Nullable would be appreciated (it sounds a little too good to be true, but I’m gung-ho about it if there’s a solid consensus).

  2. All development of DataFrames and DataTables should be frozen until a Nulls.jl implementation is finished (or new PRs should all branch from @quinnj’s nulls branch).

  3. Once the DataTables transition to Nulls is complete, development of DataFrames would switch to maintenance mode (if this is not already the case). Committers would be (publicly and visibly) encouraged to make contributions only to DataTables. The DataFrames README would direct new users to DataTables.

  4. The DataStreams.jl interface should be tidied up in DataTables (this will require updates for nulls). The documentation for DataStreams should be rewritten in a clear and explicit manner. The READMEs of the various packages using it should make it clear that DataStreams is an appropriate interface for transferring data between different formats, and that every new tabular data source that wants to play nice with DataTables should implement it.

  5. DataStreams interfaces would be added to other packages such as IndexedTables.jl. DataFrames would be retired completely.

One response to this I anticipate is that we shouldn’t worry about Nulls right now. To this my response is “If we don’t, there’s a good chance we’ll go through all of this all over again next year, so let’s get it right now.”

Any thoughts? Perhaps nothing I’m saying is new? If so, great. The only major point of contention here that I am aware of is that many people strongly favor DataFrames, but I think even those people will acknowledge that the performance issues with the type-unstable DataFrames effectively exclude them as a long-term solution, so we may as well get on with it. On the other hand, maybe I’m so disconnected that the thinking on all this has completely changed without my knowledge.

TL;DR

Let’s put a big fat statement at the top of every README telling people what they should use, a rough roadmap, and what they should contribute to.


#2

But if rows become named tuples, won’t that be type-stable? I think the problem is that there’s a lot of uncertainty in 1.0 matters related to the data ecosystem, which makes it difficult for the data ecosystem to be ready by 1.0. I think that’s fine though, since the point of 1.0 is to provide the stable base that things like this need in order to settle down. I do agree, though, that at least a coordinated plan, with some clear way of communicating it to people, is necessary by 1.0.


#3

Sorry, my use of language was bad. I wasn’t necessarily asserting that what I’ve just proposed is the best way of doing things or that there aren’t many other, equally good ways, but I really think it’s about time we pick something reasonable that could be a long-term solution (or at least resemble one) and make a stronger commitment to it. Perhaps it would be better if we all switched to IndexedTables (there are some significant barriers to this for me, but if there’s a broad consensus on that I’m willing to contribute to making it work).

By the way, was this the proposed way forward for DataFrames? If so, I completely missed that. As far as I know, IndexedTables already does this.


#4

We’ve discussed this quite a lot among the Julia and JuliaData developers recently, and I think we should be able to post an updated roadmap soon. Basically, Unions should be reasonably fast in Julia 0.7/1.0, but in 0.6 they are still slow. Their current performance is easy to benchmark, since that’s what DataFrames is already doing with DataArray columns. That means we can start using the new approach with Julia 0.6: it won’t be slower than DataFrames.

One question is which of DataFrames and DataTables should be ported to Nulls. In some ways, DataFrames is closer to what the new Nulls-based framework will look like, but a lot of work has gone into DataTables recently which is not in DataFrames (yet?). At the end of the day the main question is that of the name we want to use in the future.

Anyway, you can expect a more detailed plan soon.


#5

I’ve just posted the updated plan: