Lately I have been guilty of running off and doing my own thing while waiting for the data ecosystem to sort itself out, when I could be making more valuable contributions that would both achieve what I am trying to do and benefit the community. Thing is, I’m not too clear on where things are going or what the current plan is. This thread suggests that other people feel the same way. I propose that an immediate plan be agreed upon and stated in more public and less ambiguous terms (i.e. near the top of all the READMEs) so that would-be contributors and developers of new packages have a much better idea of what to do with themselves.
I apologize if none of this commentary is actually new; I just have the feeling that, where there is consensus, that fact isn’t particularly well known. Here’s what I propose:
1. We get a real handle on how feasible it is to replace Nullable with Union types, i.e. using Nulls.jl. Some performance tests need to be carried out (if they haven’t been already) to see if it is reasonable to make this transition now, as opposed to on release of v1.0. My initial experiments seem to indicate that this might be ok. This will be the trickiest part of the whole process. Feedback from the core Julia devs on how confident they are that Unions really can be made as efficient as Nullable would be appreciated (it sounds a little too good to be true, but I’m gung-ho about it if there’s a solid consensus).
2. All development of DataTables should be frozen until a Nulls.jl implementation is finished (or new PRs should all branch off @quinnj’s nulls branch).
3. Once the DataTables transition to Nulls is complete, development of DataFrames would switch to maintenance mode (if this is not already the case). Committers would be (publicly and visibly) encouraged to make contributions only to DataTables. The DataFrames README would direct new users to DataTables.
4. The DataStreams.jl interface should be tidied up in DataTables (will require updates for nulls). The documentation for DataStreams should be rewritten in a clear and explicit manner, and the READMEs of the various packages using it should make it clear that DataStreams is an appropriate interface for transferring data between different formats, and that every new tabular data source that wants to play nice with DataTables should implement it.
5. DataStreams interfaces would be added to other packages such as IndexedTables.jl.
6. DataFrames would be retired completely.
One response to this I anticipate is that we shouldn’t worry about Nulls right now. To this my response is “If we don’t, there’s a good chance we’ll go through all of this all over again next year, so let’s get it right now.”
Any thoughts? Perhaps nothing I’m saying is new? If so, great. The only major point of contention here that I am aware of is that many people strongly favor DataFrames, but I think even those people will acknowledge that the performance issues with the type-unstable DataFrames effectively exclude them as a long-term solution, so we may as well get on with it. On the other hand, maybe I’m just so completely disconnected that the thinking on all this has completely changed without my knowledge.
Let’s put a big fat statement at the top of every README telling people what they should use, a rough roadmap, and what they should contribute to.