Lately I have been guilty of running off and doing my own thing while waiting for the data ecosystem to sort itself out, when I could be making more valuable contributions that would both achieve what I am trying to do and benefit the community. Thing is, I’m not too clear on where things are going or what the current plan is. This thread suggests that other people feel the same way. I propose that an immediate plan be agreed upon and stated in more public and less ambiguous terms (i.e. near the top of all the READMEs) so that would-be contributors and developers of new packages have a much better idea of what to do with themselves.
I apologize if none of this commentary is actually new. I just have the feeling that, where there is consensus, that fact isn’t particularly well known. Here’s what I propose:
- We get a real handle on how feasible it is to replace `Nullable` with small `Union` types, i.e. using Nulls.jl. Some performance tests need to be carried out (if they haven’t already been) to see if it is reasonable to make this transition now, as opposed to on release of v1.0. My initial experiments seem to indicate that this might be ok. This will be the trickiest part of the whole process. Feedback from the core Julia devs on how confident they are that `Union`s really can be made as efficient as `Nullable` would be appreciated (it sounds a little too-good-to-be-true, but I’m gung-ho about it if there’s a solid consensus).
- All development of `DataFrames` and `DataTables` should be frozen until a Nulls.jl implementation is finished (or new PRs should all be branched off @quinnj’s nulls branch).
- Once the `DataTables` transition to Nulls is complete, development of `DataFrames` would switch to maintenance mode (if this is not already the case). Committers would be (publicly and visibly) encouraged to make contributions only to `DataTables`. The `DataFrames` README would direct new users to `DataTables`.
- The DataStreams.jl interface should be tidied up in `DataTables` (will require updates for nulls). The documentation for `DataStreams` should be rewritten in a clear and explicit manner, and the READMEs of the various packages using it should make it clear that `DataStreams` is an appropriate interface for transferring data between different formats, and that every new tabular data source that wants to play nice with `DataTables` should implement it.
- DataStreams interfaces would be added to other packages such as IndexedTables.jl.
- `DataFrames` would be retired completely.
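To make the "performance tests" in the first point a bit more concrete, here is a rough sketch of the kind of comparison I have in mind. It deliberately avoids both `Nullable` and Nulls.jl: `MyNull`, `MyNullable`, `sum_union` and `sum_wrapped` are hypothetical stand-ins I made up so that the snippet depends only on Base and runs on a recent Julia. A real test would swap in the actual types and use a proper benchmarking package, so treat any numbers from this as indicative at best.

```julia
# Hypothetical stand-ins, NOT the real Nulls.jl or Nullable types.
# `MyNull` plays the role of a null singleton; `MyNullable` mimics a
# Nullable-style wrapper that stores a flag next to the value.
struct MyNull end
const mynull = MyNull()

struct MyNullable{T}
    hasvalue::Bool
    value::T
end

# Sum the non-null entries of a vector with a small Union element type.
function sum_union(xs::Vector{Union{Float64, MyNull}})
    s = 0.0
    for x in xs
        if x isa Float64      # branch on the runtime type of the element
            s += x
        end
    end
    return s
end

# Sum the non-null entries of a vector of wrapper values.
function sum_wrapped(xs::Vector{MyNullable{Float64}})
    s = 0.0
    for x in xs
        if x.hasvalue         # branch on the stored flag
            s += x.value
        end
    end
    return s
end

n = 1_000_000
raw = rand(n)
nullmask = rand(n) .< 0.1     # roughly 10% of the entries are null

as_union = Union{Float64, MyNull}[nullmask[i] ? mynull : raw[i] for i in 1:n]
as_wrapped = [MyNullable{Float64}(!nullmask[i], raw[i]) for i in 1:n]

# Warm-up calls so that compilation is not included in the timings.
sum_union(as_union)
sum_wrapped(as_wrapped)

t_union = @elapsed sum_union(as_union)
t_wrapped = @elapsed sum_wrapped(as_wrapped)

println("Union vector:   ", t_union, " s")
println("wrapper vector: ", t_wrapped, " s")
```

A single `@elapsed` call is noisy, so this only gives a first impression. The interesting question is whether the two timings stay in the same ballpark across Julia versions and across more realistic operations than a simple sum.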
One response to this I anticipate is that we shouldn’t worry about Nulls right now. To this my response is “If we don’t, there’s a good chance we’ll go through all of this all over again next year, so let’s get it right now.”
Any thoughts? Perhaps nothing I’m saying is new? If so, great. The only major point of contention here that I am aware of is that many people strongly favor `DataFrames`, but I think even those people will acknowledge that the performance issues with the type-unstable `DataFrames` effectively rule it out as a long-term solution, so we may as well get on with it. On the other hand, maybe I’m just so completely disconnected that the thinking on all this has completely changed without my knowledge.
TL;DR
Let’s put a big fat statement at the top of every README telling people what they should use and what they should contribute to, along with a rough roadmap.