This roadmap should be characterized as the DataFrames.jl/DataTables.jl/DataStreams.jl roadmap, but not the roadmap for the data ecosystem in general. There might be other packages that are on board, but at least some folks (e.g. me) are not yet sold on the `Union{T,Null}` approach and have plans/roadmaps that differ from what has been outlined in the original post here when it comes to handling missing values.
I want to stress that I’m strongly in favor of experimenting with `Union{T,Null}`. But in my opinion there are too many open technical questions and too little understanding of all the ramifications of this plan to commit to this approach at this point (see some of the issues in Nulls.jl). I also think there are other approaches that have not been fully explored yet that might give us the same kind of usability that `DataFrames` has right now with the speed of `DataTables`, but that require far fewer (if any) changes in julia base.
With that preamble, here is my current roadmap for the family of packages that I’ve created in this space over the last few years (DataValues.jl, IterableTables.jl, Query.jl, CSVFiles.jl, ExcelReader.jl, ExcelFiles.jl, FeatherFiles.jl and StatFiles.jl; jointly loadable as Dataverse.jl):
- If all the `Union{T,Null}` issues are sorted out by julia 1.0 and that approach has emerged as the best approach to missing data, I’ll try to port my packages over once julia 1.0 is out (or maybe during the RC phase or something like that). I should stress that currently both of these conditions have huge question marks attached in my mind.
- I’ll try to push the approach I took in DataValues.jl during the julia 0.6 cycle as far as I can. `DataValue` has handled the missing value story for Query.jl and IterableTables.jl very successfully for a couple of months now. I’m currently working on a port of DataTables.jl that uses `DataValue` for missing data. I’m not yet done, but I’m optimistic that I can create something that has the ease of use that `DataFrame` has, without the performance problems (i.e. performance would be in line with `DataTable` using `Nullable`). I’d say I’m halfway done with this work, so I don’t really know whether this will work out, but we should know fairly soon. I’m also not sure where this code will eventually be hosted. It might be in DataTables.jl if the folks maintaining that agree, or it might be a new package. I’ll write a much more detailed outline of my strategy for all of this once I’m done with the coding and have a sense of whether it can actually work.
In my mind the high-level strategy for the data ecosystem broadly should be that we try to push both the `Union{T,Null}` and the `DataValue` approach as far as we can, and once we have a better understanding of the trade-offs, decide between those two approaches.