I’m mildly terrified of asking, but has any thought been given to writing a dataframe implementation using union types? We’ve been told that upcoming changes to how union types are handled will make them efficient enough to be appropriate for use in data.
I am aware of Nulls.jl and that @quinnj has been experimenting with using this for DataStreams.jl, but I’m not aware of any actual dataframes implementation.
As far as I know, using union types even in their current state wouldn’t be any less efficient than what is already being done in DataFrames in most cases. The Nulls approach also seems superficially more similar to the approach of DataFrames than to that of DataTables. I think the changes needed to make DataFrames use Nulls would be relatively minor. I suppose it would be a bit foolhardy to start on this before seeing whether union types will indeed become as efficient as has been suggested, but it’s tempting to look forward to the ultimate solution.
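To make the idea concrete, here is a minimal sketch of what a union-typed column might look like. It does not use Nulls.jl itself: the `Null` type, the `null` constant, and the `sum_skipnull` helper are hypothetical stand-ins defined locally for illustration, so the actual Nulls.jl API may well differ.

```julia
# Hypothetical stand-in for the null type a package like Nulls.jl provides.
struct Null end
const null = Null()

# A "column" is just an ordinary Vector whose element type is a small Union.
col = Union{Int, Null}[1, null, 3]

# Sum the non-null entries; the `isa` check is the branch that the
# union-type optimizations are meant to make cheap.
function sum_skipnull(v)
    s = 0
    for x in v
        if !(x isa Null)
            s += x
        end
    end
    return s
end

total = sum_skipnull(col)
```

The appeal is that no wrapper type is needed: the column is a plain `Vector`, and whether this performs well depends entirely on how the compiler handles the `Union` element type.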
I was just wondering what the thinking was among the data people.
Funny you should ask: https://github.com/JuliaData/DataTables.jl/pull/66
We’re currently experimenting with the design in that pull request, as well as in some issues over at Nulls.jl. Overall, it seems to nicely simplify things, though as you mentioned, we should certainly take more care in considering performance.
Jameson has a branch against Julia master called jn/union-bits-layout that I’m starting to play with. So far I’m trying to inspect a lot of the generated code to see what improvements are available, but I’m also interested in checking performance.
-Jacob
Good to know things are moving along. I take it that DataStreams will mostly be staying in its current form (hopefully with a few major documentation updates and partial column addressing)?
My approach lately has been to use DataStreams for everything. At the risk of going off on a tangent, I absolutely despise SQL. I figured that integer-addressable fields would be a perfectly reasonable, universal bare minimum to ask of a tabular data format API, but oh no, no. SQL is firmly rooted in 1974. Anyway, where I was going with this is that I’m trying to set things up to stream batches of data into machine learning using DataStreams, but this has been ugly when it comes to SQL. I’m definitely eager to help with getting DataStreams going where appropriate, so it would be nice if we had a data frame that also uses Nulls.
I have new branches for CSV and DataStreams as well (in addition to the DataTables PR referenced above) where everything uses Nulls.jl (all named jq/gangy for each repo). I’m also working on some API improvements for DataStreams, but I’ll post some thoughts on that in the other thread.