Union type data frame implementation?

ExpandingMan · May 25, 2017, 3:09pm

I’m ~~a little hesitant to ask~~ mildly terrified of asking, but has any thought been given to writing a dataframe implementation using union types, now that we’ve been told that updates to the handling of union types will make them efficient enough for use in data?

I am aware of Nulls.jl and that @quinnj has been experimenting with using this for DataStreams.jl, but I’m not aware of any actual dataframes implementation.

As far as I know, using union types even in their current state wouldn’t be any less efficient than what is already being done in DataFrames in most cases. The Nulls approach also seems superficially closer to the approach of DataFrames than to that of DataTables. I think the changes needed to make DataFrames use Nulls would be relatively minor. I suppose it would be a bit foolhardy to start on this before seeing that union types will indeed become as efficient as has been suggested, but it’s tempting to look forward to the ultimate solution.

I was just wondering what the thinking was among the data people.

mkborregaard · May 25, 2017, 3:18pm

Check the discussion here: https://github.com/JuliaData/DataTables.jl/issues/62

quinnj · May 25, 2017, 3:20pm

Funny you should ask: https://github.com/JuliaData/DataTables.jl/pull/66

We’re currently experimenting with the design with that pull request as well as some issues over at Nulls.jl. Overall, it seems to nicely simplify things, though as you mentioned, we should certainly take more care in considering performance.

Jameson has a branch against Julia master called jn/union-bits-layout that I’m starting to play with. So far I’m trying to inspect a lot of the generated code to see what improvements are available, but I’m also interested in checking performance.

-Jacob

ExpandingMan · May 25, 2017, 3:49pm

Good to know things are moving along. I take it that DataStreams will mostly be staying in its current form (hopefully with a few major documentation updates and partial column addressing)?

My approach lately has been to use DataStreams for everything. At the risk of going off on a tangent, I absolutely despise SQL. I figured that integer-addressable fields would be a perfectly reasonable, universal bare minimum to ask of a tabular data format API, but oh no, no. SQL is firmly rooted in 1974. Anyway, where I was going with this is that I’m trying to set things up to stream batches of data into machine learning using DataStreams, but this has been ugly when it comes to SQL. I’m definitely eager to help with getting DataStreams going where appropriate, so it would be nice if we had a data frame that also uses Nulls.

quinnj · May 25, 2017, 7:24pm

I have new branches for CSV and DataStreams as well (in addition to the DataTables PR referenced above) where everything uses Nulls.jl (all named jq/gangy for each repo). I’m also working on some API improvements for DataStreams, but I’ll post some thoughts on that in the other thread.
