What have we learned from DataFrames in Julia?

juliohm · November 29, 2017, 5:47am

I am following the discussion around DataFrames from a distance, but I am really curious about the findings of the main contributors and the future directions of the DataFrame concept in Julia. This post is an attempt to understand what we can hope for in terms of performance, given the general concern around type stability and operations that return data frames with different column types.

To start the discussion, my first question is simple:

Can we have high-performance DataFrames in Julia someday? How will the type system cope with the problem of different columns having different types?

My second question is not really a question, but a push to foster innovation:

In the case that DataFrames aren’t gonna be fast enough, have you thought of alternatives to replace the concept in its entirety? Some data structure that is even better than DataFrames for data scientists?

randyzwitch · November 29, 2017, 2:53pm

IMO, the speed of DataFrames.jl is orthogonal to the data frame as a data structure. Personally, I couldn’t care less about the performance of DataFrames.jl, because to me what data frames provide is interoperability with every other relational data or relational database system. So speed is secondary to “able to accept any slop I receive from anywhere” for me.

So I guess I’m voting for the ergonomic “every column is assumed to have nulls” version of DataFrames, and if people care about performance, then have the ability to have higher performance via atomic typed columns. ~~As opposed to the current “I didn’t see any nulls on load, so the column will never have any” typing that happens now.~~

nalimilan · November 29, 2017, 3:02pm

That’s not exactly what happens. For example, if you import data from a SQL database, the package handling that can perfectly well create columns which allow for missing values if the original column allowed them in SQL, even if the data contains no missing values. Actually, what we have changed is that DataFrame constructors now always respect the input types, which should rather help interoperability.

randyzwitch · November 29, 2017, 3:07pm

My apologies for misrepresenting; I'm still getting used to the 0.11.x updates.

juliohm · November 29, 2017, 5:56pm

Thank you for the answers. What else do you think is an unsolved challenge?
