I have been following the discussion around DataFrames from a distance, and I am really curious about what the main contributors have found and where the DataFrame concept is headed in Julia. This post is an attempt to understand what we can hope for in terms of performance, given the general concern around type stability and operations that return data frames with different column types.
To start the discussion, my first question is simple:
Can we have high-performance DataFrames in Julia someday? How will the type system cope with the fact that different columns have different types?
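To make the type-stability concern concrete, here is a minimal sketch in plain Julia. It assumes a toy table stored as a `Dict` from column names to vectors; this is a hypothetical stand-in, not how DataFrames.jl is actually laid out, and the names `table`, `slow_sum`, `fast_sum` and `barrier_sum` are invented for illustration. Because the container only promises "some vector", the compiler cannot know the element type of a column it looks up, so a loop over that column falls back to dynamic dispatch. Passing the column to a separate function (a "function barrier") lets Julia specialize on the concrete column type.

```julia
# Hypothetical toy table: column name => column vector of any element type.
table = Dict{Symbol, AbstractVector}(
    :x    => [1.0, 2.0, 3.0],
    :name => ["a", "b", "c"],
)

# The type of t[col] is only known to be AbstractVector, so the element type
# of `v` cannot be inferred and `s += v` is dispatched at run time.
function slow_sum(t, col)
    s = 0.0
    for v in t[col]
        s += v
    end
    return s
end

# Kernel that is compiled separately for each concrete column type it receives.
function fast_sum(v::AbstractVector{<:Real})
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

# Function barrier: one dynamic dispatch at the call, then a type-stable loop.
barrier_sum(t, col) = fast_sum(t[col])
```

Both versions return the same result; only the barrier version gives the compiler a concretely typed loop. Running `@code_warntype slow_sum(table, :x)` in the REPL is one way to see where inference gives up.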
My second question is not really a question, but a push to foster innovation:
In the case that DataFrames aren’t gonna be fast enough, have you thought of alternatives to replace the concept in its entirety? Some data structure that is even better than DataFrames for data scientists?
IMO, the speed of DataFrames.jl is orthogonal to the data frame as a data structure. Personally, I couldn’t care less about the performance of DataFrames.jl, because to me what data frames provide is interoperability with every other relational data/relational database system. So for me speed is secondary to being “able to accept any slop I receive from anywhere”.
So I guess I’m voting for the ergonomic “every column is assumed to have nulls” version of DataFrames, with the option of higher performance via atomically typed columns for people who care about it. That is in contrast to the current typing, which amounts to “I didn’t see any nulls on load, so the column will never have any”.
That’s not exactly what happens. For example, if you import data from a SQL database, the package handling the import can perfectly well create columns that allow missing values whenever the original SQL column allowed them, even if the data contains no missing values. Actually, what we have changed is that DataFrame constructors now always respect the input types, which should, if anything, help interoperability.
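As a rough illustration of the distinction, using only Base Julia and no DataFrames.jl calls: whether a column can hold `missing` is a property of its element type, not of the values it currently contains. The names `nullable_col`, `strict_col` and `accepts_missing` are hypothetical, and the idea that a loader hands such a vector to the constructor is the assumption taken from the post above.

```julia
# A column as a SQL loader might create it for a nullable field:
# it contains no missing values, but its type allows them.
nullable_col = Union{Missing, Int}[1, 2, 3]

# A column whose element type was narrowed to the values actually seen.
strict_col = [1, 2, 3]

# Try to append `missing` to a copy of the column and report whether it worked.
function accepts_missing(v::AbstractVector)
    try
        push!(copy(v), missing)
        return true
    catch err
        # Converting `missing` to a non-missing element type throws a MethodError.
        err isa MethodError || rethrow()
        return false
    end
end

# Widening the strict column afterwards is an explicit, separate step.
widened_col = convert(Vector{Union{Missing, Int}}, strict_col)
```

Here `accepts_missing(nullable_col)` is `true` and `accepts_missing(strict_col)` is `false`, even though both hold the same values. A constructor that respects input types would keep `nullable_col` as it is rather than narrowing it.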