JuliaDB, DataFrames: Speculations about the future of data packages

There are efforts to make things work by default on both backends through a common interface, for example the pull request "WIP replace DataFrames with DataStreams Data.Table" by kleinschmidt (Pull Request #57, JuliaStats/StatsModels.jl on GitHub).

On the specific case of JuliaDB and DataFrames, I actually hope there will be some sort of convergence in the future. The technical difference between the two is that JuliaDB tables encode the names and types of their columns in the table's own type, and DataFrames do not. For DataFrames this means lower performance in some cases, but also shorter compile times. Both could benefit from a quick way to convert to the other depending on the use case (see this comment or this issue).
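To make that difference concrete, here is a minimal sketch in plain Julia with no packages. It is not how either package is implemented: a `NamedTuple` of vectors stands in for the fully typed (JuliaDB-style) table and a `Dict` for the type-free (DataFrames-style) one, and the names `typed`, `untyped` and `typed2` are only illustrative.

```julia
# Fully typed table (stand-in): a NamedTuple of vectors.
# Column names and element types are part of the container's type.
typed = (id = [1, 2, 3], score = [0.5, 1.5, 2.5])

# Type-free table (stand-in): names and columns are ordinary runtime values,
# so the container's type says nothing about which columns exist.
untyped = Dict{Symbol,AbstractVector}(:id => [1, 2, 3], :score => [0.5, 1.5, 2.5])

println(typeof(typed))    # mentions :id, :score and the two vector types
println(typeof(untyped))  # just Dict{Symbol, AbstractVector}

# Adding a column in place is only possible for the type-free version.
untyped[:flag] = [true, false, true]

# For the typed version, a new column means a new object of a new type,
# and code using it has to be compiled again for that type.
typed2 = merge(typed, (flag = [true, false, true],))
println(typeof(typed2) == typeof(typed))
```

The last line shows the trade-off: the typed version lets the compiler specialize on the exact columns, but every change of schema produces a new type.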

I have started working on ways to get some effort towards convergence going (where DataFrames would be the type-free version of JuliaDB and, vice versa, JuliaDB would be the fully typed version of DataFrames). The plan could be as follows:

  • Take the columnar storage format out of JuliaDB. There is now a StructArrays package that can be used to represent the columns of a table efficiently and that allows fast row iteration
  • Try to unify the API for data manipulation between DataFrames and JuliaDB
  • Have JuliaDB take a dependency on DataFrames and use it as the type-less (and thus modifiable in place) version. This is still tricky, as I think DataFrames don’t have the concept of primary columns and need names for their columns, whereas JuliaDB also accepts numbered columns
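As a rough illustration of the first and last points, here is a sketch in plain Julia of converting between a typed and a type-free column store and of iterating rows over columnar storage. The helpers `to_untyped`, `to_typed` and `rows` are hypothetical names made up for this example. They are not part of DataFrames, JuliaDB or StructArrays, and primary columns and numbered columns are not handled at all.

```julia
# Hypothetical helpers, for illustration only.

# Typed -> type-free: move names and element types out of the type.
to_untyped(cols::NamedTuple) =
    Dict{Symbol,AbstractVector}(name => col for (name, col) in pairs(cols))

# Type-free -> typed: pick a column order and rebuild a NamedTuple.
# The result type is only known at run time, so the usual trick is to do
# this once and pass the result to a function that then gets specialized.
to_typed(cols::AbstractDict{Symbol}, order) =
    NamedTuple{Tuple(order)}(Tuple(cols[name] for name in order))

# Row iteration over columnar storage: each row is a small NamedTuple
# built from the i-th element of every column.
rows(cols::NamedTuple{names}) where {names} =
    (NamedTuple{names}(vals) for vals in zip(values(cols)...))

typed   = (id = [1, 2, 3], score = [0.5, 1.5, 2.5])
untyped = to_untyped(typed)
back    = to_typed(untyped, (:id, :score))

# Fields of each row are accessed by name.
total = sum(r.score for r in rows(back))
println(total)
```

Note that `to_typed` needs the column order to be supplied, because a `Dict` does not keep one. That is a small example of the kind of detail a unified API would have to settle.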

However, this requires a lot of things to actually happen (and JuliaDB is still being updated to work with Julia 0.7). Unifying the API will probably require a lot of discussion, and everybody needs to be on board with it.
