JuliaDB, DataFrames: Speculations about the future of data packages

It is kinda unsettling for newcomers (at least those who are used to R or pandas) to see that there are different data-storage packages with the same goal, such as DataFrames and JuliaDB, especially since many other packages depend on one or the other.

I was wondering what the best future for those packages would be (for Julia and its users)… Will they eventually merge? Will one take over the others?

4 Likes

I think I understand where this is coming from but I don’t see a problem.

Firstly, I don’t think DataFrames and JuliaDB have “the same goal”. DataFrames is a lightweight package for in-memory operations, perhaps akin to Python’s pandas and R’s data.frame (although there’s nothing lightweight about pandas). JuliaDB is a more ambitious project for persistent datasets and parallel out-of-core processing, perhaps akin to Python’s Dask and R’s (Microsoft’s) RevoScaleR. So they serve different audiences.

Secondly, I think this kind of thing is natural for open source - people can and will develop different things based on their needs and interests. This might indeed sometimes be frustrating and suboptimal for end users, but it is also good for innovation (and seems to be encouraged by the Julia community). The impression that Python and R have less of it is, in my opinion, rather deceptive. I think Python has less of it because all of the data stack in Python is kind of “bolted on”, and the initial investment is so high that it makes sense to gravitate around the initial/large projects like NumPy and pandas regardless of their shortcomings. R has less of it because most of the fundamental data structures (like data.frames) are built into the language, but there is little consistency - compare, for example, the base R, tidyverse, and data.table ways of doing things. Or think about the numerous plotting libraries available in both Python and R. On top of that, there is just the factor of time - R and Python have been around long enough that some packages / approaches have started to dominate and become de facto standards, although there were/are alternatives. That Julia has many packages for the same thing just reflects the youth and quality of the language (quality because one can actually write this kind of fundamental package with relative ease in Julia instead of in some lower-level language).

As for what would be the best future, I think it is a shared API and/or query language, both of which are being worked on - there are query frameworks (Query, DataFramesMeta and JuliaDBMeta) as well as active work on what a “table” interface should look like in general. Both of these efforts will provide a consistent API for end users regardless of the ‘backend’ storage/manipulation format.

In sum, I think the “unsettling” feeling is just a normal reaction to learning, and things are actually in quite good shape in Julia in this regard.

16 Likes

This is a good point. Although I agree that DataFrames and JuliaDB were built with different aims and audiences in mind, that does not mean they will actually be used that way.

For example, although JuliaDB aims at being an ambitious package allowing for technically complex stuff (as you mentioned, parallel processing or dataset persistence), it also offers nice features for data manipulation (similar to the tidyverse in R). As such, many people familiar with R and dplyr might, in fact, be more comfortable using JuliaDB than DataFrames for “simple stuff” (light data storage and manipulation), even though that was not the primary goal of JuliaDB’s devs.

This variety of packages is a good thing, no doubt about it, reflecting Julia’s dynamic community. But it is also a challenge for package developers, who must either choose which backend to support (for example, GLM.jl works with DataFrames) or support both of them (which could become difficult to maintain).

Indeed, a unified layer on top, supporting different backends, could be the answer…

In general I think that the future should be a common API for tabular data that is supported by several backends. DataFrames.jl was simply created much earlier, so many packages support only it, but I guess this can change in the future.

Regarding your comment

What features of dplyr are you missing in the DataFrames.jl + DataFramesMeta.jl combination? If there are any, I think we can add them.

2 Likes

There are efforts to make things work by default on both backends, with a common interface: https://github.com/JuliaStats/StatsModels.jl/pull/57

On the specific case of JuliaDB and DataFrames, I actually hope there will be some sort of convergence in the future. The technical difference between the two is that JuliaDB tables encode the types and names of columns in their type and DataFrames do not (which means lower performance in some cases but less compile time). Both could benefit from a quick way to translate to the other depending on the use case (see this comment or this issue).

I started working on ways to kick off some effort towards convergence (where DataFrames would be the type-free version of JuliaDB and, vice versa, JuliaDB would be the fully typed version of DataFrames), and the plan could be as follows:

  • Take the columnar storage format out of JuliaDB: there is a StructArrays package now that can be used to represent the columns of a table efficiently and allows fast row iteration
  • Try to unify the API for data manipulation between DataFrames and JuliaDB
  • For JuliaDB to take a dependency on DataFrames and use it as the type-less (and thus modifiable in place) version (though it’s still tricky, as I think DataFrames don’t have the concept of primary columns and need names for the columns, whereas JuliaDB also accepts numbered columns)

However, this requires a lot of things to actually happen (and JuliaDB is still updating to work with Julia 0.7), and unifying the API will probably require a lot of discussion and everybody being on board with it.
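To make the "names and types in the type" distinction concrete, here is a minimal sketch in plain Base Julia. It uses neither JuliaDB nor DataFrames: the named tuple of vectors and the `Dict` are only stand-ins for the two designs, and `to_untyped` / `to_typed` are hypothetical helper names, not functions from either package.

```julia
# Illustration only: stand-ins for the two designs, not the packages' real types.

# "Fully typed" table: column names and element types are part of typeof(typed).
typed = (id = [1, 2, 3], score = [0.5, 1.5, 2.5])

# "Type-free" table: column names and element types are only known at run time.
untyped = Dict{Symbol,Vector}(:id => [1, 2, 3], :score => [0.5, 1.5, 2.5])

# The element type of a column can be read off the typed table's type alone...
typed_eltype = eltype(fieldtype(typeof(typed), :score))

# ...while the untyped container's type only says "some Vector".
untyped_valtype = valtype(typeof(untyped))

# The untyped table can gain a column in place; the typed one needs a new object.
untyped[:rank] = [3, 2, 1]
typed_with_rank = merge(typed, (rank = [3, 2, 1],))

# Hypothetical helpers for translating between the two representations.
to_untyped(nt::NamedTuple) = Dict{Symbol,Vector}(k => v for (k, v) in pairs(nt))
to_typed(d::AbstractDict{Symbol}) = NamedTuple{Tuple(keys(d))}(Tuple(values(d)))
```

The trade-off mentioned above shows up here as well: code specialised on `typeof(typed)` has to be recompiled for every new combination of column names and types, whereas code written against `untyped` is compiled once but cannot rely on inferred column types.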

8 Likes

A few of us actually discussed this at JuliaCon this year and one comment I made was that it’s actually a credit to the Julia language itself that there are several whole frameworks for data munging being actively developed (DataFrames, JuliaDB, Query.jl, TypedTables + SplitApplyCombine, etc.). And guess how much C code is involved in any of those projects? Zero!

As others have stated, things may converge, but they also might not. Personally, I think it’s powerful to have specific data structures optimized for different purposes, but I also realize that it increases complexity for new users and can be overwhelming.

5 Likes

Yeah, I don’t think there is any one table-like data structure that covers all use cases. There is a lot of room for specialized structures with different trade-offs. One thing I wish for is some sort of well-defined table interface that the different structures all implement. That way, things that operate on tables could just be defined in terms of the interface, and interoperability would be a lot better.

It would also be good to have some clarity on what the default, go-to solution is for when you don’t need anything special. I would guess DataFrames is that?

Well, none really, for now. I’m still quite new to Julia, and the combination of DataFrames.jl and DataFramesMeta.jl does indeed look awesome (although I would maybe prefer these data manipulation functions to be implemented as regular functions instead of macros, but that’s another topic and probably related to my lack of experience in Julia).
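As a rough sketch of what such an interface could look like, here is a Base-only example. The function names `colnames` and `column`, the `DictTable` type, and `colmeans` are all made up for illustration; this is not the API of Tables.jl, TableTraits.jl, or any other existing package.

```julia
# Hypothetical minimal table interface: two generic functions every table type implements.
function colnames end   # colnames(table) -> collection of Symbols
function column end     # column(table, name) -> AbstractVector

# Backend 1: a named tuple of vectors (names and types known at compile time).
colnames(t::NamedTuple) = keys(t)
column(t::NamedTuple, name::Symbol) = getfield(t, name)

# Backend 2: a struct wrapping a Dict (names and types known only at run time).
struct DictTable
    cols::Dict{Symbol,Vector}
end
colnames(t::DictTable) = sort!(collect(keys(t.cols)))
column(t::DictTable, name::Symbol) = t.cols[name]

# Generic code written only against the interface works with either backend.
colmeans(t) = Dict(n => sum(column(t, n)) / length(column(t, n)) for n in colnames(t))

nt_table = (a = [1, 2, 3], b = [2.0, 4.0])
dict_table = DictTable(Dict{Symbol,Vector}(:a => [1, 2, 3], :b => [2.0, 4.0]))
```

Here `colmeans(nt_table)` and `colmeans(dict_table)` go through the same code path, which is the kind of interoperability the post is asking for: a package author writes against the interface once instead of once per table type.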

Anyway, the “one default format (DataFrames?), and then specialized structures for specific use cases” option sounds convenient indeed.

Check out @linq's new features. With the @linq macro, you can omit the @ in chaining.

df2 = @linq df |>
    transform(t = 5)

1 Like

Exactly, @linq should provide convenient chaining functionality, with the ability to omit the data frame name in front, like in dplyr. In fact, there is fine control over three cases:

  • a Symbol is treated as a column;
  • ^() preserves the Symbol as is;
  • cols(variable) lets you get the column name dynamically from a variable (this change should be merged in a few days and released with full support for Julia 1.0; currently it is _I_).

That is why macros are needed, but this allows for very flexible and safe usage of DataFramesMeta.

1 Like

One nice thing about R is that data.tables and tibbles can easily be converted to and from data frames, so it is relatively painless for users to move back and forth between packages. If there is a base table type in Julia and all the packages support conversion to and from that type, I think new users will not be too put off.

3 Likes

I believe you can already convert seamlessly between most table data types in Julia thanks to TableTraits.jl and IterableTables.jl. And more exciting stuff is coming to v1.0 thanks to Tables.jl.
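For a feel of what such a conversion involves, here is a small Base-only sketch that moves between a column-oriented and a row-oriented representation. The names `torows` and `tocolumns` are hypothetical and only illustrate the idea; the packages mentioned above provide their own, more general mechanisms.

```julia
# Column-oriented table: a named tuple of equally long vectors (made-up data).
cols = (name = ["a", "b"], value = [1.0, 2.0])

# Columns -> rows: build one named tuple per row index.
torows(c::NamedTuple) =
    [NamedTuple{keys(c)}(map(v -> v[i], values(c))) for i in 1:length(first(c))]

# Rows -> columns: collect each field across all rows into a vector.
# Assumes a non-empty vector whose rows all share the same field names.
tocolumns(rows::AbstractVector{<:NamedTuple}) =
    NamedTuple{keys(first(rows))}(map(k -> [r[k] for r in rows], keys(first(rows))))

rows = torows(cols)
roundtrip = tocolumns(rows)
```

Once every table type can produce one of these two shapes, converting between any pair of table types reduces to going through the shared shape.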

Have a look at the Queryverse tutorial on YouTube, for example. You don’t even need to worry about what specific data type you’re using while processing the data…

1 Like

What is nice is that R does this conversion under the hood, and a new user doesn’t have to even know the existence or difference between a dataframe and a tibble. He can load whatever dataset and then start applying whatever functions, without caring about what function applies to what type.

Also, I am wondering how that seamless conversion between table data types would affect the workflow of developers. Should they continue building functions for each type individually, or could they eventually build functions for, let’s say, Types.jl or IterableTables.jl tables, so that if the user provides a DataFrame or a JuliaDB table, the function would still work?

Take a look at the Queryverse, it provides essentially the table type agnostic stuff you mentioned.

3 Likes

Hi, I don’t know if you are familiar with dplyr, but for me the big bucks lie in PIPING.
Piping in the whole tidyverse is really easy because all the functions have a consistent syntax:

table |> function(pipe_result, args)

This is what I miss the most in Julia. Secondly, I miss some column manipulation functions: if I understood correctly, JuliaDB iterates over named tuples, while dplyr iterates over column vectors. Hence, the way I most often transform data is:

table |>
select(column_names) |>
mutate(new_column = old_column |> function_to_be_applied_over_vector,
new_column2 = old_column1 + old_column_2 ) |>
filter(is_odd(old_column_name))

dplyr even does some optimisation within the pipe, so in the previous example it would first apply the filter and then calculate the rest (this certainly applies to database connections, where it translates the pipeline into a SQL query; I am not sure about in-memory operations).

Tidyverse does this. Data structure interop between packages is hardly a strength of R. In fact probably the opposite?

1 Like

Check out DataFramesMeta.jl for piping. Like dplyr it also has some optimizations to ensure everything is processed as a single function.

There is also JuliaDBMeta which works with JuliaDB.

2 Likes

Take a look at Query.jl, and in particular the standalone query commands. I believe we should have you more or less completely covered for the examples from dplyr you gave.

4 Likes

Ciao Giuliano, Julia has the pipe operator as well.

What’s more, the Pipe.jl package extends it further, making it one of the most flexible tools for data analysis.

@davidanthoff, @pdeffebach, thank you for all the comments! Honestly, I checked the code, and I guess I will stick with vanilla JuliaDB syntax for now - I think there is nothing wrong with it, but all the curly brackets and @macros feel more like they complicate the syntax than simplify it.
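As a rough illustration of the built-in pipe, here is a dplyr-style chain written with Base Julia only, without macros. The helpers `keep`, `mutate` and `pick` are hypothetical names defined on the spot (they are not from Pipe.jl, Query.jl or DataFramesMeta.jl), and the table is made-up data stored as a vector of named tuples.

```julia
# Hypothetical helpers: each returns a function of the rows, so the steps can be
# chained with Base's |> operator.
keep(pred) = rows -> filter(pred, rows)                                  # like dplyr's filter
mutate(f) = rows -> map(r -> merge(r, f(r)), rows)                       # add or replace fields
pick(names::Symbol...) = rows -> map(r -> NamedTuple{names}(r), rows)    # like dplyr's select

# Made-up data: one named tuple per row.
table = [(id = 1, x = 10), (id = 2, x = 20), (id = 3, x = 30)]

result = table |>
    keep(r -> isodd(r.id)) |>
    mutate(r -> (y = 2 * r.x,)) |>
    pick(:id, :y)
```

Because Base's `|>` passes its left-hand side as the single argument of the function on its right, each step has to be a one-argument function; returning closures from the helpers is one way to get the "data first, arguments after" feel of the tidyverse. Note that this row-wise sketch does no query optimisation of the kind described for dplyr above.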