DataFrames.jl development survey

Hi,

Before the “Julia & Data: An Evolving Ecosystem” BOF I wanted to run a quick poll on what you think about the different major directions in DataFrames.jl development. So please upvote the things you perceive as important. This will help me decide where to focus the effort in the short term (and to discuss this during the BOF).

  • add threading support
  • faster joins
  • faster aggregation
  • adding more expressiveness to the mini-language
  • adding more utility functions in the package
  • display improvements
  • decoupling of DataFramesBase.jl
  • solving problem with CategoricalArrays.jl compilation times
  • adding metadata to DataFrame

Now let me give some comments on the options:

8 Likes

I think that factoring out join algorithms into their own implementation-agnostic mini-package would be a great thing, especially if done in a modular way so that implementations of the Tables.jl ecosystem can select the building blocks they prefer.

7 Likes

What is the recommended voting choice for “make DataFrames simpler to use for regular data analysis (such as in Stata or R)”?

For example, “adding more expressiveness to the mini-language” includes the open issue “skipping missing values more easily” (https://github.com/JuliaData/DataFrames.jl/issues/2314), so I am considering voting for it. But the choice also includes polar opposite issues such as “Index to grouped data frame using Dict” (https://github.com/JuliaData/DataFrames.jl/pull/2281).

3 Likes

This is exactly the challenge - different people find different things useful. All feature requests in this group for some users (not all) fall into “make DataFrames simpler to use” I guess. In particular, the problem is that the more helpers we add the harder the ecosystem becomes to master as a whole.

Therefore, for the purpose of the “high level” survey I put them into one bag to see the relative importance of this aspect versus others, like performance or internal design.

Having said that, as you probably noticed, we track and manage all the issues in all the aspects mentioned, so this will mainly influence “when” (not “if”) things are shipped. E.g. if we decide that performance is the priority, this means that in the coming 6 months we will probably focus mostly on this aspect, as performance tuning requires a lot of developer time and testing.

2 Likes

Thanks for the survey. n=1 perspective: historically, I’ve really enjoyed working with DataFrames.jl because it’s simple (from the user’s perspective, not necessarily under the hood). Columns are normal vectors, indexing and for loops work exactly the way you’d hope, etc. Most of my code looks just like normal Julia code, not a special data sub-language. Copies and views work just like in the rest of Julia. Etc. In general, you don’t have to know special DataFrames rules or logic or functions to work with dataframes; you can write the same kind of code you use elsewhere and it’ll be readable and fast. DataFrames keeps your data organized nicely together, knows how to perform joins, and that’s…about it? (Again, I know there’s a ton under the surface and am in no way trying to discount the deep work there; this is appreciation for the great success of making things simple to the user.)

Contrast to Python pandas, where dataframes and series are really complicated. They define special types with special behavior, and I find the mental overhead cripplingly high: is this a copy or a view? Do I need to reset_index? Or is that ignore_index? Are these Python or Pandas datetime objects? Is this sort in place? Will this append tank performance?

With this perspective in mind, I’m pretty hesitant about adding complexity to DataFrames: metadata on DataFrames, more utility functions, etc. It’s easy to say “don’t use it if you don’t want it,” which is probably mostly right most of the time. But there’s inevitably some overhead in mental effort, in trying to understand tools, in performance, in what techniques become standard in the ecosystem, in interoperability with other languages. (Everyone has tabular data, but maybe not tabular data with the same notion of metadata.)

As an example of complexity, the split-apply-combine docs are pretty complicated, and I have to look up stuff there all the time. The special cols => function => target_col syntax is, while elegant, somewhat unlike syntax found elsewhere. Hard to make it (and all the variants thereof in the docs) stick in the brain. In contrast, the syntax

combine(grouped_df) do df
    DataFrame(...)
end

feels more like the rest of Julia to me, and thus requires less mental overhead. Again, n=1, YMMV. I understand there are deep reasons for the current system and am not trying to incite a design complaint or debate, just to illustrate my experience.
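To make the comparison above concrete, here is a minimal sketch of the two spellings side by side. The column names g, x and x_mean and the tiny data set are made up for illustration, and the sketch assumes a DataFrames.jl version that supports both forms (0.21 or later).

```julia
using DataFrames
using Statistics

# Illustrative data: two groups with a numeric column.
df = DataFrame(g = [1, 1, 2, 2, 2], x = [1.0, 2.0, 3.0, 4.0, 5.0])
gdf = groupby(df, :g)

# "Mini-language" form: source column => function => target column name.
by_pairs = combine(gdf, :x => mean => :x_mean)

# do-block form: the block receives each group as a SubDataFrame
# and returns a one-row DataFrame that is combined into the result.
by_block = combine(gdf) do sdf
    DataFrame(x_mean = mean(sdf.x))
end
```

Both calls should produce the same table with columns g and x_mean; the difference is in how the operation is written, and (as discussed further down the thread) in how fast it runs on large inputs.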

Soooo…all this is a long way to say that
a) I love DataFrames.jl and use it daily
b) My vote is for development to focus on making the package lightweight - both in code and mental overhead - rather than more complicated/powerful. Julia itself is already powerful, and the joy of using Julia rather than pandas or similar is that you don’t need deep magic to protect you from your tools :slight_smile:

18 Likes

Thank you for this comment. Actually you probably see from the initial post that we see this angle.

My general comment to this is:

Except for !, which was needed because in Base you do not have “no copy” access to a column or columns of a matrix except when using a view, I think we are just consistent with Base in “fundamental” operations, which I tried to show during the workshop on indexing at JuliaCon 2020. This syntax is powerful enough to ensure that you can express everything relatively easily, which means that I think many users can just use DataFrames.jl without having to learn what I described as the “mini-language”, which admittedly is complex.
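As a small illustration of the point about ! (a sketch with made-up data, not taken from the workshop material): in Base, indexing a matrix column with : copies and only a view avoids the copy, while in DataFrames.jl the row selector ! returns the stored column itself and : returns a copy.

```julia
using DataFrames

# Base: indexing a matrix column copies; a view does not.
m = [1 2; 3 4]
m_copy = m[:, 1]         # independent copy of the first column
m_view = @view m[:, 1]   # view into m, no copy
m_view[1] = 10           # writes through to m; m_copy is unaffected

# DataFrames.jl: `!` gives the stored column without copying, `:` copies.
df = DataFrame(x = [1, 2, 3])
col_same = df[!, :x]     # the very vector stored in df
col_copy = df[:, :x]     # a fresh copy
col_same[1] = 100        # changes df; col_copy is unaffected
```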

Now, why does this “mini-language” exist? The original reason is performance. Consider this simple example:

julia> df = DataFrame(g = rand(10^6));

julia> gdf = groupby(df, :g);

julia> @benchmark combine(nrow, $gdf)
BenchmarkTools.Trial:
  memory estimate:  23.85 MiB
  allocs estimate:  173
  --------------
  minimum time:     11.235 ms (0.00% GC)
  median time:      11.943 ms (0.00% GC)
  mean time:        13.909 ms (15.04% GC)
  maximum time:     19.129 ms (37.79% GC)
  --------------
  samples:          360
  evals/sample:     1

julia> @benchmark combine(sdf -> nrow(sdf), $gdf)
BenchmarkTools.Trial:
  memory estimate:  228.88 MiB
  allocs estimate:  7999119
  --------------
  minimum time:     331.926 ms (2.87% GC)
  median time:      355.022 ms (4.72% GC)
  mean time:        367.095 ms (7.04% GC)
  maximum time:     537.080 ms (29.09% GC)
  --------------
  samples:          14
  evals/sample:     1

or

julia> @benchmark filter(:g => <(0.5), $df)
BenchmarkTools.Trial:
  memory estimate:  7.75 MiB
  allocs estimate:  23
  --------------
  minimum time:     3.935 ms (0.00% GC)
  median time:      4.176 ms (0.00% GC)
  mean time:        4.930 ms (14.67% GC)
  maximum time:     11.904 ms (64.40% GC)
  --------------
  samples:          1012
  evals/sample:     1

julia> @benchmark filter(row -> row.g < 0.5, $df)
BenchmarkTools.Trial:
  memory estimate:  53.52 MiB
  allocs estimate:  2999511
  --------------
  minimum time:     94.269 ms (0.00% GC)
  median time:      96.451 ms (1.36% GC)
  mean time:        99.182 ms (1.36% GC)
  maximum time:     112.475 ms (0.59% GC)
  --------------
  samples:          51
  evals/sample:     1

And this is why it was originally introduced.
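For readers who want to check that the two filter spellings benchmarked above differ only in speed, here is a minimal sketch on a tiny made-up data set (no BenchmarkTools needed). My understanding, stated as an assumption rather than a description of the internals, is that the pair form lets DataFrames.jl hand the column vector to the predicate directly, while the row form goes through a DataFrameRow for every row.

```julia
using DataFrames

# Small illustrative data set; the benchmarks above use 10^6 random values.
df = DataFrame(g = [0.1, 0.7, 0.3, 0.9])

# Pair form: the predicate is applied to the values of column :g.
fast = filter(:g => <(0.5), df)

# Row form: the predicate receives each row and looks up :g itself.
generic = filter(row -> row.g < 0.5, df)
```

Both calls keep the rows with g equal to 0.1 and 0.3, so on small data either spelling is fine; the gap only matters at scale.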

And now the second step - once it came into existence - many people found it useful and asked for extensions to cover more and more use cases, which I can also understand.

So in summary, what we have is a “compact Base” that is aligned with Julia Base (and I guess for many users this is mostly what they need) and a “mini language” which favors terse expressiveness and performance. Now probably the challenge is that there are not many questions on Slack or SO about the “Base” part, as this is probably easy to use, but many questions about the “mini language”, because it is challenging to master. So the feeling is probably that the package revolves around these extensions, but in reality they are only about 25% of the code base and the functionality we provide (but admittedly with select/transform/combine you can do almost everything now).

EDIT: as a task for me: maybe I should give more examples in answers of how to do things just using “Base” functionality :smile:.
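A minimal sketch of what such “Base”-style code can look like, using only indexing, broadcasting and an ordinary loop. The data and the helper group_sums are hypothetical and only meant for illustration; they are not part of DataFrames.jl.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a", "b"], x = [1, 2, 3, 4])

# Row selection with a Boolean mask, exactly as for a matrix.
sub = df[df.x .> 1, :]

# New column through ordinary broadcasting on a plain vector.
df.y = 2 .* df.x

# Grouped sums with a plain loop over the column vectors.
# `group_sums` is a hypothetical helper written for this example.
function group_sums(groups, values)
    sums = Dict{eltype(groups), eltype(values)}()
    for (g, v) in zip(groups, values)
        sums[g] = get(sums, g, zero(v)) + v
    end
    return sums
end

sums = group_sums(df.g, df.x)
```

The loop lives in a function that takes the columns as arguments, so it runs on concretely typed vectors like any other Julia code.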

EDIT 2: in other words - without the “mini language” we would not be able to compete with data.table in terms of performance (and now we can, and we can be even faster than we are now)

2 Likes

In general there are frequently discussions about whether users really understand what a Symbol is and when to use :x as syntax for column selection. R doesn’t really have this problem as you can use Strings for all your column selection needs, outside of dplyr of course.

So this issue falls into the category of making it simpler to use for regular data analysis because, if implemented, it’s one less way for new users to get stressed out about column selection.
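For reference, a short sketch of the column selection spellings in question, assuming DataFrames.jl 0.21 or later, where strings are accepted as column selectors alongside Symbols. The data is made up.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3], y = ["a", "b", "c"])

# One column, three spellings.
by_symbol   = df[:, :x]    # Symbol selector
by_string   = df[:, "x"]   # String selector
by_property = df.x         # property access

# The same choice exists in select and related functions.
sel_symbol = select(df, :x)
sel_string = select(df, "x")
```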

2 Likes

Thank you, that was not clear to me. Still, I suggest that an option “make DataFrames simpler to use for regular data analysis” would probably have generated more votes than “adding more expressiveness to the mini-language” (currently 32%).

1 Like

What is “regular data analysis”?

1 Like

I think that performance improvements should continue to be a top priority. Threading, faster joins, faster aggregation… all of those need to happen.

I like that the “mini-language” (I guess we’re calling it that) made things fast. It is somewhat more compact and nicer to write, but that to me is secondary to what it really gives – speed. So adding to this expressiveness is less important to me unless it brings more functionality to this faster subset of operations. Additional expressiveness or utility functions… seems like a lot of this could go into the DataFramesMeta rewrite.

I don’t quite see the point of a DataFramesBase.jl when DataFrames.jl already has Tables.jl on one side and DataFramesMeta.jl on the other. It seems like DataFrames.jl IS DataFramesBase.jl.

Fair comment. To me regular data analysis here means using DataFrames.jl instead of the equivalent table commands in Stata or data frame commands in R. It is a Julia beginner’s perspective. For example, I immediately understand the value of “skipping missing values more easily”. But as a beginner I did not understand the value of “Index to grouped data frame using Dict” until @pdeffebach explained it.

My wish would be for DataFrames to match the performance and concise syntax of table operations in kdb+/q.

There are two reasons:

  1. if someone uses e.g. Queryverse then a lot of functionality is duplicated, so it would be better not to pollute the namespace with functions that would not be used anyway
  2. a particular case is CategoricalArrays.jl (but I have a separate question for it) - as this DataFramesBase.jl package would not depend on it. The issue with CategoricalArrays.jl is that it currently causes a lot of invalidation of compiled methods in Base (but if this can be resolved by other means this reason will mostly disappear)

2 Likes

My apologies if this is naive, or technically difficult to implement, or if this has already been discussed elsewhere. It seems to me like the best way to add metadata to data frames is not to add metadata to data frames but to make it easy to wrap data frames and implement an AbstractDataFrame interface for your wrapper type.

For example, if there were a small set of functions required to implement the AbstractDataFrame interface, say foo, bar, and baz, then I could do something like this:

struct MetaDataFrame <: AbstractDataFrame
    name::String
    df::DataFrame
end

foo(mdf::MetaDataFrame, args...) = foo(mdf.df, args...)
bar(mdf::MetaDataFrame, args...) = bar(mdf.df, args...)
baz(mdf::MetaDataFrame, args...) = baz(mdf.df, args...)

Does that make sense?

1 Like

It is not possible to do it easily, as most functionality is type based (i.e. using dispatch) not trait based in DataFrames.jl.
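What does work without any changes to DataFrames.jl is plain composition, with the limitation that follows from the dispatch-based design described above. The sketch below uses a hypothetical wrapper NamedFrame that does not subtype AbstractDataFrame and forwards three Base functions by hand; anything not forwarded (joins, groupby, and so on) will not accept the wrapper.

```julia
using DataFrames

# Hypothetical wrapper type, for illustration only: metadata plus a DataFrame.
struct NamedFrame
    name::String
    df::DataFrame
end

# Forward a handful of Base functions to the wrapped DataFrame.
Base.size(nf::NamedFrame) = size(nf.df)
Base.names(nf::NamedFrame) = names(nf.df)
Base.getindex(nf::NamedFrame, args...) = getindex(nf.df, args...)

nf = NamedFrame("survey", DataFrame(x = [1, 2]))
col = nf[!, :x]   # forwarded to the wrapped DataFrame's getindex
```

Every extra function has to be forwarded explicitly, which is exactly the cost the reply above points at.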

This might sound crazy and maybe useless, but I do like the idea of modularity and small packages that do one thing right. I wonder if it’s possible to break out DataFramesBase as the provider of the DF, just the sink, and provide display options. Other operations (joins, grouping, sorting, etc) can live in their own small packages and we can provide a “DataAnalysis” stack with something called DataFramesMeta (I know that already exists). The idea behind something like this is that it might provide flexibility for package developers to focus on some aspects of the system and improve it, without dealing with a huge package. And other operations can be added as a module. And users that just want to use it, get the whole metapackage and that’s it. And experienced users can just load what they need, because sometimes people don’t do joins or sorting.

Anyway, this might be too much work and I’m not a collaborator on the project, but I really like that Julia mentality of abstractions that can help generalize and other packages taking advantage of those abstractions.

1 Like

First of all, a big thank you to Bogumil, Peter, Milan, Jacob, and all others that have been working on DataFrames.jl and related packages.

For what it’s worth, I also still feel that it takes me too long (in terms of time to code and characters to type) to perform routine data wrangling operations (this may, of course, be because of my inability to get rid of old habits). But given that different people have different habits, perhaps the high-level interfaces that cater to these different habits do not belong in DataFrames.jl (e.g. my experimental Stata-like interface). If that’s the strategy, I think it would be good to prioritize going for a 1.0 and keeping that interface stable.

That said, I think performance would be very important too. After all, this is one of the things that brought many of us to Julia in the first place.

1 Like

The contract I try to promise is that what is in 0.21 will not be broken, except for cases which we consider bugs or semi-bugs (i.e. what we do now is mostly adding functionality and improving performance). You can see what would be broken under the breaking label on GitHub, and you will notice that these are mostly corner cases.

EDIT

And because of this I think you should treat 0.21 already as a stable base for these “extension” packages which are welcome (and then we can focus in DataFrames.jl on performance - per the results of the survey).

2 Likes

Thanks for the feedback. fwiw I think that a lot of improvement can be had with just

  1. Not requiring :x and letting users write x
  2. Some sort of SkipMissingDataFrame or skipmissing syntax at the transform call.

These are definitely a focus for me for development with DataFramesMeta.
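For context on the second point, this is a minimal sketch of how missing values are skipped explicitly in plain DataFrames.jl calls, by wrapping the column in skipmissing inside the function passed to combine. The data and the column name x_mean are made up for illustration.

```julia
using DataFrames
using Statistics

df = DataFrame(g = [1, 1, 2, 2], x = [1.0, missing, 3.0, 5.0])
gdf = groupby(df, :g)

# Without skipping, the mean of group 1 is missing.
plain = combine(gdf, :x => mean => :x_mean)

# Skipping missings explicitly by composing mean with skipmissing.
skipped = combine(gdf, :x => (mean ∘ skipmissing) => :x_mean)
```

The composition has to be repeated for every column and every function, which is the verbosity the proposal above aims to remove.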

2 Likes

Fantastic, thanks for the clarification!

I just read your proposal on DataFramesMeta (apologies, I’m a bit behind on everything), which sounds very good to me. Looking forward to it, and thanks for taking over DataFramesMeta!