DataFrames.jl development survey

Just to pile on to my previous comment, making DataFrame operations nearly as performant as operations on Vectors or Arrays would be, I think, a dream outcome. For example, I am doing an optimization problem involving simulating panels of data. There is a nontrivial cost to doing calculations on this when the simulated data is in a DataFrame versus the approach where I keep everything in arrays like I’m a MATLAB loser. Of course it is much nicer to write the DataFrames-style code, which is a huge plus, but when the performance gap is hit millions of times it starts to really add up.

It is all still faster than MATLAB anyway so why am I even complaining?

In any place where you hit this kind of performance bottleneck, use one of (depending on the use case):

  1. a function barrier extracting the columns you need
  2. Tables.columns
  3. Tables.namedtupleiterator

to extract the columns you need and run the computation on them fast (a rough sketch of the function-barrier approach is below).
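For concreteness, here is a minimal sketch of option 1, the function barrier (the weighted_mean kernel and the toy data frame are made up for illustration):

using DataFrames

# Type-stable kernel: once the columns arrive as concrete vectors,
# Julia specializes this function and the loop runs at plain-array speed.
function weighted_mean(values::AbstractVector, weights::AbstractVector)
    s = 0.0
    w = 0.0
    @inbounds for i in eachindex(values, weights)
        s += values[i] * weights[i]
        w += weights[i]
    end
    return s / w
end

df = DataFrame(y = rand(10^6), w = rand(10^6))

# Extract the columns once, then cross the function barrier.
weighted_mean(df.y, df.w)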

1 Like

Just make it as fast as possible.

1 Like

Thank you for all the hard work you’ve put into this package - I use it pretty much daily and love it.

To me the order of priorities goes like this:
Make a thing → make that thing fast → make it easy to use → make it modular for developers

I see a lot of workflows like the following
Make a thing fast → make it modular → now I have a thing

I think the first workflow is best for “utility” packages, i.e. ones that provide essential tools used across a variety of fields and that newcomers immediately install or are interested in. I see the second workflow as best for developers who want to build off of your work to strengthen Julia itself.

I am torn about the “utility” functions question… Part of me thinks it’s ridiculously important, especially for people who haven’t been monitoring DataFrames.jl development, for all the goodies to be in one place. Someone quickly grokking some packages will see DataFrames.jl and know they need that, but they won’t necessarily know they need something like “DataFramesUtils.jl” unless they read about and understand the design decisions. Not to be negative - most people using the package probably don’t want to do that, though people working on a similar package definitely would. And then there’ll be really cringey Pythonista blog posts (by, yes, people not as smart) about how “slow Julia is because you have to compile 10 packages to join and filter on 2 CSVs”.

That being said, I know the Julian way of building packages is to make them as modular as possible while still having each unit retain its meaning. But sometimes I do worry about what that does to the readability of codebases. No one wants to jump across 5 packages to see if they can use your code (not talking about developing off of your code), hunting for documentation, and dealing with the increased surface area for things to go wrong. That whole death-by-abstraction thing (e.g. https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpriseEdition)…

1 Like

No no no no no

Writing x means, and always should mean, “the value of the variable x”, so

x = "foo"
mydf[!, x]

means, and ALWAYS should mean, “grab column foo”.

This is an anti-pattern in R: it uses the text you type instead of the values of variables, which makes it nearly impossible to reason about any serious R code.

9 Likes

This is true! I definitely agree with you about the perils of using tidy evaluation in this way and it is a constant source of frustration for me in R.

The use of literals would only apply inside DataFramesMeta macros, i.e. @transform, @select, @with, and @byrow. Therefore readers of the code will always be alerted to the presence of non-standard evaluation with a macro.

Second, we would work very hard to ensure that

  1. It is 100% as easy to use x = "foo" as a column name as it is to use code literals. Testing will ensure that there is feature parity between the two and that the escaping rules are far clearer than R’s quo and enquo.
  2. It is always obvious when something is a local variable rather than a column name, perhaps via $ or some other Julia convention (a rough sketch follows below).
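To make point 2 concrete, here is a rough sketch of what such an escape could look like, assuming a $ marker for “the column whose name is stored in this variable” (the exact syntax is illustrative, not a committed design):

using DataFramesMeta

df = DataFrame(sex = ["male", "female"], age = [30, 40])
col = :age   # a local variable holding a column name

# Code literal: :age always means the column named age.
@transform(df, :age2 = :age .* 2)

# Escaped form: $col means the column whose name is the value of col.
@transform(df, :age2 = $col .* 2)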

Maintaining these rules and clarity is very important. It is in my mind the major benefit that we can implement in Julia over R. If I can’t find a way to get consistent escaping rules, then this change won’t get implemented.

Nonetheless, I often use Parameters.jl’s @unpack and @pack macros when working with data in Julia simply because I find using Symbols and indexing cumbersome. Users have routinely complained about the verbosity of DataFrames, including the use of : in certain places.

Finally, note that we currently have a (manageable) level of ambiguity about code literals being used in DataFrames, with df.x always returning the column :x no matter what the variable x represents.
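A toy example of that ambiguity:

using DataFrames

df = DataFrame(x = 1:3, age = [30, 40, 50])
x = "age"

df.x      # always the column literally named x, whatever the variable x holds
df[!, x]  # the column named by the value of x, i.e. the age column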

3 Likes

Yes, this I can get on board with. Inside a macro, code means something different than outside a macro, and that’s always true, so the only way to know how a thing works is to read the docs on the macro. I’m OK with that. The problem in R is that you never know where nonstandard evaluation will just appear; there’s no indication in the text of the code.

Exactly. I can’t tell you how many times I’ve been bitten by trying to use the value of a variable in a plot or a data summarization and instead getting a blank plot because the name of the variable was used instead of its value… it’s a disaster in R, one of the straws that broke the camel’s back for me regarding R, and it’s gotten progressively worse as Hadley has polluted the R ecosystem with weird evaluation.

Consider, in ggplot2, aes() vs aes_() vs aes_string().

We’ve had something like 10 years of people futzing with this, and no one I know is really sure how to make any given thing they’re trying to do work, except the very simplest thing like ggplot(foo, aes(x, y))… only to find, if you pull up the help on aes_, that

Life cycle:

 All these functions are soft-deprecated. Please use tidy
 evaluation idioms instead (see the quasiquotation section in
 ‘aes()’ documentation).

But that’s how it’s supposed to work, right? I mean

struct Foo
  x::Int 
end
x = "bar"
foo = Foo(1)
foo.x

should always return the x field of foo, not throw an error “There is no field named bar in a struct Foo”

So this is more or less the way it works everywhere else. I’m going to have to go watch the talk on DataFrames indexing though. I had intended to do that all along, but need to make the time. Glad they’re all recorded!

Also thanks all for the work making DataFrames extremely useful. I’m opinionated here, but not ungrateful!

1 Like

I absolutely agree with you about ggplot; I even asked about it on the R Discourse yesterday!

2 Likes

Would a matrix work for you?

I actually did switch back to the “manual” method using matrices, for performance reasons.

How much performance potential is still untapped for DataFrames? Is there any hope of catching up with data.table?

There are different cases: in some we are slower, in some we are on par, and in some we are faster. One place where we are definitely slower is the time of a first run; to reduce it you need to build a custom system image.
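For reference, a minimal sketch of building such an image with PackageCompiler.jl (the output file name is arbitrary):

using PackageCompiler

# Bake DataFrames into a custom sysimage so a fresh session
# does not pay the full compilation cost on first use.
create_sysimage(["DataFrames"]; sysimage_path = "sys_dataframes.so")

# Then start Julia with:
#   julia --sysimage sys_dataframes.so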

Now, in general: in the cases where we are much slower (e.g. joins), a faster implementation is currently under development. There are also incremental improvements in split-apply-combine.

Finally, we currently do not do multithreading, as data.table does. That is also planned to be implemented in the future.

2 Likes

Are there no cases in which we are faster? :wink:

(I assume/hope there’s a typo in your first sentence!)

1 Like

Yes, a typo. In some we are faster.

I still find DataFrames quite verbose. Not as verbose as pandas, but more verbose than data.table.

For example, df[df.sex .== "male", :] would be df[sex == "male"] in data.table. Within the DataFrame environment, there is no confusion between df.sex and sex. I am OK with df[:sex .== "male"].

Also, the filter function seems very strange, because it takes the DataFrame as its second argument while the other functions take it as their first argument.

For example,

@pipe df |>
 filter(:sex => ==("male"), _) |>
 groupby(_, :pclass) |>
 combine(_, :age => mean)

This just does not seem consistent.

If all the functions take a DataFrame on the first argument, then it would be:

@pipe df |>
 filter(_, :sex => ==("male")) |>
 groupby(_, :pclass) |>
 combine(_, :age => mean)

Now it looks more consistent. If in the future it can be reduced to:

@pipe df |>
 filter(:sex => ==("male")) |>
 groupby(:pclass) |>
 combine(:age => mean)

There is no confusion.

Ideally, I hope it can be reduced even further, to:

@pipe df |>
 filter(:sex .== "male") |>
 groupby(:pclass) |>
 combine(mean(:age))

Or even better, make plain |> work with these functions in DataFrames itself, so people do not need to type @pipe every time they want to use |>.
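For what it’s worth, here is a rough sketch of what such pipe-friendly, curried helpers could look like today; the helper names (filterby, groupedby, combined) are made up and are not DataFrames.jl API:

using DataFrames, Statistics

# Each helper returns a function of a (grouped) data frame,
# so plain |> composes them without a piping macro.
filterby(pred) = df -> filter(pred, df)
groupedby(cols) = df -> groupby(df, cols)
combined(args...) = gdf -> combine(gdf, args...)

df = DataFrame(sex = ["male", "female", "male"],
               pclass = [1, 1, 2],
               age = [30.0, 40.0, 50.0])

df |> filterby(:sex => ==("male")) |> groupedby(:pclass) |> combined(:age => mean)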

1 Like

These kinds of syntax can only be achieved with macros, e.g.

@dt df[sex == "male"]

but no one’s written the @dt macro yet.

You can use @where from DataFramesMeta, so:

@where(df, :sex .== "male")

This has been discussed elsewhere: you can load DataConvenience instead of using the @where macro.

Just use DataConvenience.jl or using Lazy: @>

then you can do

using DataConvenience: filter, @>
using DataFramesMeta: @based_on
@> df begin
  filter(:sex .== "male")
  groupby(:pclass)
  @based_on(mean(:age))
end

This was discussed elsewhere; the proposal was to make groupby etc. into curried versions, but it was voted down.

I also voted it down, because you can just use a macro.

The macro system in Julia is not as convenient as R’s, since it makes a distinction between macros and normal functions, whereas in R every function has the potential to behave like a macro.

Maybe I am wrong, but this extra verbosity does not seem to bring any clarity, performance, or consistency.

I mean no offense, but the workaround you mentioned:

using DataConvenience: filter, @>
using DataFramesMeta: @based_on
@> df begin
  filter(:sex .== "male")
  groupby(:pclass)
  @based_on(mean(:age))
end

seems really strange. The mixture of macros and functions makes the syntax hard to follow.

I really appreciate the efforts by the DataFrames team, but I think I will stick to DataFramesMeta. To be honest, the built-in functions in DataFrames confuse me a lot.

2 Likes

I agree. That is a problem. So I think DataFramesMeta should introduce an all-macros approach, so it becomes

using DataConvenience: @>
using DataFramesMeta
@> df begin
  @filter(:sex .== "male")
  @groupby(:pclass)
  @combine(mean(:age))
end

Especially if you come from R. But Julia has a better chance of being logically and aesthetically consistent.

Actually, dplyr is really odd and data.table is also really odd. I remember being really confused when I first learned them. No doubt it’s similar to how you feel about Julia now.

Do you use dplyr or data.table?

using DataConvenience
using DataFramesMeta

@> df begin
  @where(:sex .== "male")
  @groupby(:pclass)
  @combine(mean(:age))
end

This is not much more verbose once it’s done.

But if you use it a lot, the extra verbosity is just two lines of using statements. Averaged over hundreds of lines of data manipulation code, it’s no big deal.

I prefer data.table in my work due to its speed and memory efficiency.

To the best of my knowledge, many R data science classes use dplyr because it is easier.

I have always used DataFramesMeta, and I thought the plan was to test functions in it and then absorb the good ones into DataFrames. Now I realize that DataFrames has started a lot of new things by itself…

Maybe in the future. My reading is that DataFramesMeta.jl is a convenience layer on top of DataFrames.jl, and that DataFrames.jl will remain macro-free for the foreseeable future so the core devs can focus on core functionality and leave convenience features to other packages, at least for now.