DataFrames.jl development survey

Just to pile on to my previous comment: making DataFrame operations nearly as performant as operations on Vectors or Arrays would, I think, be a dream outcome. For example, I am working on an optimization problem that involves simulating panels of data. There is a nontrivial cost to doing the calculations when the simulated data is in a DataFrame, compared with the approach where I keep everything in arrays like I’m a MATLAB-loser. Of course it is much nicer to write the DataFrames-style code, which is a huge plus, but when the performance gap is hit millions of times it really starts to add up.

It is all still faster than MATLAB anyway so why am I even complaining?

In any place where you hit this kind of performance bottleneck, use one of (depending on the use case):

  1. a function barrier extracting the columns you need
  2. Tables.columns
  3. Tables.namedtupleiterator

to get the columns you need to do the computation on fast.
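A minimal sketch of the options above, assuming DataFrames.jl and Tables.jl are both installed. The data, the column names `id` and `y`, and the helpers `sum_by_id` and `sum_rows` are made up for illustration. `Tables.columntable` is used here as one way to get a concretely typed NamedTuple of columns.

```julia
using DataFrames, Tables

# Hypothetical panel data; column names :id and :y are invented for this sketch.
df = DataFrame(id = repeat(1:3, inner = 4), y = collect(1.0:12.0))

# Kernel that only sees concretely typed vectors, so the compiler can specialize it.
function sum_by_id(id::AbstractVector{<:Integer}, y::AbstractVector{<:Real}, n::Integer)
    out = zeros(float(eltype(y)), n)
    for i in eachindex(id, y)
        out[id[i]] += y[i]
    end
    return out
end

# 1. Function barrier: extract the columns once, then pass them to the kernel.
totals = sum_by_id(df.id, df.y, 3)

# 2. A NamedTuple of column vectors, whose field types are known to the compiler.
cols = Tables.columntable(df)
totals_cols = sum_by_id(cols.id, cols.y, 3)

# 3. A type-stable row iterator; each row is a NamedTuple with fields id and y.
function sum_rows(rows, n::Integer)
    out = zeros(n)
    for r in rows
        out[r.id] += r.y
    end
    return out
end
totals_rows = sum_rows(Tables.namedtupleiterator(df), 3)
```

In all three cases the hot loop runs inside a function whose argument types are concrete, which is what removes the per-access overhead of working on the DataFrame directly.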


Just make it as fast as possible.

Thank you for all the hard work you’ve put into this package - I use it pretty much daily and love it.

To me the order of priorities goes like this:
Make a thing -> make that thing fast -> make it easy to use -> make it modular for developers

I see a lot of workflows like the following
Make a thing fast -> make it modular -> now I have a thing

I think the first workflow is best for “utility” packages, i.e. ones that provide essential tools used across a variety of fields and that newcomers immediately install or are interested in. I see the second workflow as best for the developers who want to build off of your work to strengthen Julia itself.

I am torn about the “utility” functions question… Part of me thinks it’s ridiculously important, especially for people who haven’t been monitoring DataFrames.jl development, for all the goodies to be in one place. Someone quickly grokking some packages will see DataFrames.jl and know they need that, but they won’t necessarily know they need something like “DataFramesUtils.jl” unless they read about and understand the design decisions. Not to be negative, but most people using the package probably don’t want to do that, whereas people working on a similar package definitely would. And then there’ll be really cringey pythonista blog posts, by, yes, people who are not as smart, about how “slow Julia is because you have to compile 10 packages to join and filter on 2 csvs”.

That being said, I know the Julian way of building packages is to make them as modular as possible while still having each unit retain its meaning. But sometimes I do worry about what that does to the readability of codebases. No one wants to jump across 5 packages to see if they can use your code (not talking about developing off of your code), hunt for documentation, and deal with an increased surface area for things to go wrong. That whole death by abstraction thing (ex: https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpriseEdition)…

No no no no no

writing x means and always should mean “the value of the variable x” so

x="foo"
mydf[!,x]

means and ALWAYS should mean grab column foo

it is an anti-pattern in R that it uses the text you type instead of the values of variables, which makes it nearly impossible to reason about any serious R code


This is true! I definitely agree with you about the perils of using tidy evaluation in this way and it is a constant source of frustration for me in R.

The use of literals would only apply inside DataFramesMeta macros, i.e. @transform, @select, @with, and @byrow. Therefore readers of the code will always be alerted to the presence of non-standard evaluation with a macro.

Second, we would work very hard to ensure that

  1. It is 100% as easy to use x = "foo" as a column name as it is to use code literals. Testing will ensure that there is feature parity between the two and that escaping rules are far more clear than R’s quo and enquo.
  2. It is always obvious when something is a local variable that is not a column name. Perhaps via $ or some other Julia convention.

Maintaining these rules and clarity is very important. It is in my mind the major benefit that we can implement in Julia over R. If I can’t find a way to get consistent escaping rules, then this change won’t get implemented.

Nonetheless, I often use Parameters.jl’s @unpack and @pack! macros when working with data in Julia simply because I find using Symbols and indexing cumbersome. Users have routinely complained about the verbosity of DataFrames, including the use of : in certain places.

Finally, note that we currently have a (manageable) level of ambiguity about code literals being used in DataFrames, with df.x always returning the column :x no matter what the variable x represents.
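A small sketch of that difference, assuming a DataFrames.jl version that accepts strings as column names (as the earlier `mydf[!,x]` snippet does). The DataFrame and its columns `x` and `foo` are made up for illustration.

```julia
using DataFrames

# Hypothetical data with one column literally named x and another named foo.
df = DataFrame(x = [1, 2], foo = [3, 4])
x = "foo"

by_value   = df[!, x]                    # indexing uses the value of x: column foo
by_literal = df.x                        # property syntax uses the literal name: column x
by_symbol  = getproperty(df, Symbol(x))  # property access driven by the value: column foo
```

So `df.x` behaves like field access on a struct, while `df[!, x]` follows the usual rule that `x` stands for its value.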


Yes, this I can get on board with. Inside a macro, code means something else than outside a macro, and that’s always true, so the only way to know how a thing works is to read the docs on the macro. I’m ok with that. The problem in R is that you never know where nonstandard evaluation will just appear; there’s no indication in the text of the code.

Exactly. I can’t tell you how many times I’ve been bitten by trying to use the value of a variable in a plot or a data summarization or whatever and instead getting a blank plot because the name of the variable was used instead of the value of the variable… it’s a disaster in R, one of the straws that broke the camel’s back for me regarding R, and it’s gotten progressively worse as Hadley has polluted the R ecosystem with weird evaluation.

consider in ggplot2 aes() vs aes_() vs aes_string()

we’ve had like 10 years of people futzing with this, no one I know is really sure how to make any given thing they’re trying to do work, except the very simplest thing like ggplot(foo, aes(x,y)) … only to find, if you get the help on aes_, that

Life cycle:

 All these functions are soft-deprecated. Please use tidy
 evaluation idioms instead (see the quasiquotation section in
 ‘aes()’ documentation).

But that’s how it’s supposed to work right? I mean

struct Foo
  x::Int
end
x = "bar"     # an unrelated variable that happens to be named x
foo = Foo(1)
foo.x         # the field x of foo, i.e. 1

should always return the x field of foo, not throw an error “There is no field named bar in a struct Foo”

So this is more or less the way it works everywhere else. I’m going to have to go watch the talk on DataFrames indexing though. I had intended to do that all along, but need to make the time. Glad they’re all recorded!

Also thanks all for the work making DataFrames extremely useful. I’m opinionated here, but not ungrateful!


I absolutely agree with you about ggplot, I even asked discourse in R about it yesterday!
