DataFrame by new columns containing arrays

In DataFrames we can do this:

by(iris, :Species, [:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))
3×3 DataFrame
│ Row │ Species    │ a        │ b       │
│     │ String⍰    │ Float64  │ Float64 │
├─────┼────────────┼──────────┼─────────┤
│ 1   │ setosa     │ 0.29205  │ 73.1    │
│ 2   │ versicolor │ 0.717655 │ 213.0   │
│ 3   │ virginica  │ 0.842744 │ 277.6   │

I want to do exactly the same thing, except that my function doesn’t return a Float64 and a Float64. It returns a Tuple (or some iterable, like an array), an RGB, and some other type:

(μ = Point2f0(...), color = RGB(...), xy = ...)

But I keep getting ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed.

Can it be done?

Wrap it in a Ref. It should work.

It did! So just for future reference, I wrapped each individual column in the results with a Ref:

(μ = Ref(Point2f0(...)), color = Ref(RGB(...)), xy = Ref(...))

I have to say though that it kind of defeats the purpose of using a DataFrame for that: to extract a column I need to call getfield.(df.column, :x)… hmmm
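
As a minimal sketch of that unwrapping step in plain Julia (the vector `col` below is a hypothetical stand-in for a DataFrame column whose cells were each wrapped in a Ref, not output from the code above):

```julia
# Hypothetical stand-in for a column of Ref-wrapped cells
col = [Ref((1.0, 2.0)), Ref((3.0, 4.0))]

# Each cell has to be dereferenced before its value can be used.
# getfield.(col, :x) reads the Ref's internal field; getindex.(col)
# dereferences through the public interface and gives the same result.
unwrapped = getfield.(col, :x)
same = getindex.(col)
```

Either way, every access to the column pays for this extra step, which is the inconvenience described above.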

You might want to “spread” the Float64 values and “stack” each iterable column. You can’t do this yet but that might be a good workflow in the future.

The syntax you use will soon be disallowed. Instead of:

by(iris, :Species, [:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))

are you ok to write

by(iris, :Species) do x
    (a=mean(x.PetalLength)/mean(x.SepalLength),
     b=sum(x.PetalLength))
end

or (here I got fancy on purpose - the first form is probably the natural one for your use-case)

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((x...) -> /(mean.(x)...)) => :a,
   :PetalLength => sum => :b)

?
(I am asking to get user feedback on what we are planning to change)
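
For anyone puzzled by the splatted function in the fancy form, here is a small sketch in plain Julia of what it computes. It assumes the selected columns arrive as separate positional arguments, as the proposed syntax implies; the names `ratio_splat` and `ratio_plain` and the data are made up for illustration.

```julia
using Statistics

# Splatted form: x collects all selected columns into a tuple,
# mean.(x) takes the mean of each, and /(a, b) divides them.
ratio_splat = (x...) -> /(mean.(x)...)

# Equivalent two-argument form
ratio_plain = (p, s) -> mean(p) / mean(s)

p = [1.0, 2.0, 3.0]   # made-up data
s = [2.0, 4.0, 6.0]

same_result = ratio_splat(p, s) == ratio_plain(p, s)
```

With exactly two columns the two forms compute the same ratio, so the choice between them is only about readability.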

Ref is not treated in a special way: it will not get unwrapped (at least currently, and there are no plans to change that).

What you should understand is that by and combine sort the output of a function into two kinds:

  • AbstractVecOrMat, AbstractDataFrame and NamedTuple of vectors: the first kind of allowed value (treated as tables)
  • NamedTuple mixing scalars and vectors: disallowed
  • everything else: the second kind of allowed value (treated as scalars)

It is not allowed to mix scalars and tables in return values from functions working on groups of GroupedDataFrame.

Now you hit the rule:

NamedTuple mixing scalars and vectors is disallowed

There are two ways around it:

  • wrap vectors in something that is considered a scalar (a one-element tuple, a Ref, etc.)
  • wrap everything in one-element vectors (not Ref), as such vectors will get unwrapped by combine/by

So I would write:

(μ = [Point2f0(...)], color = [RGB(...)], xy = [...])
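
A minimal sketch of both the failure and the one-element-vector workaround, assuming a toy DataFrame with made-up columns `g`, `x`, `y` and plain tuples standing in for Point2f0/RGB. It is written with `combine(groupby(...))` rather than `by`, on the assumption that the do-block form of combine follows the same rules described above:

```julia
using DataFrames

# Made-up data; column names are illustrative only
df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0])
gd = groupby(df, :g)

# A NamedTuple mixing a scalar-like value (Tuple) and a vector is rejected
mixed_fails = try
    combine(gd) do sdf
        (center = (sum(sdf.x), sum(sdf.y)), xy = collect(zip(sdf.x, sdf.y)))
    end
    false
catch err
    err isa ArgumentError
end

# Wrapping every value in a one-element vector yields one row per group,
# with each cell holding the whole value (even if it is itself iterable)
res = combine(gd) do sdf
    (center = [(sum(sdf.x), sum(sdf.y))], xy = [collect(zip(sdf.x, sdf.y))])
end
```

Here `res.xy` is a column whose cells are vectors, and they can be read directly without any unwrapping.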

It should be stressed that this syntax may be slower than the other one if you have many groups.

Indeed that’s… fancy. :smiley: If you want something simpler, you can just do:

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((p, s) -> mean(p)/mean(s)) => :a,
   :PetalLength => sum => :b)

or even closer to your original code:

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))))

Regarding mixing scalars and vectors in a named tuple, maybe we should do that in the future for convenience. But for now wrapping scalars in one-element vectors isn’t too bad.

will probably be disallowed but I think we will allow

by([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
   iris, :Species)

based on discussion with @nalimilan as in this case it is indeed simplest.

Wait why would this be allowed? The one where the DataFrame goes first followed by the Pairs is more consistent with select and is more explicit about what’s going on.

Initially I wanted to disallow it, but @nalimilan convinced me to add it.

The first reason is that in this form we are type-stable and fast as opposed to by(some_fun, iris, :Species) which is slow.

The second reason is that when a Pair is passed as the last argument, the function may only return a single value or a vector (the same as the kwarg form currently on 0.20.2). The Pair form as the first argument is different: it allows the function to return tables, like a NamedTuple. And again - in the example above this is useful.

Additionally this allows map to be fast (it does not allow transformations as last arguments - they must be the first argument).

In summary - Pair as last argument is 100% consistent with select. Pair as first argument is a special case for special applications (where you want to return a table from a function rather than a single value or a vector, or if you use map, or if you need the operation to be fast).

I have just pushed a commit to https://github.com/JuliaData/DataFrames.jl/pull/2158 that implements it so you can have a look.

After a discussion, such a Pair will also be allowed as the last argument if it is the only Pair passed.

Wow, sorry for the absence and thank you for the amazing attention. I’ll try to address most of the comments in chronological order:

The example code I included, and the one I assume you refer to here, came from DataFrames.jl’s own documentation (line 83 in DataFrames.jl/docs/src/man/split_apply_combine.md).

I’m ok with writing anything :sweat_smile: (thank you for your amazing work on DataFrames!)

This is my main problem. And solving it by

seems suboptimal since it involves spreading and then collecting for no “real” reason.

Wrapping stuff in a Ref isn’t great either, because then I can’t really refer to columns without unwrapping them.

Wrapping stuff in a vector, however, works great. I’m just wondering about the cost of creating all these one element vectors everywhere. I might be fussing unnecessarily.

After this point in the thread I feel the discussion steered towards the design of the syntax of the by function, but I might have missed some subtle way to allow a by-function to return vectors without auto-spreading (and thus also work for mixed return iterability-types)?

Long story short: It’s awesome that by can “auto-broadcast” stuff returned from the function, but I’m interested in sometimes avoiding that behavior (because the things in the cells of the DataFrame are to be considered as a singular entity, even if they are iterable) – either by wrapping things into a single element vector, or perhaps a flag?

The overhead of wrapping things in a vector is around 5 seconds for 10^7 groups. It depends only on the number of groups, not on the operation you perform (so the more complex the work by does, the lower the relative impact):

julia> df = DataFrame(rand(10^7, 2));

julia> @time by(df, :x1, :x2 => x -> x[1]);
  4.050715 seconds (90.38 M allocations: 2.974 GiB, 12.85% gc time)

julia> @time by(df, :x1, :x2 => x -> [x[1]]);
  8.739656 seconds (130.38 M allocations: 5.721 GiB, 7.32% gc time)

julia> @time by(df, :x1, :x2 => x -> x[1:1]);
  9.093664 seconds (140.39 M allocations: 6.168 GiB, 8.32% gc time)

julia> @time by(df, :x1, :x2 => x -> [x[1:1]]);
 14.341072 seconds (150.78 M allocations: 7.238 GiB, 34.06% gc time)
