DataFrame by new columns containing arrays

yakir12 · March 27, 2020, 7:49pm

In DataFrames we can do this:

by(iris, :Species, [:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))
3×3 DataFrame
│ Row │ Species    │ a        │ b       │
│     │ String⍰    │ Float64  │ Float64 │
├─────┼────────────┼──────────┼─────────┤
│ 1   │ setosa     │ 0.29205  │ 73.1    │
│ 2   │ versicolor │ 0.717655 │ 213.0   │
│ 3   │ virginica  │ 0.842744 │ 277.6   │

I want to do exactly the same thing, except that my function doesn’t return a Float64 and a Float64; it returns a Tuple (or some iterable, like an array), an RGB, and some other type:

(μ = Point2f0(...), color = RGB(...), xy = ...)

But I keep getting ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed.

Can it be done?

pdeffebach · March 27, 2020, 7:53pm

Wrap it in a Ref. It should work.

yakir12 · March 27, 2020, 7:56pm

It did! So just for future reference, I wrapped each individual column in the results with a Ref:

(μ = Ref(Point2f0(...)), color = Ref(RGB(...)), xy = Ref(...))

yakir12 · March 27, 2020, 8:50pm

I have to say, though, that it kind of defeats the purpose of using a DataFrame for that: to extract a column I need getfield.(df.column, :x)… hmmm
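
For readers following along, the unwrapping step mentioned above can be sketched in plain Julia, without DataFrames. This is only an illustration: the vector col is a hypothetical stand-in for a column whose cells were wrapped in Ref, and the tuples stand in for the real values.

```julia
# Hypothetical stand-in for a DataFrame column whose cells were wrapped in Ref.
col = [Ref((1.0, 2.0)), Ref((3.0, 4.0))]

# Field access as in the post above: a Ref built with `Ref(value)` keeps its value in field `x`.
via_getfield = getfield.(col, :x)

# The same result without relying on the field name: `r[]` is `getindex(r)`.
via_getindex = getindex.(col)
```

Either way the whole column has to be broadcast over before it can be used, which is the inconvenience being described.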

pdeffebach · March 27, 2020, 8:55pm

You might want to “spread” the Float64 values and “stack” each iterable column. You can’t do this yet but that might be a good workflow in the future.

bkamins · March 27, 2020, 9:16pm

The syntax you use will soon be disallowed. Instead of:

by(iris, :Species, [:PetalLength, :SepalLength] =>
              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))

would you be OK writing

by(iris, :Species) do x
    (a=mean(x.PetalLength)/mean(x.SepalLength),
     b=sum(x.PetalLength))
end

or (here I got fancy on purpose - the first form is probably the natural one for your use case)

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((x...) -> /(mean.(x)...)) => :a,
   :PetalLength => sum => :b)

?
(I am asking to get user feedback on what we are planning to change)

bkamins · March 27, 2020, 9:27pm

Ref is not treated in a special way - it will not get unwrapped (at least currently, and there are no plans to change that).

What you should understand is that by and combine sort the output of a function into two kinds:

AbstractVecOrMat, AbstractDataFrame and NamedTuple of vectors are one kind of allowed value (treated as tables)
a NamedTuple mixing scalars and vectors is disallowed
everything else is the second kind of allowed value (treated as scalars)

It is not allowed to mix scalars and tables in the return values of functions working on the groups of a GroupedDataFrame.

Now you hit the rule:

NamedTuple mixing scalars and vectors is disallowed

There are two ways around it:

wrap vectors in something that is considered a scalar (a one-element tuple, a Ref, etc.)
wrap everything in one-element vectors (not Ref), as such vectors will get unwrapped by combine/by

So I would write:

(μ = [Point2f0(...)], color = [RGB(...)], xy = [...])
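
To make the second workaround concrete, here is a minimal sketch. It is an illustration under stated assumptions, not code from this thread: it uses the combine(groupby(...)) form found in later DataFrames.jl releases instead of by, and the toy table, its column names, and the plain tuple standing in for Point2f0 are all hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical toy data: two groups with two observations each.
df = DataFrame(g = ["a", "a", "b", "b"],
               x = [1.0, 2.0, 3.0, 4.0],
               y = [10.0, 20.0, 30.0, 40.0])

res = combine(groupby(df, :g)) do sdf
    # Every field is a one-element vector, so the NamedTuple is treated as a
    # one-row table and each wrapped value lands in a single cell.
    (μ = [(mean(sdf.x), mean(sdf.y))],  # a tuple (stand-in for Point2f0) in one cell
     xy = [collect(sdf.x)])             # a whole vector in one cell
end
```

With this layout res.xy is an ordinary column whose cells are vectors, so res.xy[1] gives the first group's vector directly, with no Ref to unwrap.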

nalimilan · March 28, 2020, 3:09pm

It should be stressed that this syntax may be slower than the other one if you have many groups.

bkamins:

or (here I got fancy on purpose - probably the first form is a natural for your use-case)
by(iris, :Species,
   [:PetalLength, :SepalLength] => ((x...) -> /(mean.(x)...)) => :a,
   :PetalLength => sum => :b)

Indeed that’s… fancy. If you want something simpler, you can just do:

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((p, s) -> mean(p)/mean(s)) => :a,
   :PetalLength => sum => :b)

or even closer to your original code:

by(iris, :Species,
   [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))))

Regarding mixing scalars and vectors in a named tuple, maybe we should allow that in the future for convenience. But for now wrapping scalars in one-element vectors isn’t too bad.

bkamins · March 28, 2020, 4:44pm

will probably be disallowed but I think we will allow

by([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
   iris, :Species)

based on discussion with @nalimilan as in this case it is indeed simplest.

pdeffebach · March 28, 2020, 6:05pm

Wait why would this be allowed? The one where the DataFrame goes first followed by the Pairs is more consistent with select and is more explicit about what’s going on.

bkamins · March 28, 2020, 8:48pm

Initially I wanted to disallow it, but @nalimilan convinced me to add it.

The first reason is that in this form we are type-stable and fast as opposed to by(some_fun, iris, :Species) which is slow.

The second reason is that with a Pair as the last argument, the function may return only a single value or a vector (the same as the kwarg form currently on 0.20.2). The Pair form as the first argument is different and allows the function to return tables, like a NamedTuple. And again, in the example above this is useful.

Additionally this allows map to be fast (it does not allow transformations as last arguments - they must be the first argument).

In summary - Pair as last argument is 100% consistent with select. Pair as first argument is a special case for special applications (where you want to return a table from a function not a single value or a vector or if you use map or if you need the operation to be fast).

I have just pushed a commit to https://github.com/JuliaData/DataFrames.jl/pull/2158 that implements it so you can have a look.

bkamins · March 28, 2020, 10:44pm

After a discussion: such a Pair will also be allowed as the last argument if it is the only Pair passed.

yakir12 · March 29, 2020, 8:12am

Wow, sorry for the absence and thank you for the amazing attention. I’ll try to address most of the comments in chronological order:

The example code I included, and the one I assume you refer to here, came from DataFrames.jl’s own documentation (line 83 in DataFrames.jl/docs/src/man/split_apply_combine.md).

I’m ok with writing anything (thank you for your amazing work on DataFrames!)

This is my main problem. And solving it by

seems suboptimal since it involves spreading and then collecting for no “real” reason.

Wrapping stuff in a Ref isn’t great either, because then I can’t really refer to columns without unwrapping them.

Wrapping stuff in a vector, however, works great. I’m just wondering about the cost of creating all these one-element vectors everywhere. I might be fussing unnecessarily.

After this point in the thread I feel the discussion steered towards the design of the syntax of the by function, but I might have missed some subtle way to allow a by-function to return vectors without auto-spreading (and thus also work for mixed return iterability-types)?

Long story short: It’s awesome that by can “auto-broadcast” stuff returned from the function, but I’m interested in sometimes avoiding that behavior (because the things in the cells of the DataFrame are to be considered as a singular entity, even if they are iterable) – either by wrapping things into a single element vector, or perhaps a flag?

bkamins · March 29, 2020, 1:59pm

The overhead of wrapping things in a vector is around 5 seconds for 10^7 groups. It depends only on the number of groups, not on the operation you perform (so the more complex the work by does, the lower the relative impact):

julia> df = DataFrame(rand(10^7, 2));

julia> @time by(df, :x1, :x2 => x -> x[1]);
  4.050715 seconds (90.38 M allocations: 2.974 GiB, 12.85% gc time)

julia> @time by(df, :x1, :x2 => x -> [x[1]]);
  8.739656 seconds (130.38 M allocations: 5.721 GiB, 7.32% gc time)

julia> @time by(df, :x1, :x2 => x -> x[1:1]);
  9.093664 seconds (140.39 M allocations: 6.168 GiB, 8.32% gc time)

julia> @time by(df, :x1, :x2 => x -> [x[1:1]]);
 14.341072 seconds (150.78 M allocations: 7.238 GiB, 34.06% gc time)
