I want to do exactly the same thing, except that my function doesn’t return a Float64 and a Float64. It returns a Tuple (or some other iterable, like an array), an RGB, and some other type:
(μ = Point2f0(...), color = RGB(...), xy = ...)
But I keep getting ERROR: ArgumentError: mixing single values and vectors in a named tuple is not allowed.
I have to say, though, that it kind of defeats the purpose of using a DataFrame for that. For example, to extract a column I need getfield.(df.column, :x)… hmmm
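To make that concrete, here is a minimal sketch in plain Julia of what pulling one field out of such a column looks like. The vector col is a hypothetical stand-in for df.column, and the field names x and y are made up for illustration:

```julia
# Hypothetical stand-in for a DataFrame column whose cells are NamedTuples.
col = [(x = 1.0, y = 2.0), (x = 3.0, y = 4.0)]

# Broadcasting getfield extracts one field from every cell.
xs = getfield.(col, :x)

# An equivalent spelling using property access inside map.
ys = map(c -> c.y, col)
```

Either form produces an ordinary vector, but the extra step is needed every time a nested field is wanted.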
Regarding mixing scalars and vectors in a named tuple, maybe we should do that in the future for convenience. But for now wrapping scalars in one-element vectors isn’t too bad.
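A minimal sketch of the one-element-vector workaround, under some assumptions: it is written against more recent DataFrames.jl releases, where by has been replaced by combine(groupby(...)); the column names g, x and y are hypothetical; and a plain Tuple stands in for Point2f0 so that no plotting or color package is needed.

```julia
using DataFrames

# Hypothetical data: two groups with x/y coordinates.
df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0])

res = combine(groupby(df, :g)) do sdf
    n = nrow(sdf)
    # Every field is wrapped in a one-element vector, so each group
    # contributes exactly one row and nothing is spread over rows.
    (μ  = [(sum(sdf.x) / n, sum(sdf.y) / n)],  # Tuple stored as one cell
     xy = [collect(zip(sdf.x, sdf.y))])        # whole vector stored as one cell
end
```

Because all fields are vectors of the same length, the named tuple is treated as a small table and the mixing error does not occur.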
Wait, why would this be allowed? The form where the DataFrame goes first, followed by the Pairs, is more consistent with select and is more explicit about what’s going on.
Initially I wanted to disallow it, but @nalimilan convinced me to add it.
The first reason is that in this form we are type-stable and fast as opposed to by(some_fun, iris, :Species) which is slow.
The second reason is that, on 0.20.2, when a Pair is passed as the last argument the function may only return a single value or a vector (this is the same as the kwarg form currently). The Pair form as the first argument is different: it allows the function to return a table, such as a NamedTuple. And again, in the example above this is useful.
Additionally, this allows map to be fast (map does not accept transformations as the last arguments; they must be the first argument).
In summary: Pair as last argument is 100% consistent with select. Pair as first argument is a special case for special applications: when you want the function to return a table rather than a single value or a vector, when you use map, or when you need the operation to be fast.
Wow, sorry for the absence and thank you for the amazing attention. I’ll try to address most of the comments in chronological order:
The example code I included, and the one I assume you refer to here, came from DataFrames.jl’s own documentation (line 83 in DataFrames.jl/docs/src/man/split_apply_combine.md).
I’m ok with writing anything (thank you for your amazing work on DataFrames!)
This is my main problem. And solving it by
seems suboptimal since it involves spreading and then collecting for no “real” reason.
Wrapping stuff in a Ref isn’t great either, because then I can’t really refer to columns without unwrapping them.
Wrapping stuff in a vector, however, works great. I’m just wondering about the cost of creating all these one-element vectors everywhere. I might be fussing unnecessarily.
After this point in the thread I feel the discussion veered towards the design of the syntax of the by function, but I might have missed some subtle way to let the function passed to by return vectors without auto-spreading (which would then also work for return values that mix iterable and non-iterable types)?
Long story short: It’s awesome that by can “auto-broadcast” stuff returned from the function, but I’m interested in sometimes avoiding that behavior (because the things in the cells of the DataFrame are to be considered as a singular entity, even if they are iterable) – either by wrapping things into a single element vector, or perhaps a flag?
The overhead of wrapping things in a vector is around 5 seconds for 10^7 groups. It depends only on the number of groups, not on the operation you perform (so the more complex the work by does, the lower the relative impact):