I’m a bit puzzled by this example, where I expected the same column selector to work for both select and combine. For some reason I’m unable to use the regex selector with combine:
I’m not sure why AsTable works with Regex. There could be a Base.broadcastable definition somewhere. But for
julia> combine(df, r"b\d" .=> mean)
this is correct behavior. DataFrames.jl does not get to override Base broadcasting rules for Regex. That’s why Cols was recently modified to allow broadcasting.
EDIT: There is indeed a Base.broadcastable definition. See here. I’m not sure of the reason, but I’m sure there is one.
In any case, there seems to be an inconsistency with the docs at ?select, which say:
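To make the scalar behavior concrete, here is a minimal sketch that needs only Base and the Statistics standard library. The column names "b1" and "b2" are made up for illustration; no data frame is involved, since the point is what `.=>` produces before DataFrames.jl ever sees it.

```julia
using Statistics

# A Regex is treated as a scalar in broadcasting, so .=> builds ONE Pair,
# exactly as if we had written r"b\d" => mean.
regex_spec = r"b\d" .=> mean

# A vector of names is a real collection, so .=> builds one Pair per name.
vector_spec = ["b1", "b2"] .=> mean

println(regex_spec isa Pair)       # true
println(vector_spec isa Vector)    # true
println(length(vector_spec))       # 2
```

So the regex version asks for one call of mean on all matched columns at once, while the vector version asks for one call of mean per column.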
All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be
of the following forms:
1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All,
Cols, :, Between, Not and regular expressions)
All these functions are select, transform and combine.
This doesn’t mention broadcasting at all, and I think broadcasting is the only thing producing the seemingly inconsistent behavior. If working with collections and broadcasting looks inconsistent within DataFrames.jl, that’s fine, because it’s consistent with Base.
I think the philosophy on some of these docs is that Base rules take precedence. Issues like this might not need to be documented if they are downstream of Base behavior.
OK, I see. I somehow thought that the regex just returns a vector of column names, but that’s obviously wrong; I guess that’s what Cols does. Yeah, I think that should be made explicit somewhere, because the regex selector is so important.
Remember that the major benefit of the src => fun => dest mini-language is that there is no magic, unlike dplyr. Everything gets evaluated as normal, meaning the src => fun => dest gets evaluated first, before DataFrames.jl can do anything with it.
Yes it’s annoying that regexes don’t work like Cols, Between and Not, but there’s no solution to that as it’s defined in Base. This kind of thing could be handled more consistently in DataFramesMeta, but wrapping in Cols is the easiest solution.
The error @floswald gets with combine(df, r"b\d" .=> mean) is due to calling mean(df.b1, df.b2), which interprets df.b1 as a function to apply to each element of df.b2. Unfortunately the error is cryptic, and varies according to the number of columns that match the regex.
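A small sketch of why the failure depends on how many columns match, using plain vectors as stand-ins for df.b1, df.b2 and a third matched column. The helper `throws_method_error` is hypothetical and only there for illustration; on the Julia versions I have in mind both failing calls surface as a MethodError, but the message differs.

```julia
using Statistics

# Stand-ins for the columns a regex might match (made-up data).
b1, b2, b3 = [1, 2], [3, 4], [5, 6]

# Hypothetical helper: true if calling f() throws a MethodError.
function throws_method_error(f)
    try
        f()
        return false
    catch err
        return err isa MethodError
    end
end

one_col    = mean(b1)                                      # one match: just works
two_cols   = throws_method_error(() -> mean(b1, b2))       # b1 is used as the function in mean(f, itr)
three_cols = throws_method_error(() -> mean(b1, b2, b3))   # no three-argument method of mean

println(one_col)      # 1.5
println(two_cols)     # true
println(three_cols)   # true
```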
AsTable is broadcastable to allow things such as applying a series of functions to each row in a table efficiently. See the last example of this section of the manual:
Let me summarize the issue here, as there are many things at play:
Regex can be used in a normal operation specification like r"a" => fun, which passes the columns selected by r"a" as positional arguments to fun.
In broadcasting, both Regex and AsTable are scalars, so r"a" .=> fun and AsTable(...) .=> fun are just the same as r"a" => fun and AsTable(...) => fun. As @nalimilan commented, this works this way for Regex because Base Julia defines it this way. Colon() (:) also behaves this way: it is defined to be a scalar in broadcasting in Base Julia, so writing (:) .=> fun is the same as (:) => fun.
On the other hand, Cols, Not, Between, and All are defined in the data ecosystem, and they expand to the columns they select when broadcasted; in other words, in broadcasting they are treated as vectors of the names they select.
Now the second aspect is mean: you need to understand its contract to see why things work this way. See it by example:
julia> mean([1,2], [3,4], [5,6]) # this is what happens if you write r"a" => mean
# Error
julia> mean((a1=[1,2], a2=[3,4], a3=[5,6])) # this is what happens if you write AsTable(r"a") => mean
2-element Vector{Float64}:
3.0
4.0
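The same contract can be checked inside a data frame. This is a sketch with a toy data frame (made-up data mirroring the vectors in the REPL example above); the destination name :avg is my own choice so that nothing depends on the auto-generated column name.

```julia
using DataFrames, Statistics

# Toy data (hypothetical), same values as in the REPL example.
df = DataFrame(a1 = [1, 2], a2 = [3, 4], a3 = [5, 6])

# AsTable hands mean a single NamedTuple of columns, so mean averages
# the three vectors elementwise and returns a 2-element vector.
out = combine(df, AsTable(r"a") => mean => :avg)
println(out.avg)   # [3.0, 4.0]

# Positional passing, i.e. mean(df.a1, df.a2, df.a3), throws instead.
positional_fails = try
    combine(df, r"a" => mean)
    false
catch
    true
end
println(positional_fails)   # true
```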
In summary:
all is consistent and correct
you just need to know and remember that Regex, AsTable, and : are scalars in broadcasting, while All, Cols, Not, and Between are vectors that contain columns selected by them.
This second special case is explained in Split-apply-combine · DataFrames.jl (the first case is not listed there as this is a default behavior in broadcasting of scalars in Base Julia). I copy-paste the relevant part:
Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
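A short sketch of that equivalence, assuming a DataFrames.jl version in which Cols supports broadcasting (as the note describes). The data frame and its column names are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data (hypothetical): one column that does not match, two that do.
df = DataFrame(a = 1:3, b1 = [1, 2, 3], b2 = [4, 5, 6])

# Cols expands to the matched column names when broadcasted ...
via_cols = combine(df, Cols(r"b\d") .=> mean)

# ... which is the same as broadcasting over names(df, r"b\d") yourself.
via_names = combine(df, names(df, r"b\d") .=> mean)

println(via_cols == via_names)   # true
println(names(via_names))        # ["b1_mean", "b2_mean"]
```

The names(df, ...) form is a bit more verbose, but it makes the intermediate vector of strings visible, which some people find easier to reason about.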
Wow, I had no idea that this even works. So much for that!
Thanks for the explanations, I think I understand most of it now. One has to know Base Julia well in order to use the minilanguage effectively. I guess I was too quick to generalize from this
select(df, r"b")
to that
combine(df, r"b" .=> fun)
and said it’s inconsistent because it does not work this way. I guess the key mistake was to think that the second command first does something equivalent to the select (teasing out some columns) and then operates on that selected data frame, which is not what’s happening. I think that writing out the src => fun part like @pdeffebach showed me is quite helpful for me.
Will try to remember the golden rule (“you just need to know and remember…”) there! Thanks again!
Agreed. However, in general when using Julia it is a valid statement if you want to be sure that Julia does what you expect. Broadcasting is powerful, but pretty complex if you dig deeper into it (fortunately, in common cases it is easy).
After this really enlightening discussion I feel vindicated for using names(df, r"... "), which transparently creates a simple vector of strings that even I can reason about.
I would like to add a topic, for possible further study, to this interesting discussion.
Instead of acting on the input to the left of =>, you can act on the right-hand side (the function func) in the following ways, obtaining the average by rows or by columns.