Combine with regex column selector?

I’m a bit puzzled by this example, where I expected the same column selector to work for both select and combine. For some reason I’m unable to use the regex selector with combine:

julia> using DataFrames, Statistics

julia> df = DataFrame(a=repeat([1, 2, 3, 4], outer=[2]),
                      b1=repeat([2, 1], outer=[4]),
                      b2=1:8)

julia> combine(df, [:b1, :b2] .=> mean)
1×2 DataFrame
 Row │ b1_mean  b2_mean 
     │ Float64  Float64 
─────┼──────────────────
   1 │     1.5      4.5

julia> select(df, r"b\d")  # works
8×2 DataFrame
 Row │ b1     b2    
     │ Int64  Int64 
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     2      3
   4 │     1      4
   5 │     2      5
   6 │     1      6
   7 │     2      7
   8 │     1      8

julia> combine(df, r"b\d" .=> mean)
ERROR: MethodError: objects of type Vector{Int64} are not callable
Use square brackets [] for indexing an Array.

Moreover, somewhat confusingly, AsTable works with the regex, but I expected one row with two numbers as output:

julia> combine(df, AsTable(r"b\d") .=> mean)
8×1 DataFrame
 Row │ b1_b2_mean 
     │ Float64    
─────┼────────────
   1 │        1.5
   2 │        1.5
   3 │        2.5
   4 │        2.5
   5 │        3.5
   6 │        3.5
   7 │        4.5
   8 │        4.5

I’m more surprised that select works with the regex selector. I tend to go for

combine(df, names(df, r"b\d") .=> mean)

I’m not sure why AsTable works with Regex. There could be a Base.broadcastable definition somewhere. But for

julia> combine(df, r"b\d" .=> mean)

this is correct behavior. DataFrames.jl does not get to override Base broadcasting rules for Regex. That’s why Cols was recently modified to allow broadcasting.

EDIT: There is indeed a Base.broadcastable definition. See here. I’m not sure of the reason, but I’m sure there is one.

In any case, there seems to be an inconsistency with the docs at ?select, which say:

All these functions take a specification of one or more functions to apply to each subset of the DataFrame. This specification can be
of the following forms:

1. standard column selectors (integers, Symbols, strings, vectors of integers, vectors of Symbols, vectors of strings, All,
   Cols, :, Between, Not and regular expressions)

“All these functions” here means select, transform and combine.

Let’s not get hung up on the AsTable broadcast; that may be a different issue. I was just trying all sorts of stuff.

I don’t think this is wrong. You can do

foo(args...) = 1
select(df, r"a/b" => foo)

which doesn’t use broadcasting.

The docs passage doesn’t mention broadcasting at all, and broadcasting, I think, is the only thing causing the seemingly inconsistent behavior. But if working with collections and broadcasting looks inconsistent within DataFrames.jl, that’s fine because it’s consistent with Base.

I think the philosophy behind some of these docs is that Base rules take precedence. Issues like this might not need to be documented if they are downstream of Base behavior.

The simple question is: how do I do this

combine(df, [:b1, :b2] .=> mean)

if I have 50 b’s from 1-50?

I would use Cols(r"b\d") .=> mean

OK, yes, that works. My point is that this was really hard to find out. If I see

select(df, r"b")

works, and I read the docs, I expect that

combine(df, r"b" .=> fun)

works as well. It’s not really clear why one needs Cols in one case and not the other.

Yes, that’s definitely frustrating.

You have to know that Regex isn’t broadcastable first, then things make sense.

Maybe Cols should be highlighted more in the docs because it exists to solve this exact problem.

OK, I see. I somehow thought that the regex just returns a vector of column names, but that’s obviously wrong. I guess that’s what Cols does. That should be made explicit somewhere, I think, because the regex selector is so important.

Remember that the major benefit of the src => fun => dest mini-language is that there is no magic, unlike dplyr. Everything gets evaluated as normal, meaning the src => fun => dest gets evaluated first, before DataFrames.jl can do anything with it.

That is actually really helpful. Maybe that’s a good strategy for composing a call to start with! I’ll try to remember this.

julia> [:a , :b] .=> mean .=> [:out1, :out2]
2-element Vector{Pair{Symbol, Pair{typeof(mean), Symbol}}}:
 :a => (Statistics.mean => :out1)
 :b => (Statistics.mean => :out2)

Actually, I was wrong about my explanation above. Regex is broadcasted like Ref.


julia> r"a" .=> [1, 2]
2-element Vector{Pair{Regex, Int64}}:
 r"a" => 1
 r"a" => 2

What’s going on is that your call in combine creates

julia> r"b\d" .=> mean
r"b\d" => Statistics.mean

Notice it’s not a vector… so I would expect it’s trying to do

mean([1, 2], [3, 4], [5,6])

or similar. This expectation is correct, and the error I get is

julia> combine(df, r"b\d" .=> mean)
ERROR: MethodError: no method matching mean(::Vector{Float64}, ::Vector{Float64}, ::Vector{Float64}, ::Vector{Float64},

which is different than your original error of

julia> combine(df, r"b\d" .=> mean)
ERROR: MethodError: objects of type Vector{Int64} are not callable
Use square brackets [] for indexing an Array.

So no, I don’t know exactly what’s going on. But the solution remains the same: use Cols to broadcast a Regex inside src => fun => dest.

Yes it’s annoying that regexes don’t work like Cols, Between and Not, but there’s no solution to that as it’s defined in Base. This kind of thing could be handled more consistently in DataFramesMeta, but wrapping in Cols is the easiest solution.

The error @floswald gets with combine(df, r"b\d" .=> mean) is due to calling mean(df.b1, df.b2), which interprets df.b1 as a function to apply to each element of df.b2. Unfortunately the error is cryptic, and varies according to the number of columns that match the regex.
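To make that concrete, here is a minimal sketch using only the Statistics standard library. The vectors b1 and b2 are stand-ins I made up for the matched columns, not the actual data frame columns:

```julia
using Statistics

# Hypothetical stand-ins for two columns matched by the regex.
b1 = [2, 1, 2, 1]
b2 = [1, 2, 3, 4]

# Two positional arguments: mean(f, itr) treats the first one as a function,
# so the vector b1 gets "called" on the elements of b2 and that fails.
two_cols = try
    mean(b1, b2)
catch err
    err
end

# Three positional arguments: no method of mean matches at all.
three_cols = try
    mean(b1, b2, b1)
catch err
    err
end

# The intended use of mean(f, itr): apply abs2 to each element, then average.
mean_of_squares = mean(abs2, b2)   # (1 + 4 + 9 + 16) / 4
```

Both failures are MethodErrors, but with different messages, which would explain why the error changes with the number of matched columns.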

AsTable is broadcastable to allow things such as applying a series of functions to each row in a table efficiently. See the last example of this section of the manual:

julia> transform(df, AsTable(:) .=>
                     ByRow.([sum∘skipmissing,
                             x -> count(!ismissing, x),
                             mean∘skipmissing]) .=>
                     [:sum, :n, :mean])

Let me summarize the issue here, as many things come into play:

  • Regex can be used in a normal operation specification such as r"a" => fun, which passes the columns selected by r"a" as positional arguments to fun.
  • In broadcasting, both Regex and AsTable are scalars, so r"a" .=> fun and AsTable(...) .=> fun are just the same as r"a" => fun and AsTable(...) => fun. As @nalimilan commented, this works this way for Regex because Base Julia defines it this way. Colon() (:) behaves the same way: it is defined to be a scalar in broadcasting in Base Julia, so writing (:) .=> fun is the same as (:) => fun.
  • On the other hand, Cols, Not, Between, and All are defined in the data ecosystem, and they expand to the columns they select when broadcasted. In other words, in broadcasting they are treated as vectors of the names they select.
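The scalar half of this can be checked without DataFrames.jl at all. A small sketch in plain Julia plus Statistics (the variable names are only for illustration):

```julia
using Statistics

# Regex and Colon() behave as scalars under broadcasting,
# so .=> produces a single Pair rather than a collection.
regex_pair = r"b\d" .=> mean
colon_pair = Colon() .=> mean

# A vector of names is a collection, so .=> produces one Pair per element.
name_pairs = ["b1", "b2"] .=> mean
```

So by the time combine sees r"b\d" .=> mean, it has already become the single pair r"b\d" => mean.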

The second aspect is mean: you need to understand its contract to see why things work this way. Here it is by example:

julia> mean([1,2], [3,4], [5,6]) # this is what happens if you write r"a" => mean
# Error

julia> mean((a1=[1,2], a2=[3,4], a3=[5,6])) # this is what happens if you write AsTable(r"a") => mean
2-element Vector{Float64}:
 3.0
 4.0

In summary:

  • all is consistent and correct
  • you just need to know and remember that Regex, AsTable, and : are scalars in broadcasting, while All, Cols, Not, and Between are vectors that contain columns selected by them.

This second special case is explained in Split-apply-combine · DataFrames.jl (the first case is not listed there as this is a default behavior in broadcasting of scalars in Base Julia). I copy-paste the relevant part:

Note! If cols or target_cols are one of All, Cols, Between, or Not, broadcasting using .=> is supported and is equivalent to broadcasting the result of names(df, cols) or names(df, target_cols). This behaves as if broadcasting happened after replacing the selector with selected column names within the data frame scope.
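As a sketch of what that note means in practice, using the df from the original question and assuming a DataFrames.jl version recent enough to support broadcasting over Cols, the two calls below should give the same result:

```julia
using DataFrames, Statistics

df = DataFrame(a=repeat([1, 2, 3, 4], outer=2),
               b1=repeat([2, 1], outer=4),
               b2=1:8)

# Explicit route: names() returns a plain vector of strings,
# which broadcasts element by element in Base.
via_names = combine(df, names(df, r"b\d") .=> mean)

# Cols route: the selector is expanded to the matching column names
# by DataFrames.jl when the pair is processed against df.
via_cols = combine(df, Cols(r"b\d") .=> mean)
```

Both produce a one-row data frame with the columns b1_mean and b2_mean.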

wow I had no idea that this even works. So much for that!

Thanks for the explanations, I think I understand most of it now. One has to know Base Julia well in order to use the minilanguage effectively. I guess I was too quick to jump from this

select(df, r"b")

to that

combine(df, r"b" .=> fun)

and said it’s inconsistent because it does not work this way. I guess the key mistake was to think that the second command first does something equivalent to the select (picking out some columns) and then operates on that selected data frame, which is not what’s happening. Writing out the src => fun part like @pdeffebach showed is quite helpful for me.

I will try to remember the golden rule (“you just need to know and remember…”) there! Thanks again! :slight_smile:

Agreed. However, that statement holds for Julia in general if you want to be sure that it does what you expect :smile:. Broadcasting is powerful, but pretty complex if you dig deeper into it (fortunately, in common cases it is easy).

After this really enlightening discussion I feel vindicated for using names(df, r"... ") which transparently creates a simple vector of strings that even I can reason about :grinning:

I would like to add a topic to this interesting discussion, for possible further study.
Instead of acting on the input to the left of =>, you can act on the right-hand part (the function) in the following ways, obtaining the average by rows or by columns.

combine(df, r"b\d" => (x...)->mean(x))

combine(df, r"b\d" => ((x...)->[mean.(x)...;;])=>AsTable)
combine(df, r"b\d" => ((x...)->[mean.(x)...]')=>AsTable)

which in the case of two columns can be written like this

combine(df, r"b\d" => (x,y)->mean((x,y)))
combine(df, r"b\d" => ((x,y)->mean.([x,y])')=>AsTable)