Why do I get 'RefValue{SubArray{Int64' and not "simply" 'SubArray{Int64'?

I get the following in the single-column cases:

df = DataFrame(x = [1, 1, 2, 2], y = [1, 2, 101, 102]);
gd = groupby(df, :x);

julia> combine(gd,:x=>(X->Ref(X))=>:rx)
2×2 DataFrame
 Row │ x      rx        
     │ Int64  SubArray… 
─────┼──────────────────
   1 │     1  [1, 1]
   2 │     2  [2, 2]

julia> combine(gd,:x=>(X->tuple(X...))=>:rx)
2×2 DataFrame
 Row │ x      rx     
     │ Int64  Tuple… 
─────┼───────────────
   1 │     1  (1, 1)
   2 │     2  (2, 2)

Instead, if I try to extend to the case of multiple columns, I get this:

julia> combine(gd,[:x,:y]=>((X,Y)->[(tuple(X...),tuple(Y...))])=>[:rx,:ry])
2×3 DataFrame
 Row │ x      rx      ry
     │ Int64  Tuple…  Tuple…     
─────┼───────────────────────────
   1 │     1  (1, 1)  (1, 2)
   2 │     2  (2, 2)  (101, 102)

julia> combine(gd,[:x,:y]=>((X,Y)->[(Ref(X),Ref(Y))])=>[:rx,:ry])
2×3 DataFrame
 Row │ x      rx                                 ry
     │ Int64  RefValue…                          RefValue…
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │     1  RefValue{SubArray{Int64,1,Array{…  RefValue{SubArray{Int64,1,Array{…
   2 │     2  RefValue{SubArray{Int64,1,Array{…  RefValue{SubArray{Int64,1,Array{…

Using the vector form instead, the result is:

julia> combine(gd,[:x=>(X->Ref(extrema(X)))=>:rx,:y=>(X->Ref((sum(X),mean(X))))=>:ry])
2×3 DataFrame
 Row │ x      rx      ry
     │ Int64  Tuple…  Tuple…       
─────┼─────────────────────────────
   1 │     1  (1, 1)  (3, 1.5)
   2 │     2  (2, 2)  (203, 101.5)

I would like to ask where I can find usage examples for all the cases described here,
in particular for the "standard column selectors (All, Cols, :, Between, Not and regular expressions)".

I would also like to see some examples for this case:

What is allowed for function to return is determined by the target_cols value:

  1. If both cols and target_cols are omitted (so only a function is passed), then returning a data frame, a matrix, a NamedTuple, or a DataFrameRow will produce multiple columns in the result. Returning any other value produces a single column.
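
A minimal sketch of the rule quoted above, reusing the df and gd from the start of the thread. The column names n and total are made up for illustration, and the sketch assumes a recent DataFrames.jl release.

```julia
using DataFrames

df = DataFrame(x = [1, 1, 2, 2], y = [1, 2, 101, 102])
gd = groupby(df, :x)

# Only a function is passed, so it receives each group as a SubDataFrame.
# Returning a NamedTuple of scalars gives one output column per field.
multi = combine(gd, sdf -> (n = nrow(sdf), total = sum(sdf.y)))

# Returning any other value (here an Int) gives a single column,
# which gets an automatically generated name.
single = combine(gd, sdf -> sum(sdf.y))
```

Here multi has the columns x, n and total, while single has x plus one generated column.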

We definitely need more examples here. I’ve been meaning to make a table of all the different kinds of outputs. Here’s my explanation of each of the things you tried.

  • combine(gd,:x=>(X->Ref(X))=>:rx): Ref is a scalar, so the results do not get “stacked”. You get a collapse where each group gets one row.
  • combine(gd,:x=>(X->tuple(X...))=>:rx) Similar to the above, a Tuple does not get “stacked”. Only Vectors get stacked.
  • combine(gd,[:x,:y]=>((X,Y)->[(tuple(X...),tuple(Y...))])=>[:rx,:ry]). This one is really confusing. It has to do with the fact that a Vector of Tuples is a table-like object. When you tell DataFrames you want to return multiple columns, DataFrames calls Tables.columntable on the result here.
julia> Tables.columntable([(1, 2)])
(1 = [1], 2 = [2])

julia> Tables.columntable([((1, 2), (3, 4))])
(1 = [(1, 2)], 2 = [(3, 4)])

The second result is the same type of object returned from the original call.

  • combine(gd,[:x,:y]=>((X,Y)->[(Ref(X),Ref(Y))])=>[:rx,:ry]) This is the same as above, but with Ref instead of Tuples.
  • combine(gd,[:x=>(X->Ref(extrema(X)))=>:rx,:y=>(X->Ref((sum(X),mean(X))))=>:ry]). The Ref handling is subtle here. The Ref gets “unwrapped” because it is not nested in a Tuple, as it is in the call above.
julia> Tables.columntable([Ref([1, 2])])
(x = [[1, 2]],)

julia> Tables.columntable([(Ref([1, 2]),)])
(1 = Base.RefValue{Array{Int64,1}}[Base.RefValue{Array{Int64,1}}([1, 2])],)

Note that just nesting in a Tuple is enough to get a SubArray without the Ref.

julia> combine(gd,[:x,:y]=>((X,Y)->[(X,Y)])=>[:rx,:ry])
2×3 DataFrame
 Row │ x      rx         ry         
     │ Int64  SubArray…  SubArray…  
─────┼──────────────────────────────
   1 │     1  [1, 1]     [1, 2]
   2 │     2  [2, 2]     [101, 102]

Maybe the way to think of it is that Tables.columntable only unnests one level. So if your Ref is inside a Tuple you will still get a Ref.

Does this help: "DataFrames.jl minilanguage explained" (blog by Bogumił Kamiński)?


Thank you very much. I took a look at the Tables.columntable function, but my level of knowledge is not sufficient to understand the details. However, the summary you provided me is sufficient.
I had already seen that the following syntax gave the result I was looking for:

combine(gd,[:x,:y]=>((X,Y)->[(X,Y)])=>[:rx,:ry])

Actually, I had tried several others.
Some of these:

combine(gd,[:x,:y]=>((X,Y)->[[X] [Y]])=>[:arrx,:arry])
combine(gd,[:x,:y]=>((X,Y)->[tuple(X...) [Y]])=>[:tupx,:arry])
combine(gd,[:x,:y]=>((X,Y)->[(X...,) [Y]])=>[:tupx,:arry])
combine(gd,[:x,:y]=>((X,Y)->hcat([X],[Y]))=>[:hcatXYx,:hcatXYy])

PS

I actually started doing these tests after a post on Slack where you discussed a transformation that builds a column as a linear combination of other columns, and you said that the macro version from DataFramesMeta was much faster.
If possible, I would like to ask you to explain, either here or in a new discussion if that is more useful, the principles that let the Meta version run in a shorter time.

Certainly. I found it much easier to follow. Thanks.

However, I have not found an example with All.
I would like to ask if “All” and “:” are interchangeable.

We do not give examples of All because it is not used much. Only the form All() is allowed, and it means the same as :, so typically : is used, since it is defined in Julia Base (as opposed to All, which is defined only in DataAPI.jl).
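
As a small sketch of the selectors mentioned in this thread, the following uses names(df, selector) to show which columns each one picks. The data frame here is a hypothetical stand-in with made-up columns, not the one from the blog post.

```julia
using DataFrames

# Hypothetical example table
df = DataFrame(name = ["a", "b"], grade_1 = [90, 80], grade_2 = [70, 60], id = [1, 2])

all_cols   = names(df, All())                        # every column
colon_cols = names(df, :)                            # same selection as All()
not_name   = names(df, Not(:name))                   # everything except :name
between    = names(df, Between(:grade_1, :grade_2))  # a contiguous range of columns
by_regex   = names(df, r"grade")                     # names matching a regular expression
mixed      = names(df, Cols(:id, r"grade"))          # union of several selectors
```

The same selectors can be used on the left-hand side of => in select, transform and combine.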

Nothing in DataFramesMeta will be faster than anything in DataFrames, since they all go down to the same code at the end of the day. One possible exception is that @eachrow will be faster than any code involving

for row in eachrow(df) 
    ...
end

This is because the iterator through rows of a data frame is not type stable. Whereas

@eachrow df begin 
    :a + :b
end

lowers (essentially) to

map(df.a, df.b) do a, b
    a + b
end

which is fast because map acts as a function barrier: the anonymous function is compiled for the concrete element types of the columns, so the loop body is type stable.

But you can achieve this same kind of performance with ByRow in a transform call in DataFrames.
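
A minimal sketch of that ByRow variant, on a made-up two-column table (the column names a, b and c are only for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# ByRow(f) turns f into a function that is applied to the elements of each row
out = transform(df, [:a, :b] => ByRow((a, b) -> a + b) => :c)

# The same computation written as a plain map over the two columns
expected = map((a, b) -> a + b, df.a, df.b)
```

Both out.c and expected hold the row-wise sums.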

In an attempt to understand how ByRow works, I tried to “simulate” it in the following cases from the linked post:

select(df, :name => ByRow(uppercase) => :NAME) == 
select(df, :name => (x -> uppercase.(x)) => :NAME)


select(df, names(df, r"grade") .=> ByRow(x -> x / 100), renamecols=false) == 
select(df, names(df, r"grade") .=> x -> x ./ 100, renamecols=false)

combine(df, :grade_1, :grade_2,AsTable([:grade_1, :grade_2]) => ByRow(x -> x.grade_1 > x.grade_2)=>:OgrtT)

combine(df, :grade_1, :grade_2, AsTable([:grade_1, :grade_2]) => x -> x.grade_1 > x.grade_2)
combine(df, :grade_1, :grade_2, AsTable([:grade_1, :grade_2]) => x -> x.grade_1 .> x.grade_2)

I couldn’t figure out how things go in the following case:

select(df, names(df, r"grade") => +, AsTable(names(df, r"grade")) => sum)

for the part

select(df, names(df, r"grade") => +)
julia> select(df, names(df, r"grade") => ByRow(sum))
ERROR: MethodError: no method matching sum(::Int64, ::Int64, ::Int64)

This seems to work:

julia> select(df, names(df, r"grade") => sum ∘ tuple=>:somma)
6×1 DataFrame
 Row │ somma 
     │ Int64 
─────┼───────
   1 │   255
   2 │   255
   3 │   240
   4 │   265
   5 │   250
   6 │   270

These, instead, fail:


julia> select(df, names(df, r"grade") => (x->sum(tuple(x)))=>:somma)
ERROR: MethodError: no method matching (::var"#81#82")(::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1})
julia> select(df, names(df, r"grade") => x->sum.(zip(x)))
ERROR: MethodError: no method matching (::var"#83#84")(::Array{Int64,1}, ::Array{Int64,1}, ::Array{Int64,1})
julia> sum(tuple([1,2],[11,22],[111,222]))
2-element Array{Int64,1}:
 123
 246

julia> sum.(zip([1,2],[11,22],[111,222]))
2-element Array{Int64,1}:
 123
 246

The same functions applied to three vectors (*) seem to give the expected result in another context.

(*) I thought I understood that a list of three vectors came as input to the functions, but maybe that’s not the case.

I am not 100% sure what you wanted to ask, but most probably about the difference between:

sum ∘ tuple

and

x->sum(tuple(x))

The first function takes an arbitrary number of arguments, turns them into a Tuple, and then applies sum to this Tuple.

The second function takes exactly one argument, turns it into a 1-element Tuple, and then calculates the sum of this one element (so this is essentially a no-op, assuming you pass something that has addition defined).
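
The difference can be checked in plain Julia, without DataFrames. This is only a sketch; f and g are names chosen here for the two functions.

```julia
f = sum ∘ tuple          # f(a, b, c) is sum((a, b, c))
g = x -> sum(tuple(x))   # g(a) is sum((a,)), which equals a

three_args = f([1, 2], [11, 22], [111, 222])  # element-wise sum of the three vectors
one_arg    = g([1, 2])                        # same values as the input

# g([1, 2], [11, 22]) throws a MethodError, because g accepts exactly one argument
```

This mirrors the errors above: select passes the three columns as three separate positional arguments, which f accepts and g does not.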


Thanks. You made it clear to me where my mistake was: in the case of multiple arguments I have to use the ... operator, right?

julia> select(df, names(df, r"grade") => ((x...)->sum(tuple(x...)))=>:somma)
6×1 DataFrame
 Row │ somma 
     │ Int64 
─────┼───────
   1 │   255
   2 │   255
   3 │   240
   4 │   265
   5 │   250
   6 │   270

julia> select(df, names(df, r"grade") => (x...)->sum.(zip(x...)))
6×1 DataFrame
 Row │ grade_1_grade_2_grade_3_function 
     │ Int64
─────┼──────────────────────────────────
   1 │                              255
   2 │                              255
   3 │                              240
   4 │                              265
   5 │                              250
   6 │                              270

PS
I know that this is not the best way to do this. But I wanted to experiment with the various possibilities and understand well what is transferred in the various steps

PPS

Could you please explain where the error is in this expression?

select(df, names(df, r"grade") => ByRow(sum))

whereas this works:

select(df, names(df, r"grade") => ByRow(+))

This seems to be the same thing @bkmins has pointed out. sum does not take a variable number of arguments, just one container, whereas + takes a variable number of arguments, not one container.

julia> sum(1, 2, 3, 4)
ERROR: MethodError: no method matching sum(::Int64, ::Int64, ::Int64, ::Int64)
Closest candidates are:
  sum(::Any, ::Any) at reduce.jl:494
  sum(::Any) at reduce.jl:511
  sum(::Any, ::AbstractArray; dims) at reducedim.jl:723
Stacktrace:
 [1] top-level scope at REPL[3]:1

julia> +(1, 2, 3, 4)
10
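
As a short sketch in plain Julia, the two calling conventions can be bridged in either direction with the ... operator (slurped_sum is a name made up for this example):

```julia
container_sum = sum((1, 2, 3, 4))   # sum takes one container
vararg_sum    = +(1, 2, 3, 4)       # + takes several arguments

# Splatting: spread a container into separate arguments for +
splatted = +((1, 2, 3, 4)...)

# Slurping: collect separate arguments into one tuple for sum
slurped_sum(args...) = sum(args)
slurped = slurped_sum(1, 2, 3, 4)
```

All four results are equal.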

Thanks for pointing this out.
Yes. This was one of the aspects that made it difficult for me to understand how everything works. But it still remains difficult for me to really understand how things are going.

For example:
Thinking about this expression that works

select(df, names(df, r"grade") => ((x...)->sum(tuple(x...)))=>:somma)

I thought that tuple(x...) is the combination of two somewhat opposite operations on the variable x, so I tried to change it like this and saw that this works too:

select(df, names(df, r"grade") => ((x...)->sum(x))=>:somma)

And also this works:

select(df, names(df, r"grade") => ByRow((x...) -> sum(x)) =>:somma) 

My difficulty is figuring out what comes to the function to the right of

names(df, r"grade") =>

I thought it was some container of the three columns, i.e. a vector of vectors, a matrix or a tuple of three vectors.
But obviously this is not the case.

  1. You can replace momentarily your sum(x) by typeof(x), dump(x), or string(x) to create a column with textual representation of the type, so you can print it and check what is the adequate way of dealing with it.
  2. I am not sure I understand your doubt. You are using x... in the anonymous function. The slurping operator is used when a method is called with multiple arguments but you want to receive them as a single parameter: a tuple holding all the arguments. Let us call your anonymous function f. It is called as f(col1, col2, col3) (supposing three columns match r"grade"), but because of slurping x is the tuple (col1, col2, col3), as if there were no slurping and the call had been f( (col1, col2, col3) ). If you run sum( ([1, 2, 3], [4, 5, 6], [7, 8, 9]) ), you will see that it works, because sum expects a container of elements to apply + over, and + on two vectors creates a vector of element-wise sums. If you are using ByRow, then instead of f being called once with whole columns (as in the previous example), it is called once per row and receives the elements of that row. For the same example, what happens is [sum( (1, 4, 7) ), sum( (2, 5, 8) ), sum( (3, 6, 9) )], which achieves the same result by a different path.
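
Both points can be tried in plain Julia. This is only a sketch: plain vectors stand in for the data frame columns, and describe_args is a hypothetical helper name for the inspection trick from point 1.

```julia
col1, col2, col3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]

f(x...) = sum(x)                         # x is a tuple of all positional arguments
describe_args(x...) = string(typeof(x))  # inspection helper: what does the function receive?

# Called once with whole columns, as select does without ByRow
whole = f(col1, col2, col3)

# Called once per row with the elements of that row, as ByRow does
by_row = map(f, col1, col2, col3)

received = describe_args(col1, col2, col3)  # a Tuple type holding three vectors
```

Both whole and by_row contain the row sums 12, 15 and 18, reached by the two different paths.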

Thanks. I could not have hoped for a better explanation than this. (Among other things, I had used the slurping operator without it being clear to me how it works.)

And thanks above all for the inspection “tools” you have indicated to me. I think I’ll make a lot of use of it.
