Assigning multiple values to a DataFrame row using broadcasting

I am having some trouble using a vectorized function and assigning its output to the rows of a DataFrame. A minimal working example (MWE) is below. The function returns multiple values, and I want each element of the output to go into a different column of the DataFrame. Instead, the entire tuple of function outputs is assigned to each column.

This should be a simple syntax thing, but I am missing it.

using DataFrames

function myfunc(a, b)
    return string(a+b), a, b
end

testdata = DataFrame(rand(5, 4), :auto)
testdata[!, 1] = ["one", "two", "three", "four", "five"]
testdata[!, 2:4] .= myfunc.(collect(1:5), collect(2:6))

But the output looks like this:

5×4 DataFrame
 Row │ x1      x2            x3            x4
     │ String  Tuple…        Tuple…        Tuple…
─────┼──────────────────────────────────────────────────
   1 │ one     ("3", 1, 2)   ("3", 1, 2)   ("3", 1, 2)
   2 │ two     ("5", 2, 3)   ("5", 2, 3)   ("5", 2, 3)
   3 │ three   ("7", 3, 4)   ("7", 3, 4)   ("7", 3, 4)
   4 │ four    ("9", 4, 5)   ("9", 4, 5)   ("9", 4, 5)
   5 │ five    ("11", 5, 6)  ("11", 5, 6)  ("11", 5, 6)

I actually want the output to look like this:

5×4 DataFrame
 Row │ x1      x2            x3            x4
     │ String  Tuple…        Tuple…        Tuple…
─────┼──────────────────────────────────────────────────
   1 │ one     "3"           1             2
   2 │ two     "5"           2             3
   3 │ three   "7"           3             4
   4 │ four    "9"           4             5
   5 │ five    "11"          5             6

Thanks for any input.

You don’t want to just use a for loop?

# Changing preallocation to get the right DataTypes
julia> testdata = DataFrame(
               x1 = ["one", "two", "three", "four", "five"],
               x2 = fill("", 5),
               x3 = zeros(Int,5), x4 = zeros(Int,5));

julia> for (j, ab) in enumerate(zip(1:5, 2:6))
               testdata[j,2:4] = myfunc(ab...);
       end

julia> testdata
5×4 DataFrame
 Row │ x1      x2      x3     x4
     │ String  String  Int64  Int64
─────┼──────────────────────────────
   1 │ one     3           1      2
   2 │ two     5           2      3
   3 │ three   7           3      4
   4 │ four    9           4      5
   5 │ five    11          5      6

I could totally do it as a for loop. I was just thinking that keeping the nice vectorized structure would be faster and more Julian.

I can use a for loop as a workaround till I know whether a vectorized way is possible though. Thanks for the suggestion.

While I find the request unclear, this is what I think is possibly close to what you were thinking.

testdata[!, 2:4] .= stack(myfunc.(collect(1:5), collect(2:6)),dims=1)

Yeah, this worked. Can you explain why it worked, if you don’t mind? I don’t think I have had to use stack for this purpose before. I actually had to write Base.stack() explicitly.

I’m not sure I understand the meaning of your question, but I’ll try.
I think the problem is that in Julia, data tables (matrices too) are stored in column order. That is, the second value is the first element of the second row.
So to make broadcasting work for a group of columns of a DataFrame, you have to put the data in the “correct” order.
stack does this. It could be done more laboriously with the zip function and others.
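
To make the shape difference concrete, here is a minimal sketch using only Base, assuming Julia 1.9 or later (where stack is available in Base). The variable names tuples and mat are just for illustration. The broadcast call returns a 5-element vector whose elements are whole tuples, so assigning it to a 5×3 block repeats that vector across the columns. Stacking with dims=1 turns each tuple into one row of a 5×3 matrix, which lines up with the three columns cell by cell.

```julia
# Assumes Julia 1.9+, where `stack` lives in Base.
myfunc(a, b) = (string(a + b), a, b)

tuples = myfunc.(1:5, 2:6)          # 5-element Vector; each element is a whole 3-tuple
mat = Base.stack(tuples; dims=1)    # 5×3 Matrix; each tuple becomes one row

@show size(tuples)   # a single axis of length 5, so it is repeated across columns
@show size(mat)      # 5 rows and 3 columns, matching testdata[!, 2:4]
@show mat[2, :]      # the second tuple, spread across one row
```

Because the tuple mixes a String with two Ints, the matrix cannot have a concrete numeric element type, so the resulting DataFrame columns will likely not be typed as String and Int64 the way they are in the loop version.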

This probably depends on the version of Julia you are using.
I am on 1.10.somenumber

I think I understand. Julia is column-major, so data tables (like matrices) iterate down the columns. It would be interesting to understand the internals. But thanks again for the working code here. That gets me past my present issue.

The question is whether the one-liner using stack is “better” than the loop.

I suspect that it would allocate two Matrices, but I’m not sure because of the broadcasting.
If the goal is speed, I would benchmark against the loop before deciding.
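
A rough sketch of such a benchmark, assuming DataFrames.jl and BenchmarkTools.jl are installed and Julia is 1.9 or later. The helper names fill_loop!, fill_stack! and newdf are made up for this example and are not part of any package. Wrapping each approach in a function and building a fresh DataFrame in setup keeps the table construction out of the timing.

```julia
using DataFrames, BenchmarkTools

myfunc(a, b) = (string(a + b), a, b)

# Hypothetical helper: a fresh, correctly typed table for each benchmark sample.
newdf() = DataFrame(x1 = ["one", "two", "three", "four", "five"],
                    x2 = fill("", 5),
                    x3 = zeros(Int, 5),
                    x4 = zeros(Int, 5))

# Loop version: write one row at a time.
function fill_loop!(df)
    for (j, (a, b)) in enumerate(zip(1:5, 2:6))
        df[j, 2:4] = myfunc(a, b)
    end
    return df
end

# One-liner version: stack the tuples into a 5×3 matrix, then broadcast-assign.
function fill_stack!(df)
    df[!, 2:4] .= Base.stack(myfunc.(1:5, 2:6); dims=1)
    return df
end

@btime fill_loop!(df) setup=(df = newdf());
@btime fill_stack!(df) setup=(df = newdf());
```

With only five rows the timings are dominated by fixed overhead, so it is probably worth repeating the comparison at a realistic table size before drawing conclusions.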

Alternatively, one could probably make a one-liner that’s fast using Tullio.jl or TensorCast.jl.

I tried both versions with @btime: the for loop solution was faster and allocated less than the one-line solution.

@trung thanks for trying this out. Yeah I will try this both ways as well and see if I can verify your findings on speed. Thanks for the first pass here, it is nice that you tackled the problem in a principled way.

Thanks for the analysis here. I will have to make a note that the loop approach is faster, and try benchmarking it myself. I guess there is a tradeoff between the speed of a raw loop versus sometimes the economy of the single line solution. Good to know that my assumptions were off here, thinking that the one liner would be faster.

In an attempt to minimize allocations:

using StructArrays

sv = (Vector{String}(undef,5), Vector{Int}(undef, 5), Vector{Int}(undef, 5))
sa = StructArray(sv)
sa .= (myfunc(a,b) for (a,b) in zip(1:5,2:6))
for i in 1:3 
    testdata[!,i+1] = sv[i] 
end

This achieves the same testdata. The trick is to unpack the myfunc outputs directly into separate vectors, which can then replace the data vectors in testdata. This may save an allocation and a transpose operation.

According to my benchmark, it performs better than the stack solution.
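
For comparison, the same unpacking idea can be sketched with plain vectors and no extra package. This is only an illustration of the principle, not a claim about how StructArrays works internally; the names strs, firsts and seconds are invented for the example.

```julia
myfunc(a, b) = (string(a + b), a, b)

n = 5
# One preallocated vector per output, each with its final element type.
strs    = Vector{String}(undef, n)
firsts  = Vector{Int}(undef, n)
seconds = Vector{Int}(undef, n)

# Destructure each result straight into the three vectors.
for (i, (a, b)) in enumerate(zip(1:5, 2:6))
    strs[i], firsts[i], seconds[i] = myfunc(a, b)
end

# With DataFrames loaded, the vectors could then become columns, e.g.
#   testdata[!, 2] = strs; testdata[!, 3] = firsts; testdata[!, 4] = seconds
```

Because each vector has a concrete element type from the start, the columns keep their String and Int types, and no intermediate vector of tuples or matrix is built.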