Assigning multiple values to a Dataframe row using broadcasting

00krishna · March 18, 2025, 4:03pm

I am having some trouble using a vectorized function and assigning the output to the columns of a DataFrame. Here is a simple MWE below. The function returns multiple values, and I want to assign each element of the output to a different column of the DataFrame. Instead, the entire tuple of function outputs gets assigned to every column.

This should be a simple syntax thing, but I am missing it.

using DataFrames

function myfunc(a, b)
    return string(a+b), a, b
end

testdata = DataFrame(rand(5, 4), :auto)
testdata[!, 1] = ["one", "two", "three", "four", "five"]
testdata[!, 2:4] .= myfunc.(collect(1:5), collect(2:6))

But the output looks like this:

5×4 DataFrame
 Row │ x1      x2            x3            x4           
     │ String  Tuple…        Tuple…        Tuple…       
─────┼──────────────────────────────────────────────────
   1 │ one     ("3", 1, 2)   ("3", 1, 2)   ("3", 1, 2)
   2 │ two     ("5", 2, 3)   ("5", 2, 3)   ("5", 2, 3)
   3 │ three   ("7", 3, 4)   ("7", 3, 4)   ("7", 3, 4)
   4 │ four    ("9", 4, 5)   ("9", 4, 5)   ("9", 4, 5)
   5 │ five    ("11", 5, 6)  ("11", 5, 6)  ("11", 5, 6)

I actually want the output to look like this:

5×4 DataFrame
 Row │ x1      x2            x3            x4           
     │ String  String        Int64         Int64        
─────┼──────────────────────────────────────────────────
   1 │ one     "3"           1             2
   2 │ two     "5"           2             3
   3 │ three   "7"           3             4
   4 │ four    "9"           4             5
   5 │ five    "11"          5             6

Thanks for any input.

hendri54 · March 18, 2025, 5:17pm

You don’t want to just use a for loop?

# Changing preallocation to get the right DataTypes
julia> testdata = DataFrame(
               x1 = ["one", "two", "three", "four", "five"],
               x2 = fill("", 5),
               x3 = zeros(Int,5), x4 = zeros(Int,5));

julia> for (j, ab) in enumerate(zip(1:5, 2:6))
               testdata[j,2:4] = myfunc(ab...);
       end

julia> testdata
5×4 DataFrame
 Row │ x1      x2      x3     x4
     │ String  String  Int64  Int64
─────┼──────────────────────────────
   1 │ one     3           1      2
   2 │ two     5           2      3
   3 │ three   7           3      4
   4 │ four    9           4      5
   5 │ five    11          5      6

00krishna · March 18, 2025, 5:19pm

I could totally do it as a for loop. I was just thinking that keeping the nice vectorized structure would be faster and more Julian.

I can use a for loop as a workaround till I know whether a vectorized way is possible though. Thanks for the suggestion.

rocco_sprmnt21 · March 18, 2025, 6:30pm

While I find the request unclear, this is what I think is possibly close to what you were thinking.

testdata[!, 2:4] .= stack(myfunc.(collect(1:5), collect(2:6)),dims=1)

00krishna · March 18, 2025, 6:48pm

Yeah, this worked. Can you explain why this worked, if you don’t mind? I don’t think I have had to use stack for this purpose before. I actually had to write Base.stack() explicitly.

rocco_sprmnt21 · March 18, 2025, 6:54pm

I’m not sure I understand your question, but I’ll try.
I think the problem is that in Julia, tables (even in matrix form) are stored in column order. That is, the second value in storage order is the first entry of the second row, not the second entry of the first row.
So to make broadcasting work for a group of DataFrame columns, you have to put the data in the “correct” order.
stack does this. It could also be done, more laboriously, with zip and other functions.
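
One way to see why the shapes matter is sketched below. A plain `Matrix{Any}` stands in for the three DataFrame columns (that stand-in is an assumption made for illustration, not how DataFrames.jl stores data), and `Base.stack` with `dims` needs Julia 1.9 or later.

```julia
# Same function as in the original post.
myfunc(a, b) = (string(a + b), a, b)

results = myfunc.(1:5, 2:6)      # 5-element vector of 3-tuples

# A length-5 vector broadcast into a 5×3 block is repeated across the
# columns, so every cell of a row receives the whole tuple.
block = Matrix{Any}(undef, 5, 3)
block .= results
@show block[2, 1] block[2, 3]

# stack(...; dims=1) makes the i-th tuple the i-th row of a 5×3 matrix,
# so its shape matches the block element for element.
rows = Base.stack(results; dims=1)
@show size(rows) rows[2, :]
```

The first half reproduces the symptom from the original post (a whole tuple in each cell); the second half shows the 5×3 layout that the `.=` assignment can consume one cell at a time.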

rocco_sprmnt21 · March 18, 2025, 6:58pm

Whether you need the Base. prefix probably depends on the version of Julia you are using.
I am on 1.10.somenumber

00krishna · March 18, 2025, 7:15pm

I think I understand. Julia is column major, so tables, like matrices, are traversed down each column. It would be interesting to understand the internals. But thanks again for the working code here. That gets me past my present issue.

hendri54 · March 18, 2025, 8:49pm

The question is whether the one-liner using stack is “better” than the loop.

I suspect that it would allocate two Matrices, but I’m not sure because of the broadcasting.
If the goal is speed, I would benchmark against the loop before deciding.

Alternatively, one could probably make a one-liner that’s fast using Tullio.jl or TensorCast.jl

trung · March 18, 2025, 10:42pm

I tried both versions with @btime. The for loop solution was faster and allocated less memory than the one-line solution.
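
A rough way to repeat this kind of comparison without any packages is sketched here. It is only an approximation of the thread's code: plain vectors stand in for the DataFrame columns, the helper names `fill_loop!` and `fill_stack` are made up for this example, and `@allocated` reports bytes only, not time. For real timings, `@btime` from BenchmarkTools.jl on the actual DataFrame code is the better tool.

```julia
myfunc(a, b) = (string(a + b), a, b)

# Loop version: write each result straight into preallocated, typed columns.
function fill_loop!(x2, x3, x4, as, bs)
    for (j, (a, b)) in enumerate(zip(as, bs))
        x2[j], x3[j], x4[j] = myfunc(a, b)
    end
    return x2, x3, x4
end

# One-liner version: build the vector of tuples, then stack it into a matrix.
fill_stack(as, bs) = Base.stack(myfunc.(as, bs); dims=1)

as, bs = 1:5, 2:6
x2, x3, x4 = fill("", 5), zeros(Int, 5), zeros(Int, 5)

# Run each once so compilation is not counted, then measure allocated bytes.
fill_loop!(x2, x3, x4, as, bs)
fill_stack(as, bs)
loop_bytes  = @allocated fill_loop!(x2, x3, x4, as, bs)
stack_bytes = @allocated fill_stack(as, bs)
println("loop: ", loop_bytes, " bytes, stack: ", stack_bytes, " bytes")
```

The numbers will vary by machine and Julia version, and five rows is too small to say much; rerunning with a larger range gives a more meaningful picture.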

00krishna · March 27, 2025, 10:28pm

@trung thanks for trying this out. Yeah I will try this both ways as well and see if I can verify your findings on speed. Thanks for the first pass here, it is nice that you tackled the problem in a principled way.

00krishna · March 27, 2025, 10:29pm

Thanks for the analysis here. I will have to make a note that the loop approach is faster, and try benchmarking it myself. I guess there is a tradeoff between the speed of a raw loop versus sometimes the economy of the single line solution. Good to know that my assumptions were off here, thinking that the one liner would be faster.

Dan · March 27, 2025, 11:56pm

In an attempt to minimize allocations:

using StructArrays

sv = (Vector{String}(undef,5), Vector{Int}(undef, 5), Vector{Int}(undef, 5))
sa = StructArray(sv)
sa .= (myfunc(a,b) for (a,b) in zip(1:5,2:6))
for i in 1:3 
    testdata[!,i+1] = sv[i] 
end

achieves the same testdata. The trick is to unpack the myfunc outputs directly into separate vectors which can then replace the testdata data vectors. Essentially, this may save an allocation and a transpose operation.

According to my benchmark, it performs better than the stack solution.
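
The same "one vector per output" idea can also be sketched in Base Julia alone. This is a variation for illustration, not the StructArrays approach itself: it first builds the intermediate vector of tuples, so it likely allocates more than the code above, but each resulting column is concretely typed rather than `Any`.

```julia
myfunc(a, b) = (string(a + b), a, b)

results = myfunc.(1:5, 2:6)      # 5-element vector of (String, Int, Int) tuples

# Pull field i out of every tuple; each result is its own typed vector.
cols = ntuple(i -> getindex.(results, i), 3)

@show typeof(cols[1]) typeof(cols[2])
```

Each `cols[i]` could then be assigned to a column in the same way as `sv[i]` in the loop above.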
