Writing a function that also broadcasts over DataFrames

I love broadcasting over arrays. I love that in Julia I can define func(x) where x is e.g. a namedtuple, and then call func.(x) where x is a vector of namedtuples.

However… I’m having difficulty extending this to DataFrames.jl. I expected this to be straightforward, as DataFrames have the same “shape” as vectors of namedtuples. But how exactly do I do this?

Assume that the implied interface to use this function is that the input is always guaranteed to have properties a and b. Here’s an example of one such function:

function func(x)
    x.a + x.b^2 + 5
end

df = DataFrame(
    a=[1,2,3],
    b=[1.,3,4],
    c=[10.,20.,30.],
)

func(df) errors because it expects the properties to be things it can ^2, but when the input is a DataFrame each property is a Vector.

func.(df) also errors because it broadcasts over all cells: it expects the first cell of the first column (an Int64 in this case) to have properties a and b, so it fails.

I’m aware of the Tables.jl interface, but that would require me to either add an if Tables.istable branch or define a new trait-like method for the Tables case. I don’t like either option. Ideally I don’t want to write a vector-like func; I want some broadcast-like mechanism that works on DataFrames as well. Defining a func method for DataFrames would be acceptable as long as it contained no logic, i.e. it should pass back to the generic func. So for example, writing

function func(x::DataFrame)
    x.a .+ x.b .^ 2 .+ 5
end

would also not be a good solution.
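For reference, here is the behavior I want to generalize: the generic func already broadcasts fine over a plain vector of namedtuples (this just restates the setup above):

```julia
func(x) = x.a + x.b^2 + 5

v = [(a=1, b=1.0), (a=2, b=3.0), (a=3, b=4.0)]
func.(v)  # 3-element Vector{Float64}: [7.0, 16.0, 24.0]
```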

EDIT: I wonder if there’s some lower-level broadcast call that would allow me to control this. Checking…

EDIT2: Oh look, a blog post about broadcasting by Bogumił, the DataFrames.jl guy :slight_smile: Broadcast fusion in Julia: all you need to know to avoid pitfalls | Blog by Bogumił Kamiński. Unfortunately it doesn’t mention DataFrames.

Found one approach.

func.(eachrow(df))

This is logically what I want, and gives the right result. However, I believe it’s not very performant: it’s literally a loop that instantiates a DataFrameRow object for each row and passes that into the function. I think this isn’t the same as what broadcast does, somehow. Look at the difference:

function func(x)
    x.a + x.b^2 + 5
end

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000))

@time func.(eachrow(df));  #   0.065310 seconds (898.99 k allocations: 14.481 MiB, 56.60% gc time)

@time df.a .+ df.b .^2 .+ 5;  #   0.000385 seconds (12 allocations: 781.633 KiB)

Aren’t they named tuples of vectors?

I meant they’re the same “logical” shape. They don’t have the same memory layout, you’re right about that.

In fact, forget about shapes. It’s surprising to me that I have to write df.a .+ df.b .^ 2 .+ 5 when A) I already defined a function func(x) = x.a + x.b^2 + 5 and B) broadcasting exists. Sounds like we’re almost there?

The fundamental problem is that the types in a DataFrameRow are not known:

help?> eachrow
(...)
  eachrow(df::AbstractDataFrame)

  Return a DataFrameRows that iterates a data frame row by row, with
  each row represented as a DataFrameRow.

  Because DataFrameRows have an eltype of Any, use
  copy(dfr::DataFrameRow) to obtain a named tuple, which supports
  iteration and property access like a DataFrameRow, but also passes
  information on the eltypes of the columns of df.

help?> DataFrameRow
(...)  
  A DataFrameRow supports the iteration interface and can therefore be
  passed to functions that expect a collection as an argument. Its
  element type is always Any.
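To see this concretely (a small sketch; the exact printed types may vary by DataFrames version):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0])
r = first(eachrow(df))

eltype(r)                       # Any -- field types are opaque through a DataFrameRow
typeof(copy(r)) <: NamedTuple   # true -- copy yields a concretely typed named tuple
```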

But instead of eachrow, you can use Tables.namedtupleiterator, though this is not exported. See also Fast iteration over rows of a DataFrame - #10 by bkamins

julia> @btime func.(eachrow($df));
  21.160 ms (898990 allocations: 14.48 MiB)

julia> @btime func.(Tables.namedtupleiterator($df));
  616.200 μs (23 allocations: 3.05 MiB)

julia> @btime map(func, Tables.namedtupleiterator($df))  # Edit: added this. Not sure what's the difference with broadcast, but map is faster here (and does not make a difference for the other options)
  152.600 μs (20 allocations: 781.91 KiB)

julia> @btime $df.a .+ $df.b .^ 2 .+ 5;
  49.000 μs (12 allocations: 781.63 KiB)

Alternatively, you could construct e.g. a StructArray out of the DataFrame and perform your computations on that:

julia> using StructArrays

julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df)  # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end

julia> @btime func.(df_to_structarray($df));
  54.200 μs (45 allocations: 782.97 KiB)

Fyi, from the same author:

Tips and tricks of broadcasting in DataFrames.jl | Blog by Bogumił Kamiński


Thank you for that. However, in that blog post all broadcast examples are “cell-wise”, which isn’t what I want.

@TimG’s comment got me thinking. The difficulty I’m having is not specific to DataFrames; it’s the same for “structs of arrays” as well. For example, here is the same idea using a namedtuple of arrays instead of a DataFrame:

function func(x)
    x.a + x.b^2 + 5
end

x = (; a=rand(100_000), b=rand(100_000), c=rand(100_000))

func.(x)  # ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
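Broadcasting over namedtuples is reserved, but as a sketch of a workaround, you can map over the zipped columns and rebuild a per-row namedtuple (assuming only a and b are needed):

```julia
func(x) = x.a + x.b^2 + 5

x = (; a=[1.0, 2.0], b=[3.0, 4.0])

# broadcast over the columns, rebuilding a row namedtuple per element
map((a, b) -> func((; a, b)), x.a, x.b)  # [15.0, 23.0]
```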

Nice. So we could define

func(df::DataFrame) = map(func, Tables.namedtupleiterator(df));

This is progress I think :slight_smile:
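As a quick sanity check that this definition really passes back to the generic method (a sketch; I import Tables explicitly since namedtupleiterator is not exported):

```julia
using DataFrames, Tables

func(x) = x.a + x.b^2 + 5
func(df::DataFrame) = map(func, Tables.namedtupleiterator(df))

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0], c=[10.0, 20.0, 30.0])
func(df)  # [7.0, 16.0, 24.0] -- same generic body, no duplicated logic
```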


My intuition would be to think of your ‘inner’ func as a function in two arguments and write:

func(a, b) = a + b + 5
func(x::DataFrame) = func.(x.a, x.b)
df = DataFrame(a = rand(100_000), b = rand(100_000))
func(df)

This will get a bit unwieldy if there are a lot of arguments though. Still, this is sort of what the column-oriented layout of a dataframe seems centered around.


Exactly! that’s exactly why I came here with this question :slight_smile:
Have you seen eldee’s suggestion? I think that’s the best we have so far.

This is likely to be slow, actually. DataFramesMeta has the @with macro which makes an anonymous function to do this quickly.

julia> using DataFramesMeta, Tables, BenchmarkTools

julia> df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000));

julia> f1(df) = func.(eachrow(df));

julia> f2(df) = func.(Tables.namedtupleiterator(df));

julia> f3(df) = map(func, Tables.namedtupleiterator(df));

julia> f4(df) = (@with df begin :a .+ :b .^ 2 .+ 5 end);

julia> @btime f1($df);
  8.228 ms (898990 allocations: 14.50 MiB)

julia> @btime f2($df);
  165.875 μs (22 allocations: 3.09 MiB)

julia> @btime f3($df);
  81.416 μs (20 allocations: 800.66 KiB)

julia> @btime f4($df);
  19.625 μs (18 allocations: 800.62 KiB)

If you want to do things inside dataframes only, there is also AsTable inside the src => fun => dest syntax of DataFrames.

julia> f5(df) = @rselect(df, :_z = func(AsTable(:)))[!,1]; # Return a vector

julia> @btime f5($df);
  90.333 μs (123 allocations: 804.92 KiB)

StructArrays does this the best, imo. It clearly has some optimization making it faster than even broadcasting!

julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df)  # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end;

julia> sa = df_to_structarray(df);

julia> f6(sa) = func.(sa);

julia> @btime f6($sa);
  12.666 μs (3 allocations: 800.06 KiB)

Could you explain why? On my machine this remains the fastest option, closely followed by @with and StructArray (if you include the conversion). It’s also conspicuously missing from your timings :slight_smile: .

Here are my own timings on my 7-year-old CPU.

using DataFrames
using BenchmarkTools
using StructArrays

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000));

func(x) = x.a + x.b^2 + 5

f1(df) = func.(eachrow(df));

f2(df) = func.(Tables.namedtupleiterator(df));

f3(df) = map(func, Tables.namedtupleiterator(df));

@btime f1($df);  #   19.397 ms (898990 allocations: 14.48 MiB)

@btime f2($df);  #   522.568 μs (31 allocations: 3.05 MiB)

@btime f3($df);  #   166.983 μs (20 allocations: 781.91 KiB)

function df_to_structarray(df::DataFrame)
    p = propertynames(df)  # e.g. [:a, :b, :c]
    nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
    return StructArray(nt)
end;

f6(sa) = func.(sa);
sa = df_to_structarray(df);

@btime f6($sa);  #   103.388 μs (3 allocations: 781.32 KiB)
# f6 is the fastest but doesn't qualify because it doesn't actually take DataFrames

f7(df) = func.(df_to_structarray(df));

@btime f7($df);  #   112.665 μs (49 allocations: 783.12 KiB)

I’m skipping f4 because it doesn’t qualify and f5 because it looks too ugly :slight_smile:

Let’s say that manually writing the broadcast is the baseline for this exercise:

@btime $df.a .+ $df.b .^ 2 .+ 5;  #  107.527 μs (12 allocations: 781.63 KiB)

So in my view the fastest solution that qualifies is convert to SA and broadcast on that. Can’t say it’s pretty though :slight_smile:


For “structs of arrays”, there’s the great StructArrays.jl package. It makes it seamless to switch between row- and column-based storage for tables, while keeping the exact same API: just use Vector or StructVector of namedtuples.

Of course, broadcasting (and other Julia functionality) works just as you would expect with structarrays.


Could probably just be StructArray(columntable(df))?
But more generally, you may want to consider just using arrays/structarrays of namedtuples for your tables :slight_smile:
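Indeed, that seems to work (a quick sketch; Tables.columntable returns a namedtuple of column vectors, which the StructArray constructor accepts directly):

```julia
using DataFrames, StructArrays, Tables

func(x) = x.a + x.b^2 + 5

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0], c=[10.0, 20.0, 30.0])
sa = StructArray(Tables.columntable(df))  # wraps the existing columns
func.(sa)  # [7.0, 16.0, 24.0]
```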


Oh I guess @btime might be performing some sort of optimization. But in general

df.a .* df.b .+ df.c

might be slow because Julia doesn’t know what type df.a etc. is. DataFrames are not typed (and that’s a good thing!). Fortunately a function barrier solves this, and

_f(a, b, c) = a .* b .+ c
_f(df.a, df.b, df.c)

solves this problem. The @with macro is just a convenient way to do that process.
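A sketch of that equivalence (@with is from DataFramesMeta; both compute the same result):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0], c=[5.0, 6.0])

# these two are (roughly) the same computation; @with builds the
# anonymous function and passes the columns through it for you
_f(a, b, c) = a .* b .+ c
r1 = _f(df.a, df.b, df.c)
r2 = @with df begin
    :a .* :b .+ :c
end
r1 == r2  # true
```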

I’m pretty sure the broadcasting itself already serves as the function barrier, i.e.
df.a .+ df.b .^ 2 .+ 5 is equivalent to _f(a, b) = a + b^2 + 5; broadcast(_f, df.a, df.b). But it is indeed useful to point out that something like

function manual_func_loop(df)
    v = Vector{Float64}(undef, nrow(df))
    for i = 1:nrow(df)
        v[i] = df.a[i] + df.b[i]^2 + 5  # (or  v[i] = func(df.a[i], df.b[i], df.c[i]) )
    end
    return v
end

will have terrible performance:

julia> @btime manual_func_loop($df);
  20.713 ms (798472 allocations: 12.95 MiB)

(as typeof(DataFrames._columns(df)) === Vector{AbstractVector}).
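For completeness, here is the function-barrier version of that loop (a sketch, passing only the two columns func actually needs, so the element types are concrete inside the loop):

```julia
using DataFrames

# same loop, but behind a barrier where the column types are concrete
function barrier_loop(a::AbstractVector, b::AbstractVector)
    v = Vector{Float64}(undef, length(a))
    for i in eachindex(a, b)
        v[i] = a[i] + b[i]^2 + 5
    end
    return v
end

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000))
barrier_loop(df.a, df.b)  # now type-stable inside the loop
```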


This abstractness is presumably also why f6 is fastest (without the conversion to StructArray), as you (unfairly) avoid a dynamic dispatch. Though a single one does not make much of a performance difference.