Writing a function that also broadcasts over DataFrames

I love broadcasting over arrays. I love that in Julia I can define func(x) where x is e.g. a namedtuple, and then call func.(x) where x is a vector of namedtuples.

However… I’m having difficulty extending this to DataFrames.jl. I expected this to be straightforward, as DataFrames have the same “shape” as vectors of namedtuples. But how exactly do I do this?

Assume that the implied interface to use this function is that the input is always guaranteed to have properties a and b. Here’s an example of one such function:

function func(x)
    x.a + x.b^2 + 5
end

df = DataFrame(
    a=[1,2,3],
    b=[1.,3,4],
    c=[10.,20.,30.],
)

func(df) errors because it expects the properties to be things it can ^2, but when the input is a DataFrame each property is a Vector.

func.(df) also errors because it broadcasts over all cells: it expects the first cell of the first column (an Int64 in this case) to have properties a and b, so it fails.

I’m aware of the Tables.jl interface, but that would require me to either add an if Tables.istable branch or define a new trait-like method for the Tables case. I don’t like either option. Ideally I don’t want to write a vector-like func; I want some broadcast-like mechanism that works on DataFrames as well. Defining a func method for DataFrames would be acceptable as long as it contained no logic, i.e. it should pass back to the generic func. So for example, writing

function func(x::DataFrame)
    x.a .+ x.b .^ 2 .+ 5
end

would also not be a good solution.
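For reference, here is the behavior I want to generalize: the generic func already broadcasts fine over a plain vector of namedtuples (this just restates the setup above):

```julia
func(x) = x.a + x.b^2 + 5

v = [(a=1, b=1.0), (a=2, b=3.0), (a=3, b=4.0)]
func.(v)  # 3-element Vector{Float64}: [7.0, 16.0, 24.0]
```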

EDIT: I wonder if there’s some lower-level broadcast call that would allow me to control this. Checking…

EDIT2: Oh look, a blog post about broadcasting by Bogumił, the DataFrames.jl guy :slight_smile: Broadcast fusion in Julia: all you need to know to avoid pitfalls | Blog by Bogumił Kamiński. Unfortunately it doesn’t mention DataFrames.

Found one approach.

func.(eachrow(df))

This is logically what I want, and gives the right result. However, I believe it’s not very performant: it’s literally a loop that instantiates a DataFrameRow object for each row and passes that into the function. I think this isn’t the same as what broadcast does, somehow. Look at the difference:

function func(x)
    x.a + x.b^2 + 5
end

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000))

@time func.(eachrow(df));  #   0.065310 seconds (898.99 k allocations: 14.481 MiB, 56.60% gc time)

@time df.a .+ df.b .^2 .+ 5;  #   0.000385 seconds (12 allocations: 781.633 KiB)

Aren’t they named tuples of vectors?

I meant they’re the same “logical” shape. They don’t have the same memory layout, you’re right about that.

In fact, forget about shapes. It’s surprising to me that I have to write df.a .+ df.b .^ 2 .+ 5 when A) I already defined a function func(x) = x.a + x.b^2 + 5 and B) broadcasting exists. Sounds like we’re almost there?

The fundamental problem is that the types in a DataFrameRow are not known:

help?> eachrow
(...)
  eachrow(df::AbstractDataFrame)

  Return a DataFrameRows that iterates a data frame row by row, with
  each row represented as a DataFrameRow.

  Because DataFrameRows have an eltype of Any, use
  copy(dfr::DataFrameRow) to obtain a named tuple, which supports
  iteration and property access like a DataFrameRow, but also passes
  information on the eltypes of the columns of df.

help?> DataFrameRow
(...)  
  A DataFrameRow supports the iteration interface and can therefore be
  passed to functions that expect a collection as an argument. Its
  element type is always Any.
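To see this concretely (a small sketch; the exact printed types may vary by DataFrames version):

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0])
r = first(eachrow(df))

eltype(r)                       # Any -- field types are opaque through a DataFrameRow
typeof(copy(r)) <: NamedTuple   # true -- copy yields a concretely typed named tuple
```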

But instead of eachrow, you can use Tables.namedtupleiterator, though this is not exported. See also Fast iteration over rows of a DataFrame - #10 by bkamins

julia> @btime func.(eachrow($df));
  21.160 ms (898990 allocations: 14.48 MiB)

julia> @btime func.(Tables.namedtupleiterator($df));
  616.200 μs (23 allocations: 3.05 MiB)

julia> @btime map(func, Tables.namedtupleiterator($df))  # Edit: added this. Not sure what's the difference with broadcast, but map is faster here (and does not make a difference for the other options)
  152.600 μs (20 allocations: 781.91 KiB)

julia> @btime $df.a .+ $df.b .^ 2 .+ 5;
  49.000 μs (12 allocations: 781.63 KiB)

Alternatively, you could construct e.g. a StructArray out of the DataFrame and perform your computations on that:

julia> using StructArrays

julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df)  # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end

julia> @btime func.(df_to_structarray($df));
  54.200 μs (45 allocations: 782.97 KiB)

Fyi, from the same author:

Tips and tricks of broadcasting in DataFrames.jl | Blog by Bogumił Kamiński


Thank you for that. However, in that blog post all broadcast examples are “cell-wise”, which isn’t what I want.

@TimG’s comment got me thinking. The difficulty I’m having is not specific to DataFrames; it’s the same for “structs of arrays” as well. For example, here is the same idea using a namedtuple of arrays instead of a DataFrame:

function func(x)
    x.a + x.b^2 + 5
end

x = (; a=rand(100_000), b=rand(100_000), c=rand(100_000))

func.(x)  # ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved
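Broadcasting over namedtuples is reserved, but as a sketch of a workaround, you can map over the zipped columns and rebuild a per-row namedtuple (assuming only a and b are needed):

```julia
func(x) = x.a + x.b^2 + 5

x = (; a=[1.0, 2.0], b=[3.0, 4.0])

# broadcast over the columns, rebuilding a row namedtuple per element
map((a, b) -> func((; a, b)), x.a, x.b)  # [15.0, 23.0]
```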

Nice. So we could define

func(df::DataFrame) = map(func, Tables.namedtupleiterator(df));

This is progress I think :slight_smile:
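As a quick sanity check that this definition really passes back to the generic method (a sketch; I import Tables explicitly since namedtupleiterator is not exported):

```julia
using DataFrames, Tables

func(x) = x.a + x.b^2 + 5
func(df::DataFrame) = map(func, Tables.namedtupleiterator(df))

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0], c=[10.0, 20.0, 30.0])
func(df)  # [7.0, 16.0, 24.0] -- same generic body, no duplicated logic
```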


My intuition would be to think of your ‘inner’ func as a function in two arguments and write:

func(a, b) = a + b + 5
func(x::DataFrame) = func.(x.a, x.b)
df = DataFrame(a = rand(100_000), b = rand(100_000))
func(df)

This will get a bit unwieldy if there are a lot of arguments though. Still, this is sort of what the column-oriented layout of a dataframe seems centered around.


Exactly! that’s exactly why I came here with this question :slight_smile:
Have you seen eldee’s suggestion? I think that’s the best we have so far.

This is likely to be slow, actually. DataFramesMeta has the @with macro which makes an anonymous function to do this quickly.

julia> using DataFramesMeta, Tables, BenchmarkTools

julia> df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000));

julia> f1(df) = func.(eachrow(df));

julia> f2(df) = func.(Tables.namedtupleiterator(df));

julia> f3(df) = map(func, Tables.namedtupleiterator(df));

julia> f4(df) = (@with df begin :a .+ :b .^ 2 .+ 5 end);

julia> @btime f1($df);
  8.228 ms (898990 allocations: 14.50 MiB)

julia> @btime f2($df);
  165.875 μs (22 allocations: 3.09 MiB)

julia> @btime f3($df);
  81.416 μs (20 allocations: 800.66 KiB)

julia> @btime f4($df);
  19.625 μs (18 allocations: 800.62 KiB)

If you want to do things inside dataframes only, there is also AsTable inside the src => fun => dest syntax of DataFrames.

julia> f5(df) = @rselect(df, :_z = func(AsTable(:)))[!,1]; # Return a vector

julia> @btime f5($df);
  90.333 μs (123 allocations: 804.92 KiB)

StructArrays does this the best, imo. It clearly has some optimization making it faster than even broadcasting!

julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df)  # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end;

julia> sa = df_to_structarray(df);

julia> f6(sa) = func.(sa);

julia> @btime f6($sa);
  12.666 μs (3 allocations: 800.06 KiB)

Could you explain why? On my machine this remains the fastest option, closely followed by @with and StructArray (if you include the conversion). It’s also conspicuously missing from your timings :slight_smile: .

Here are my own timings on my 7-year-old CPU.

using DataFrames
using BenchmarkTools
using StructArrays

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000));

func(x) = x.a + x.b^2 + 5

f1(df) = func.(eachrow(df));

f2(df) = func.(Tables.namedtupleiterator(df));

f3(df) = map(func, Tables.namedtupleiterator(df));

@btime f1($df);  #   19.397 ms (898990 allocations: 14.48 MiB)

@btime f2($df);  #   522.568 μs (31 allocations: 3.05 MiB)

@btime f3($df);  #   166.983 μs (20 allocations: 781.91 KiB)

function df_to_structarray(df::DataFrame)
    p = propertynames(df)  # e.g. [:a, :b, :c]
    nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
    return StructArray(nt)
end;

f6(sa) = func.(sa);
sa = df_to_structarray(df);

@btime f6($sa);  #   103.388 μs (3 allocations: 781.32 KiB)
# f6 is the fastest but doesn't qualify because it doesn't actually take DataFrames

f7(df) = func.(df_to_structarray(df));

@btime f7($df);  #   112.665 μs (49 allocations: 783.12 KiB)

I’m skipping f4 because it doesn’t qualify and f5 because it looks too ugly :slight_smile:

Let’s say that manually writing the broadcast is the baseline for this exercise:

@btime $df.a .+ $df.b .^ 2 .+ 5;  #  107.527 μs (12 allocations: 781.63 KiB)

So in my view the fastest solution that qualifies is convert to SA and broadcast on that. Can’t say it’s pretty though :slight_smile:


For “structs of arrays”, there’s the great StructArrays.jl package. It makes it seamless to switch between row- and column-based storage for tables, while keeping the exact same API: just use Vector or StructVector of namedtuples.

Of course, broadcasting (and other Julia functionality) works just as you would expect with structarrays.


Could probably just be StructArray(columntable(df))?
But more generally, you may want to consider just using arrays/structarrays of namedtuples for your tables :slight_smile:
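Indeed, that seems to work (a quick sketch; Tables.columntable returns a namedtuple of column vectors, which the StructArray constructor accepts directly):

```julia
using DataFrames, StructArrays, Tables

func(x) = x.a + x.b^2 + 5

df = DataFrame(a=[1, 2, 3], b=[1.0, 3.0, 4.0], c=[10.0, 20.0, 30.0])
sa = StructArray(Tables.columntable(df))  # wraps the existing columns
func.(sa)  # [7.0, 16.0, 24.0]
```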


Oh I guess @btime might be performing some sort of optimization. But in general

df.a .* df.b .+ df.c

might be slow because Julia doesn’t know what type df.a etc. is. DataFrames are not typed (and that’s a good thing!). Fortunately a function barrier solves this, and

_f(a, b, c) = a .* b .+ c
_f(df.a, df.b, df.c)

solves this problem. The @with macro is just a convenient way to do that process.
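A sketch of that equivalence (@with is from DataFramesMeta; both compute the same result):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0], c=[5.0, 6.0])

# these two are (roughly) the same computation; @with builds the
# anonymous function and passes the columns through it for you
_f(a, b, c) = a .* b .+ c
r1 = _f(df.a, df.b, df.c)
r2 = @with df begin
    :a .* :b .+ :c
end
r1 == r2  # true
```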

I’m pretty sure the broadcasting itself already serves as the function barrier, i.e.
df.a .+ df.b .^ 2 .+ 5 is equivalent to _f(a, b) = a + b^2 + 5; broadcast(_f, df.a, df.b). But it is indeed useful to point out that something like

function manual_func_loop(df)
    v = Vector{Float64}(undef, nrow(df))
    for i = 1:nrow(df)
        v[i] = df.a[i] + df.b[i]^2 + 5  # (or  v[i] = func(df.a[i], df.b[i], df.c[i]) )
    end
    return v
end

will have terrible performance:

julia> @btime manual_func_loop($df);
  20.713 ms (798472 allocations: 12.95 MiB)

(as typeof(DataFrames._columns(df)) === Vector{AbstractVector}).
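For completeness, here is the function-barrier version of that loop (a sketch, passing only the two columns func actually needs, so the element types are concrete inside the loop):

```julia
using DataFrames

# same loop, but behind a barrier where the column types are concrete
function barrier_loop(a::AbstractVector, b::AbstractVector)
    v = Vector{Float64}(undef, length(a))
    for i in eachindex(a, b)
        v[i] = a[i] + b[i]^2 + 5
    end
    return v
end

df = DataFrame(a=rand(100_000), b=rand(100_000), c=rand(100_000))
barrier_loop(df.a, df.b)  # now type-stable inside the loop
```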


This abstractness is presumably also why f6 is fastest (without the conversion to StructArray), as you (unfairly) avoid a dynamic dispatch. Though a single one does not make much of a performance difference.