Writing a function that also broadcasts over DataFrames

The fundamental problem is that the element types of a DataFrameRow are not known to the compiler, as the docstrings note:

help?> eachrow
(...)
  eachrow(df::AbstractDataFrame)

  Return a DataFrameRows that iterates a data frame row by row, with
  each row represented as a DataFrameRow.

  Because DataFrameRows have an eltype of Any, use
  copy(dfr::DataFrameRow) to obtain a named tuple, which supports
  iteration and property access like a DataFrameRow, but also passes
  information on the eltypes of the columns of df.

help?> DataFrameRow
(...)  
  A DataFrameRow supports the iteration interface and can therefore be
  passed to functions that expect a collection as an argument. Its
  element type is always Any.
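
To make the docstrings' point concrete, here is a minimal sketch. The data frame `df` and its columns `a` and `b` are made up for illustration; they are not the data used in the benchmarks.

```julia
using DataFrames

# Hypothetical example data (column names chosen to match the a/b expression used in this thread)
df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

row = first(eachrow(df))   # a DataFrameRow
nt  = copy(row)            # a NamedTuple, as described in the eachrow docstring

eltype(row)   # Any: the row type does not carry the column types
typeof(nt)    # a NamedTuple type that records Float64 for both fields
```

The NamedTuple keeps property access (`nt.a`, `nt.b`), so a function written against `row.a` should work on either.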

Instead of eachrow, you can use Tables.namedtupleiterator. It is not exported, so it has to be called with the Tables. prefix. See also "Fast iteration over rows of a DataFrame", post #10 by bkamins.

julia> @btime func.(eachrow($df));
  21.160 ms (898990 allocations: 14.48 MiB)

julia> @btime func.(Tables.namedtupleiterator($df));
  616.200 μs (23 allocations: 3.05 MiB)

julia> @btime map(func, Tables.namedtupleiterator($df));  # Edit: added later. I am not sure why it differs from broadcast, but map is faster here (it makes no difference for the other options).
  152.600 μs (20 allocations: 781.91 KiB)

julia> @btime $df.a .+ $df.b .^ 2 .+ 5;
  49.000 μs (12 allocations: 781.63 KiB)
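
For anyone who wants to rerun these comparisons, the following is a rough sketch of a setup. The definition of `func` and the size of `df` are assumptions: `func` is inferred from the hand-written expression `df.a .+ df.b .^ 2 .+ 5` above, and the original data are not shown in this post.

```julia
using DataFrames, Tables

# Assumed setup: two Float64 columns; the real df in the benchmarks may differ
df = DataFrame(a = rand(1_000), b = rand(1_000))

# Assumed definition, mirroring the column-wise expression a + b^2 + 5
func(row) = row.a + row.b^2 + 5

r_rows  = func.(eachrow(df))                        # rows are DataFrameRows (eltype Any)
r_bcast = func.(Tables.namedtupleiterator(df))      # rows are typed NamedTuples
r_map   = map(func, Tables.namedtupleiterator(df))  # same rows, via map
r_cols  = df.a .+ df.b .^ 2 .+ 5                    # hand-written column expression

# All four approaches should agree on the values; only the speed differs
r_rows ≈ r_bcast ≈ r_map ≈ r_cols
```

Wrapping each call in `@btime` from BenchmarkTools, with `$df` interpolated as above, gives timings of the kind shown in this post.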

Alternatively, you could construct e.g. a StructArray out of the DataFrame and perform your computations on that:

julia> using StructArrays

julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df)  # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p)))  # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end

julia> @btime func.(df_to_structarray($df));
  54.200 μs (45 allocations: 782.97 KiB)
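
A possibly shorter way to get the same named tuple of columns is `Tables.columntable`. This is only a sketch: `df` and `func` here are illustrative assumptions as before, and I have not benchmarked this variant.

```julia
using DataFrames, StructArrays, Tables

# Hypothetical data and function, mirroring the a + b^2 + 5 expression from the benchmarks
df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])
func(row) = row.a + row.b^2 + 5

# Tables.columntable returns a NamedTuple of column vectors, here (a = ..., b = ...)
cols = Tables.columntable(df)

# A StructArray built from a NamedTuple of vectors has typed NamedTuples as elements
sa = StructArray(cols)

result = func.(sa)
```

Each element of `sa` is a NamedTuple with concrete field types, which is why `func` can be compiled for it much like in the namedtupleiterator case.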