The fundamental problem is that the element types of a DataFrameRow
are not known to the compiler:
help?> eachrow
(...)
eachrow(df::AbstractDataFrame)
Return a DataFrameRows that iterates a data frame row by row, with
each row represented as a DataFrameRow.
Because DataFrameRows have an eltype of Any, use
copy(dfr::DataFrameRow) to obtain a named tuple, which supports
iteration and property access like a DataFrameRow, but also passes
information on the eltypes of the columns of df.
help?> DataFrameRow
(...)
A DataFrameRow supports the iteration interface and can therefore be
passed to functions that expect a collection as an argument. Its
element type is always Any.
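What the docstrings describe can be checked directly. This is only a minimal sketch: the small data frame and its column names a and b are assumptions for illustration, not taken from the original question.

```julia
using DataFrames

# Hypothetical example data; the column names are assumptions for illustration.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])

row = first(eachrow(df))   # a DataFrameRow
row_eltype = eltype(row)   # Any, as the docstring states

nt = copy(row)             # a NamedTuple that carries the column types
nt_type = typeof(nt)       # NamedTuple with fields a::Float64 and b::Float64
```

Because the field types are part of the NamedTuple type, code that receives nt can be specialized by the compiler, whereas every field access on row has to be resolved at run time.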
But instead of eachrow, you can use Tables.namedtupleiterator, though this is not exported. See also Fast iteration over rows of a DataFrame - #10 by bkamins
julia> @btime func.(eachrow($df));
21.160 ms (898990 allocations: 14.48 MiB)
julia> @btime func.(Tables.namedtupleiterator($df));
616.200 μs (23 allocations: 3.05 MiB)
julia> @btime map(func, Tables.namedtupleiterator($df)); # Edit: added this. I am not sure why it differs from broadcast, but map is faster here (and makes no difference for the other options)
152.600 μs (20 allocations: 781.91 KiB)
julia> @btime $df.a .+ $df.b .^ 2 .+ 5;
49.000 μs (12 allocations: 781.63 KiB)
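The timings above depend on df and func, which are not shown in this post. The following sketch assumes a two-column data frame and a func that matches the vectorized expression in the last benchmark; both definitions are guesses, so the absolute numbers will differ. It only checks that the three approaches compute the same result.

```julia
using DataFrames
using DataFrames: Tables   # Tables.jl is a dependency of DataFrames

# Assumed setup: the original post does not show how df and func are defined.
df = DataFrame(a = rand(10_000), b = rand(10_000))
func(row) = row.a + row.b^2 + 5

r_eachrow = func.(eachrow(df))                        # rows are DataFrameRow (eltype Any)
r_ntiter  = map(func, Tables.namedtupleiterator(df))  # rows are typed NamedTuples
r_columns = df.a .+ df.b .^ 2 .+ 5                    # column-wise broadcast, no row objects

same_result = r_eachrow ≈ r_ntiter ≈ r_columns
```

To reproduce the measurements, each of the three right-hand sides can be wrapped in @btime from BenchmarkTools, interpolating df with $ as in the transcript above.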
Alternatively, you could construct e.g. a StructArray out of the DataFrame and perform your computations on that:
julia> using StructArrays
julia> function df_to_structarray(df::DataFrame)
           p = propertynames(df) # e.g. [:a, :b, :c]
           nt = NamedTuple(zip(p, getindex.(Ref(df), !, p))) # (a=df.a, b=df.b, c=df.c)
           return StructArray(nt)
       end
julia> @btime func.(df_to_structarray($df));
54.200 μs (45 allocations: 782.97 KiB)
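As a possible shortcut, Tables.columntable also returns a named tuple of the columns, so it might replace the hand-written helper. This is a sketch with assumed example data and an assumed func, not a benchmarked alternative:

```julia
using DataFrames, StructArrays
using DataFrames: Tables   # Tables.jl is a dependency of DataFrames

# Hypothetical example data and function, chosen only for illustration.
df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])
func(row) = row.a + row.b^2 + 5

# columntable(df) yields (a = <column a>, b = <column b>);
# StructArray then exposes the columns as an array of typed NamedTuples.
sa = StructArray(Tables.columntable(df))
result = func.(sa)
```

Each element of sa is a NamedTuple with concrete field types, which is why func can be compiled efficiently for it.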