Performance of eachrow(::DataFrame)

DataFrames are stored with each column contiguous in memory, so an implementation that iterates over eachrow will be slow. Instead, search through df.patient_id directly.
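For example, a minimal sketch of that column search (target_id is just a placeholder for the value being looked up):

idx = findfirst(==(target_id), df.patient_id)
row = idx === nothing ? nothing : df[idx, :]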

Your premise there is true (DataFrames are column-contiguous), but your conclusion doesn't necessarily follow. It would only be the case if you accessed every element of the row. Iterating over eachrow gives you a lazy reference to the row, and it only looks up the values you ask for. Edit: it unfortunately isn't as fast, but the trouble is a type instability, not the access pattern.
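One quick way to see that laziness is that a DataFrameRow is a view into the parent frame rather than a copy (a minimal sketch with made-up column names):

using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
r = first(eachrow(df))   # DataFrameRow: a lazy view, nothing is copied
r.a = 100                # writes through to the parent DataFrame
df.a[1] == 100           # true, so the row never materialized its own storage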

I believe you, but I'm not sure how to prove it.

using DataFrames
using BenchmarkTools
# f scans row-by-row via eachrow; g searches the column vector directly
f(df, x) = findfirst(r -> r.var"1" == x, eachrow(df))
g(df, x) = findfirst(==(x), df.var"1")

@benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
@benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))

julia> @benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):   90.518 ms … 113.115 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.633 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   106.752 ms ±   5.340 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                           ▃ ▃      ▃█▃ █▃      ▃
  ▇▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▇▇▁▇█▇█▁▁▁▁▇▇███▇██▁▇▁▁▁▇█ ▁
  90.5 ms          Histogram: frequency by time          113 ms <

 Memory estimate: 38.13 MiB, allocs estimate: 2498981.

julia> @benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  502.719 μs … 727.920 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     528.437 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   538.021 μs ±  36.613 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▄ ▂▄▄█  ▄    ▄    ▂
  ██▆████▁▆█▆▄█▁█▆▄▄▆█▄█▆▆██▆▁▄▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄▄▁▁▁▁▆ ▁
  503 μs           Histogram: frequency by time          640 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.

The trouble is that a DataFrame's column types aren't part of its type, so every column lookup is type-unstable. Accessing df.col1 up front pays that penalty once, whereas accessing r.col1 for each r in eachrow(df) pays it on every single iteration.
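The usual workaround is therefore to hoist the untyped column lookup out of the loop; a rough sketch reusing the patient_id column from the original question (function names are made up):

using DataFrames

function find_id_rowwise(df, x)
    for (i, r) in enumerate(eachrow(df))
        r.patient_id == x && return i   # untyped column lookup on every iteration
    end
    return nothing
end

function find_id_columnwise(df, x)
    col = df.patient_id                  # untyped lookup paid exactly once
    return findfirst(==(x), col)         # tight loop over a concretely typed vector
end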


Yes, eachrow is designed to be convenient, not fast (because of the type instability).
There are many use cases where eachrow is fast enough and the 5x overhead reported by @jar1 above is negligible from the user's perspective.

If someone wants a type-stable iterator (which is faster in general), Tables.namedtupleiterator can be used (though in this case it is slower than eachrow for some reason; I have not investigated in detail why).
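For reference, a sketch of what that looks like against the benchmark above (Tables.jl is a dependency of DataFrames; the var"1" column name matches @jar1's example):

using DataFrames, Tables

function h(df, x)
    # each element is a concretely typed NamedTuple, so the loop body can specialize
    for (i, r) in enumerate(Tables.namedtupleiterator(df))
        r.var"1" == x && return i
    end
    return nothing
end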
