Performance of eachrow(::DataFrame)

DataFrames are stored with each column contiguous in memory, so an implementation that iterates over eachrow will be slow. Instead, search through df.patient_id directly.
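For example, a minimal sketch of that column search (target_id is just a placeholder for the value being looked up):

idx = findfirst(==(target_id), df.patient_id)
row = idx === nothing ? nothing : df[idx, :]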

Your premise there is true (DataFrames are column-contiguous), but your conclusion doesn't necessarily follow. It would only be the case if you accessed every element of the row. Iterating over eachrow gives you a lazy reference to the row, and it only looks up the values you ask for. Edit: it unfortunately isn't as fast, but the trouble is a type instability, not the access pattern.
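One quick way to see that laziness is that a DataFrameRow is a view into the parent frame rather than a copy (a minimal sketch with made-up column names):

using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
r = first(eachrow(df))   # DataFrameRow: a lazy view, nothing is copied
r.a = 100                # writes through to the parent DataFrame
df.a[1] == 100           # true, so the row never materialized its own storage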

I believe you, but I'm not sure how to prove it.

using DataFrames
using BenchmarkTools
# f scans row-by-row via eachrow; g searches the column vector directly
f(df, x) = findfirst(r -> r.var"1" == x, eachrow(df))
g(df, x) = findfirst(==(x), df.var"1")

@benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
@benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))

julia> @benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):   90.518 ms … 113.115 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.633 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   106.752 ms ±   5.340 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                           ▃ ▃      ▃█▃ █▃      ▃
  ▇▁▁▁▁▁▇▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▇▇▁▇█▇█▁▁▁▁▇▇███▇██▁▇▁▁▁▇█ ▁
  90.5 ms          Histogram: frequency by time          113 ms <

 Memory estimate: 38.13 MiB, allocs estimate: 2498981.

julia> @benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  502.719 μs … 727.920 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     528.437 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   538.021 μs ±  36.613 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▄ ▂▄▄█  ▄    ▄    ▂
  ██▆████▁▆█▆▄█▁█▆▄▄▆█▄█▆▆██▆▁▄▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄▄▁▁▁▁▆ ▁
  503 μs           Histogram: frequency by time          640 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.

The trouble is that a DataFrame's column types aren't part of its type, so every column lookup is type-unstable. Accessing df.col1 up front pays that penalty once, whereas accessing r.col1 for each r in eachrow(df) pays it on every single iteration.
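The usual workaround is therefore to hoist the untyped column lookup out of the loop; a rough sketch reusing the patient_id column from the original question (function names are made up):

using DataFrames

function find_id_rowwise(df, x)
    for (i, r) in enumerate(eachrow(df))
        r.patient_id == x && return i   # untyped column lookup on every iteration
    end
    return nothing
end

function find_id_columnwise(df, x)
    col = df.patient_id                  # untyped lookup paid exactly once
    return findfirst(==(x), col)         # tight loop over a concretely typed vector
end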


Yes, eachrow is designed to be convenient, not fast (because of the type instability).
There are many use cases where eachrow is fast enough and the 5x overhead reported by @jar1 above is negligible from the user's perspective.

If someone wants a type-stable iterator (which is faster in general), Tables.namedtupleiterator can be used (though in this case it is slower than eachrow for some reason; I have not investigated in detail why).
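For reference, a sketch of what that looks like against the benchmark above (Tables.jl is a dependency of DataFrames; the var"1" column name matches @jar1's example):

using DataFrames, Tables

function h(df, x)
    # each element is a concretely typed NamedTuple, so the loop body can specialize
    for (i, r) in enumerate(Tables.namedtupleiterator(df))
        r.var"1" == x && return i
    end
    return nothing
end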
