DataFrames are stored with each column contiguous in memory, so that implementation iterating over eachrow will be slow. Instead, search through df.patient_id.
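For example, something along these lines (a minimal sketch; the patient_id column and the target id are assumptions standing in for the data from the original question):

using DataFrames

df = DataFrame(patient_id = [101, 205, 309], name = ["a", "b", "c"])  # made-up data

idx = findfirst(==(205), df.patient_id)  # scan a single, concretely typed column
row = df[idx, :]                         # pull the full row only after the match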
Your predicate there is true (DataFrames are column contiguous), but your conclusion doesn't necessarily follow. It'd only be the case if you accessed every element of the row. But iterating over eachrow gives you a lazy reference to the row, and it'll only look up/access the values you ask for. Edit: it unfortunately isn't as fast, but the trouble is a type instability, not the access pattern.
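To illustrate the lazy part (a small sketch with made-up data, not code from this thread): a DataFrameRow is a view into the parent table, so nothing is copied and only the columns you actually ask for get looked up.

using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])  # made-up data

r = first(eachrow(df))  # a DataFrameRow: a lazy view of row 1, no copying
r.a                     # only column :a is accessed here
r.a = 10                # writes go straight through to the parent
df.a[1]                 # now 10, confirming r is a view, not a copy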
I believe you, but I'm not sure how to prove it.
using DataFrames
using BenchmarkTools
f(df, x) = findfirst(r -> r.var"1" == x, eachrow(df))  # row-iteration version
g(df, x) = findfirst(==(x), df.var"1")                 # single-column version
@benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
@benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
julia> @benchmark f(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):  90.518 ms … 113.115 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     108.633 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   106.752 ms ±   5.340 ms ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  90.5 ms          Histogram: frequency by time          113 ms <

 Memory estimate: 38.13 MiB, allocs estimate: 2498981.

julia> @benchmark g(d, 500_000) setup = (d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100)))
BenchmarkTools.Trial: 87 samples with 1 evaluation.
 Range (min … max):  502.719 μs … 727.920 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     528.437 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   538.021 μs ±  36.613 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  503 μs           Histogram: frequency by time          640 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
The trouble is that a DataFrame's columns aren't part of its type, so every column access is type unstable. Accessing df.col1 up front pays that penalty once, whereas accessing r.col1 for r in eachrow(df) pays it on every single iteration.
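That is also why g above is fast: pulling out df.var"1" once and handing it to findfirst acts as a function barrier. A generic sketch of the same idea (column names made up):

using DataFrames

# the helper sees a concretely typed Vector, so the comparison in the loop
# compiles to type-stable code; the dynamic lookup of df.id happens only once
find_in_column(col, x) = findfirst(==(x), col)

df = DataFrame(id = collect(1:10^6), val = rand(10^6))
find_in_column(df.id, 500_000)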
Yes, eachrow is designed to be convenient, not fast (because of the type instability). There are many use cases where eachrow is fast enough and the 5x overhead reported by @jar1 above is negligible from the user's perspective.
If someone wants a type-stable iterator (that is faster), then Tables.namedtupleiterator can be used (but in this case it is slower than eachrow for some reason; I have not investigated in detail why).
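For reference, a sketch of how that could look with the 100-column d from the benchmark above (not the exact code from this thread; the inner loop sits behind a function barrier so field access on the NamedTuples is type stable):

using DataFrames
using Tables

function _findrow(rows, x)
    for (i, r) in enumerate(rows)
        r.var"1" == x && return i  # each r is a NamedTuple with concrete field types
    end
    return nothing
end

h(df, x) = _findrow(Tables.namedtupleiterator(df), x)

d = DataFrame(Dict(Symbol(i) => 1:10^6 for i in 1:100))
h(d, 500_000)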