I think the main speed difference comes from the fact that the data was read from an .arrow file and was therefore not held in memory. I'm also probably not handling interpolation correctly? Having the dataset in chronological order makes a difference too, but not as much.
Accounting for size and different data types did not make much difference.
using Arrow, BenchmarkTools, DataFrames, Dates
df2 = DataFrame(c1=rand(1:4,10^7), c2=rand(1:4,10^7), c3=rand(1:4,10^7));
df2.dates = rand(Date(2023,10,1):today(),nrow(df2));
df2.times = rand(Time(0,0):Hour(1):Time(23,59), nrow(df2));
The times I got for the methods, from fastest to slowest, were:
@benchmark $df2[findfirst(df2.dates .==Date(2024,2,23) .&& df2.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1569 samples with 1 evaluation.
Range (min … max): 3.027 ms … 26.248 ms ┊ GC (min … max): 0.00% … 87.62%
Time (median): 3.144 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.180 ms ± 762.015 μs ┊ GC (mean ± σ): 0.85% ± 3.09%
3.03 ms Histogram: frequency by time 3.44 ms <
Memory estimate: 1.20 MiB, allocs estimate: 12.
@benchmark $df2[findall(==(Date(2024,2,23)), df2.dates),:][findfirst(==(Time(10,00,00)),df2[findall(==(Date(2024,2,23)), df2.dates),:].times),:]
BenchmarkTools.Trial: 1128 samples with 1 evaluation.
Range (min … max): 4.144 ms … 32.142 ms ┊ GC (min … max): 0.00% … 86.06%
Time (median): 4.339 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.428 ms ± 1.329 ms ┊ GC (mean ± σ): 1.73% ± 4.89%
4.14 ms Histogram: log(frequency) by time 4.82 ms <
Memory estimate: 2.58 MiB, allocs estimate: 50.
@benchmark $df2[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df2.dates,df2.times)),:]
BenchmarkTools.Trial: 350 samples with 1 evaluation.
Range (min … max): 11.626 ms … 25.382 ms ┊ GC (min … max): 0.00% … 44.17%
Time (median): 12.023 ms ┊ GC (median): 0.00%
Time (mean ± σ): 14.319 ms ± 4.501 ms ┊ GC (mean ± σ): 15.44% ± 19.05%
11.6 ms Histogram: frequency by time 24.3 ms <
Memory estimate: 152.59 MiB, allocs estimate: 6
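The 152.59 MiB estimate for the tuple method is consistent with tuple.(df2.dates, df2.times) materializing 10^7 (Date, Time) tuples of 16 bytes each before the search starts. A minimal sketch of a variant that avoids that temporary array follows. It is not one of the methods benchmarked here, and first_row_index is a hypothetical helper name used only for illustration:

```julia
using Dates

# Hypothetical helper: scan both columns in lockstep and stop at the first
# match, without building an array of tuples or a BitVector mask.
function first_row_index(dates, times, d, t)
    for i in eachindex(dates, times)
        if dates[i] == d && times[i] == t
            return i
        end
    end
    return nothing  # no row matched
end

# Tiny hand-made columns so the result is easy to check by eye.
dates = [Date(2024, 2, 22), Date(2024, 2, 23), Date(2024, 2, 23)]
times = [Time(10), Time(9), Time(10)]
idx = first_row_index(dates, times, Date(2024, 2, 23), Time(10))
```

With a DataFrame this would be called as first_row_index(df2.dates, df2.times, d, t), and the row fetched with df2[idx, :] once idx is known not to be nothing.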
Having the data in chronological order made only a small difference in time. Again, nilshg's method is the fastest.
df3 = sort(df2, [:dates, :times])
@benchmark $df3[findfirst(df3.dates .==Date(2024,2,23) .&& df3.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1548 samples with 1 evaluation.
Range (min … max): 3.070 ms … 39.209 ms ┊ GC (min … max): 0.00% … 91.70%
Time (median): 3.176 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.222 ms ± 1.028 ms ┊ GC (mean ± σ): 1.08% ± 3.16%
3.07 ms Histogram: frequency by time 3.37 ms <
Memory estimate: 1.20 MiB, allocs estimate: 12.
@benchmark $df3[findall(==(Date(2024,2,23)), df3.dates),:][findfirst(==(Time(10,00,00)),df3[findall(==(Date(2024,2,23)), df3.dates),:].times),:]
BenchmarkTools.Trial: 1214 samples with 1 evaluation.
Range (min … max): 3.959 ms … 29.343 ms ┊ GC (min … max): 0.00% … 85.85%
Time (median): 4.039 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.113 ms ± 1.132 ms ┊ GC (mean ± σ): 1.34% ± 4.19%
3.96 ms Histogram: frequency by time 4.64 ms <
Memory estimate: 2.58 MiB, allocs estimate: 51.
@benchmark $df3[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df3.dates,df3.times)),:]
BenchmarkTools.Trial: 243 samples with 1 evaluation.
Range (min … max): 17.869 ms … 35.774 ms ┊ GC (min … max): 0.00% … 44.32%
Time (median): 18.219 ms ┊ GC (median): 0.00%
Time (mean ± σ): 20.613 ms ± 5.339 ms ┊ GC (mean ± σ): 10.51% ± 15.80%
Memory estimate: 152.59 MiB, allocs estimate: 6.
The majority of the difference came from reading the data from an .arrow file that was not brought into RAM.
Arrow.write("filepath/datatest.arrow", df3)
df4 = DataFrame(Arrow.Table("filepath/datatest.arrow"))
@benchmark $df4[findall(==(Date(2024,2,23)), df4.dates),:][findfirst(==(Time(10,00,00)),df4[findall(==(Date(2024,2,23)), df4.dates),:].times),:]
BenchmarkTools.Trial: 1177 samples with 1 evaluation.
Range (min … max): 3.996 ms … 46.623 ms ┊ GC (min … max): 0.00% … 91.01%
Time (median): 4.103 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.244 ms ± 2.093 ms ┊ GC (mean ± σ): 2.82% ± 5.19%
4 ms Histogram: log(frequency) by time 4.47 ms <
Memory estimate: 2.58 MiB, allocs estimate: 51.
julia> @benchmark $df4[findfirst(df4.dates .==Date(2024,2,23) .&& df4.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 555 samples with 1 evaluation.
Range (min … max): 8.882 ms … 9.872 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.988 ms ┊ GC (median): 0.00%
Time (mean ± σ): 9.001 ms ± 65.966 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
8.88 ms Histogram: frequency by time 9.27 ms <
Memory estimate: 1.20 MiB, allocs estimate: 12.
julia> @benchmark $df4[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df4.dates,df4.times)),:]
BenchmarkTools.Trial: 218 samples with 1 evaluation.
Range (min … max): 20.146 ms … 35.459 ms ┊ GC (min … max): 0.00% … 36.70%
Time (median): 20.695 ms ┊ GC (median): 0.00%
Time (mean ± σ): 22.958 ms ± 4.874 ms ┊ GC (mean ± σ): 9.31% ± 14.25%
20.1 ms Histogram: log(frequency) by time 35.2 ms <
Memory estimate: 152.59 MiB, allocs estimate: 6.
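One way to check whether the Arrow-backed columns explain the slowdown is to compare them with a copy held in ordinary Julia vectors. This is a minimal sketch, assuming (as I understand the DataFrames.jl and Arrow.jl docs) that DataFrame(Arrow.Table(path)) keeps the columns as memory-mapped Arrow arrays by default and that copycols=true copies them into plain vectors. The small table and the temporary path are stand-ins, not the data from the runs above:

```julia
using Arrow, DataFrames, Dates

# Small stand-in table; a temporary file replaces the placeholder path.
df = DataFrame(dates = rand(Date(2023, 10, 1):Day(1):Date(2024, 3, 31), 1000),
               times = rand(Time(0):Hour(1):Time(23), 1000))
path = joinpath(mktempdir(), "datatest.arrow")
Arrow.write(path, df)

# Default construction: columns are expected to stay Arrow-backed.
df_lazy = DataFrame(Arrow.Table(path))

# copycols=true asks DataFrames to copy every column into memory.
df_mem = DataFrame(Arrow.Table(path); copycols = true)

# Inspect the column types to see which kind of array each frame holds.
println(typeof(df_lazy.dates))
println(typeof(df_mem.dates))
```

If the same benchmarks run on df_mem come out close to the df3 numbers, the extra time on df4 would be attributable to indexing the Arrow arrays rather than to the search methods themselves.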
There's a similar but smaller slowdown with the ffr() function when reading an Arrow file. I'm not sure why this is. Or is it just because I'm handling interpolation incorrectly?
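On the interpolation question: in the benchmarks above only the leading $df2 (or $df3, $df4) is interpolated, while the df2.dates and df2.times inside the expression are still looked up as globals. A minimal sketch of a fully interpolated benchmark follows, assuming a smaller stand-in table and a hypothetical helper first_match (neither is from the original runs):

```julia
using BenchmarkTools, DataFrames, Dates

# Smaller stand-in for df2 so the example runs quickly.
n = 10^5
df = DataFrame(c1 = rand(1:4, n))
df.dates = rand(Date(2023, 10, 1):Day(1):Date(2024, 3, 31), n)
df.times = rand(Time(0):Hour(1):Time(23), n)

# Hypothetical helper: wrapping the lookup in a function means only the
# function arguments have to be interpolated.
function first_match(df, d, t)
    i = findfirst((df.dates .== d) .& (df.times .== t))
    return i === nothing ? nothing : df[i, :]
end

d, t = Date(2024, 2, 23), Time(10)

# Every global is passed with `$`, so no untyped global lookup is timed.
trial = @benchmark first_match($df, $d, $t)
```

Each method can be wrapped the same way. I would expect the effect of the missing interpolation to be small next to a scan over 10^7 rows, but this rules it out as the cause.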