Using findfirst with for multiple values and columns in a Dataframe?

Is there a way to extend the findfirst(==(val1), df.col1) for more than one value over more than column in a dataframe. Something like

findfirst(==([ val1, val2]) , [df.col1, df.col2])

I tried

findfirst(row -> (row[1] == val1 && row[2] == val2), eachrow(select(df, [:col1, :col2])))

but that seems to cause a significant slowdown

@benchmark findfirst(row -> (row[1] == Date(2024,2,23) && row[2] == Time(10,00,00)), eachrow(select(df1, [:date, :time])))
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  602.970 ms … 658.486 ms  β”Š GC (min … max): 0.00% … 6.52%
 Time  (median):     608.596 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   626.526 ms Β±  25.573 ms  β”Š GC (mean Β± Οƒ):  2.97% Β± 3.38%

  β–ˆβ–  ▁ ▁                                          ▁▁        ▁▁  
  β–ˆβ–ˆβ–β–β–ˆβ–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–β–β–β–β–β–β–β–β–ˆβ–ˆ ▁
  603 ms           Histogram: frequency by time          658 ms <

 Memory estimate: 284.43 MiB, allocs estimate: 14908879.

as compared to creating a subset with findall on val1 then findfirst on the subset with val2 e.g.

@benchmark df1[findall(==(Date(2024,2,23)), df1.date),:][findfirst(==(Time(10,00,00)),df1[findall(==(Date(2024,2,23)), df1.date),:].time),:]
BenchmarkTools.Trial: 762 samples with 1 evaluation.
 Range (min … max):  5.964 ms … 22.915 ms  β”Š GC (min … max): 0.00% … 72.43%
 Time  (median):     6.324 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.562 ms Β±  1.695 ms  β”Š GC (mean Β± Οƒ):  3.08% Β±  8.16%

   β–ˆβ–„β–                                                        
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–β–β–β–β–„ β–‡
  5.96 ms      Histogram: log(frequency) by time     19.1 ms <

 Memory estimate: 7.27 MiB, allocs estimate: 2008.

or masking directly

df[ df.col1 .== val1 .&& df.col2 .== val2, :][1,:]

julia> @benchmark  df1[ df1.date .== Date(2024,2,23) .&& df1.time .== Time(10,00,00), :][1,:]
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min … max):  26.848 ms …  27.793 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     26.971 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   27.008 ms Β± 136.690 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–ƒβ–ˆβ–‡β–‚β–ƒβ–„                                                  
  β–ƒβ–†β–…β–‡β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–ƒβ–„β–†β–†β–†β–ƒβ–…β–ƒβ–β–„β–ƒβ–ƒβ–ƒβ–ƒβ–„β–β–ƒβ–β–β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–β–β–ƒβ–β–β–β–β–β–β–β–ƒβ–β–β–β–β–ƒ β–ƒ
  26.8 ms         Histogram: frequency by time         27.6 ms <

 Memory estimate: 461.77 KiB, allocs estimate: 36.

I checked the documentation but it did not include anything on it so just wondering if there was a more direct approach using findfirst for more multiple values over multiple arrays.

1 Like

findfirst(df.col1 .== val1 .&& df.col2 .== val2)

1 Like

awesome thanks! Just curious why is subsetting with findall on val1 first meaningfully faster?

 @benchmark df1[findfirst(df1.date .==(Date(2024,2,23)) .&& df1.time .==(Time(10,00,00))),:]
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min … max):  26.772 ms … 27.584 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     26.872 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   26.891 ms Β± 96.213 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–„β–‡β–…β–‚β–†β–ƒβ–„β–ˆ ▃▂▁▁                                          
  β–ƒβ–ƒβ–ƒβ–†β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–…β–ƒβ–ƒβ–ƒβ–β–„β–†β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–ƒβ–ƒβ–ƒβ–ƒβ–β–ƒβ–β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–ƒβ–β–ƒ β–ƒ
  26.8 ms         Histogram: frequency by time        27.2 ms <

 Memory estimate: 460.20 KiB, allocs estimate: 12.
findfirst(==((v1,v2)), tuple.(df.c1,df.c2))
2 Likes

What expression are you referring to?
Can you post a data set to test on?

I’ve tried different ways to check times and allocations and something like the following seems the most efficient (although I don’t understand why it has even a small allocation):


function ffr(c1,c2,f,l)
    i=1
    for (fr,lr) in zip(c1,c2)
        fr == f && lr == l && return i
        i+=1
    end
end

when compared to expressions like these


findfirst(==((5,5)), tuple.(df.c1,df.c2))
findfirst(==((5,5)), collect(zip(df.c1,df.c2)))

 findfirst(t->t[1]==5 &&t[2]==5, tuple.(df.c1,df.c2))

 findfirst(r->(r.c1,r.c2)==(5,5), eachrow(df))
 findfirst(r->r.c1==5 && r.c2==5, eachrow(df))

 findfirst(df.c1 .== 5 .&& df.c2 .== 5)

I used this df for testing

df=DataFrame(c1=rand(1:4,10^3),c2=rand(1:4,10^3),c3=rand(1:4,10^3))
push!(df,(5,5,5))

I think the main speed difference comes from the fact that the data was read from an .arrow file and therefore not stored in memory. Also I’m probably not handling interpolation correctly? Having the dataset in chronological order also makes a difference but not as much.

Accounting for size and different data types did not make much difference.

df2=DataFrame(c1=rand(1:4,10^7),c2=rand(1:4,10^7),c3=rand(1:4,10^7)); 
df2.dates = rand(Date(2023,10,1):today(),nrow(df2));
df2.times = rand(Time(0,0):Hour(1):Time(23,59), nrow(df2));

The times for the methods from fastest to slowest I got were

@benchmark $df2[findfirst(df2.dates .==Date(2024,2,23) .&& df2.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1569 samples with 1 evaluation.
 Range (min … max):  3.027 ms …  26.248 ms  β”Š GC (min … max): 0.00% … 87.62%
 Time  (median):     3.144 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.180 ms Β± 762.015 ΞΌs  β”Š GC (mean Β± Οƒ):  0.85% Β±  3.09%

               β–‡β–ˆβ–„β–…β–„β–‚                                          
  β–‚β–‚β–‚β–ƒβ–‚β–‚β–β–‚β–β–β–‚β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–β–β–‚β–‚β–β–β–β–β–β–β–β–β–‚β–‚β–β–‚β–β–β–β–β–‚β–β–‚ β–ƒ
  3.03 ms         Histogram: frequency by time        3.44 ms <
Memory estimate: 1.20 MiB, allocs estimate: 12.

@benchmark $df2[findall(==(Date(2024,2,23)), df2.dates),:][findfirst(==(Time(10,00,00)),df2[findall(==(Date(2024,2,23)), df2.dates),:].times),:]
BenchmarkTools.Trial: 1128 samples with 1 evaluation.
 Range (min … max):  4.144 ms … 32.142 ms  β”Š GC (min … max): 0.00% … 86.06%
 Time  (median):     4.339 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.428 ms Β±  1.329 ms  β”Š GC (mean Β± Οƒ):  1.73% Β±  4.89%

              β–„β–…β–‡β–ˆβ–ˆβ–‡β–‡β–†β–†β–„β–ƒβ–‚β–‚                                  ▁
  β–†β–†β–†β–…β–…β–†β–…β–„β–…β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–„β–†β–…β–„β–„β–†β–…β–β–β–β–„β–β–„β–β–„β–β–„β–β–…β–β–β–β–…β–β–β–β–β–β–β–… β–ˆ
  4.14 ms      Histogram: log(frequency) by time     4.82 ms <
Memory estimate: 2.58 MiB, allocs estimate: 50.

@benchmark $df2[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df2.dates,df2.times)),:]
BenchmarkTools.Trial: 350 samples with 1 evaluation.
 Range (min … max):  11.626 ms … 25.382 ms  β”Š GC (min … max):  0.00% … 44.17%
 Time  (median):     12.023 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   14.319 ms Β±  4.501 ms  β”Š GC (mean Β± Οƒ):  15.44% Β± 19.05%

  β–ˆβ–†                                                           
  β–ˆβ–ˆβ–‡β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–ƒβ–…β–„β–„β–ƒβ–ƒβ–ƒβ–‚ β–‚
  11.6 ms         Histogram: frequency by time        24.3 ms <
Memory estimate: 152.59 MiB, allocs estimate: 6

Having the data in chronological order accounted for some small difference in time but not much. As again nilshg’s method is the fastest.

df3 = sort(df2, [:dates, :times])

@benchmark $df3[findfirst(df3.dates .==Date(2024,2,23) .&& df3.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1548 samples with 1 evaluation.
 Range (min … max):  3.070 ms … 39.209 ms  β”Š GC (min … max): 0.00% … 91.70%
 Time  (median):     3.176 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.222 ms Β±  1.028 ms  β”Š GC (mean Β± Οƒ):  1.08% Β±  3.16%

                  β–„β–ˆβ–ƒβ– ▂▁                                     
  β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–‚ β–ƒ
  3.07 ms        Histogram: frequency by time        3.37 ms <

 Memory estimate: 1.20 MiB, allocs estimate: 12.

@benchmark $df3[findall(==(Date(2024,2,23)), df3.dates),:][findfirst(==(Time(10,00,00)),df3[findall(==(Date(2024,2,23)), df3.dates),:].times),:]
BenchmarkTools.Trial: 1214 samples with 1 evaluation.
 Range (min … max):  3.959 ms … 29.343 ms  β”Š GC (min … max): 0.00% … 85.85%
 Time  (median):     4.039 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.113 ms Β±  1.132 ms  β”Š GC (mean Β± Οƒ):  1.34% Β±  4.19%

       β–ˆβ–‡β–ƒβ–                                                   
  β–‚β–β–β–ƒβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–‚β–β–‚β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–‚ β–ƒ
  3.96 ms        Histogram: frequency by time        4.64 ms <
Memory estimate: 2.58 MiB, allocs estimate: 51.

@benchmark $df3[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df3.dates,df3.times)),:]
BenchmarkTools.Trial: 243 samples with 1 evaluation.
 Range (min … max):  17.869 ms … 35.774 ms  β”Š GC (min … max):  0.00% … 44.32%
 Time  (median):     18.219 ms              β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   20.613 ms Β±  5.339 ms  β”Š GC (mean Β± Οƒ):  10.51% Β± 15.80%

  β–ˆβ–†β–ƒβ–‚β–                                               ▂▁       
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–ˆβ–…β–„β–β–„β–„β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–„β–β–β–β–β–β–…β–ˆβ–ˆβ–†β–…β–β–„β–„ β–…

Memory estimate: 152.59 MiB, allocs estimate: 6.

Majority of the difference came from reading the data from a .arrow that was not brought into RAM.

Arrow.write("filepath/datatest.arrow", df3)
df4 = DataFrame(Arrow.Table("/filepath/datatest.arrow")

@benchmark $df4[findall(==(Date(2024,2,23)), df4.dates),:][findfirst(==(Time(10,00,00)),df4[findall(==(Date(2024,2,23)), df4.dates),:].times),:]
BenchmarkTools.Trial: 1177 samples with 1 evaluation.
 Range (min … max):  3.996 ms … 46.623 ms  β”Š GC (min … max): 0.00% … 91.01%
 Time  (median):     4.103 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   4.244 ms Β±  2.093 ms  β”Š GC (mean Β± Οƒ):  2.82% Β±  5.19%

           β–‚β–‡β–ˆβ–ˆβ–‡β–†β–†β–„β–ƒβ–ƒβ–‚   ▁                                    
  β–„β–β–β–β–„β–„β–„β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–‡β–…β–‡β–β–„β–†β–β–…β–β–…β–„β–†β–β–„β–„β–…β–„β–…β–†β–β–β–…β–„β–†β–β–β–β–…β–…β–β–†β–† β–ˆ
  4 ms         Histogram: log(frequency) by time     4.47 ms <

Memory estimate: 2.58 MiB, allocs estimate: 51.

julia> @benchmark $df4[findfirst(df4.dates .==Date(2024,2,23) .&& df4.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 555 samples with 1 evaluation.
 Range (min … max):  8.882 ms …  9.872 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     8.988 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.001 ms Β± 65.966 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

               β–†β–ˆβ–†β–„ ▁                                         
  β–‚β–‚β–β–β–β–β–β–β–β–β–β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–„β–„β–ƒβ–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚ β–ƒ
  8.88 ms        Histogram: frequency by time        9.27 ms <

 Memory estimate: 1.20 MiB, allocs estimate: 12.

julia> @benchmark $df4[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df4.dates,df4.times)),:]
BenchmarkTools.Trial: 218 samples with 1 evaluation.
 Range (min … max):  20.146 ms … 35.459 ms  β”Š GC (min … max): 0.00% … 36.70%
 Time  (median):     20.695 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   22.958 ms Β±  4.874 ms  β”Š GC (mean Β± Οƒ):  9.31% Β± 14.25%

  β–„β–ˆβ–†β–‚β–‚β–                                              ▂▁       
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–†β–†β–„β–†β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–β–β–β–β–„β–β–β–β–β–β–β–ˆβ–ˆβ–‡β–„β–†β–†β–„ β–†
  20.1 ms      Histogram: log(frequency) by time      35.2 ms <

 Memory estimate: 152.59 MiB, allocs estimate: 6.

There’s a similar but smaller slowdown with the ffr() function when reading an arrow file. I’m not sure why this is or is it just because i’m handling interpolation incorrectly?

ffr seems much more efficient or am I wrong?

julia> function ffr(c1,c2,f,l)
           i=1
           for (fr,lr) in zip(c1,c2)
               fr == f && lr == l && return i
               i+=1
           end
       end
ffr (generic function with 1 method)

julia> @btime df2[ffr(df2.dates,df2.times,Date(2024,2,23),Time(10,00,00)),:]
  3.587 ΞΌs (3 allocations: 80 bytes)
DataFrameRow
  Row β”‚ c1     c2     c3     dates       times    
      β”‚ Int64  Int64  Int64  Date        Time
──────┼───────────────────────────────────────────
 7482 β”‚     2      1      1  2024-02-23  10:00:00

julia> @btime df2[findfirst(df2.dates .==Date(2024,2,23) .&& df2.times .== Time(10,00,00)),:]
  15.419 ms (14 allocations: 1.20 MiB)
DataFrameRow
  Row β”‚ c1     c2     c3     dates       times    
      β”‚ Int64  Int64  Int64  Date        Time
──────┼───────────────────────────────────────────
 7482 β”‚     2      1      1  2024-02-23  10:00:00

1 Like

yeah you’re right in terms of efficiency. I was just looking at execution times. These are the results I got.

ffr() from RAM

julia> @benchmark $ffr(df3.dates,df3.times,Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 800 samples with 1 evaluation.
 Range (min … max):  6.179 ms …  6.603 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     6.225 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.239 ms Β± 57.197 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ˆβ–‡β–‚β–†β–†β–β–β–‚β–β–  ▁ ▁▃                                           
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–ˆβ–†β–…β–…β–ƒβ–„β–„β–„β–„β–ƒβ–„β–ƒβ–„β–‚β–ƒβ–„β–β–ƒβ–‚β–‚β–‚β–‚β–ƒβ–‚β–β–ƒβ–β–‚β–β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚ β–„
  6.18 ms        Histogram: frequency by time        6.47 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

ffr() From Arrow.Table


julia> @benchmark $ffr(df4.dates,df4.times,Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 535 samples with 1 evaluation.
 Range (min … max):  9.295 ms …  9.428 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.330 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.332 ms Β± 22.676 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

      β–‚β–ˆβ–                ▃▁  β–‚                                
  β–„β–…β–ƒβ–ƒβ–ˆβ–ˆβ–ˆβ–‡β–…β–†β–ˆβ–‡β–ˆβ–‡β–‡β–ˆβ–‡β–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–„β–…β–…β–†β–„β–„β–„β–…β–‚β–ƒβ–ƒβ–‚β–β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–„β–β–ƒβ–ƒβ–‚β–β–β–‚β–β–‚ β–„
  9.29 ms        Histogram: frequency by time        9.41 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

If the data is sorted, then you should use searchsortedfirst instead of findfirst and it should make a significant difference.

2 Likes

awesome thanks! What would be the appropriate syntax to apply searchsortedfirst to multiple values from multiple columns in a data frame? The docs only include examples with a single array.

Are the rows sorted by df.dates and df.times jointly?

yes, they are in chronological order by both date and time.

So try to benchmark this:

df3[searchsortedfirst(1:nrow(df3),0; 
  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
  (df3.dates[i],df3.times[i]))
), :]
2 Likes

Oh Wow! that’s meaningfully faster - even with the DataFrame out of RAM

Sorted DataFrame in RAM

julia> @benchmark $df3[searchsortedfirst(1:nrow(df3),0; 
         by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
         (df3.dates[i],df3.times[i]))), :]
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.075 ΞΌs …  21.725 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     6.317 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.350 ΞΌs Β± 274.666 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–‚β–…β–‡β–ˆβ–†β–†β–„β–‚β–„β–†β–‡β–†β–†β–…β–‚β–                                        
  β–β–β–‚β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–„β–…β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–β–β– β–ƒ
  6.08 ΞΌs         Histogram: frequency by time        7.05 ΞΌs <

 Memory estimate: 2.34 KiB, allocs estimate: 125.

Sorted DataFrame out of RAM

julia> @benchmark $df4[searchsortedfirst(1:nrow(df4),0; by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
         (df4.dates[i],df4.times[i]))), :]
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  5.975 ΞΌs …  12.100 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     6.350 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.381 ΞΌs Β± 230.581 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

             β–…β–ƒβ–„β–„β–ƒβ–‚β–†β–β–ƒβ–„β–„β–…β–ˆβ–ƒβ–β–                                  
  β–‚β–β–β–‚β–‚β–ƒβ–ƒβ–„β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–…β–…β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚ β–„
  5.98 ΞΌs         Histogram: frequency by time        7.04 ΞΌs <

 Memory estimate: 2.34 KiB, allocs estimate: 125.

Sorry I’m still having trouble understanding the syntax though. Would you mind clarifying how this works?

searchsortedfirst(1:nrow(df4),0;  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) : (df4.dates[i],df4.times[i])))

The examples in the docs show searchsortedfirst(array, value). I’m assuming that 1:nrow(df4) provides a range to search through but i don’t quite understand where the actual array is being input into the function and how the by clause works with the ternary operator here? Any clarification would be much appreciated thanks!

2 Likes

Yes, the 1:nrow(df4) is the range. Because I’m using the by named parameter which operates on the items searched and the item to search for, I’ve had to use this slightly awkward function which for input i=0 returns the item to search for, and for other i returns the item searched against (the df enters the calculation here).
Additionally, Tuples are sorted by lexicographic order by default, which means an equal Tuple will also be the first Tuple >= the desired value.

2 Likes

The function has this fingerprint searchsortedfirst(a, x; by=<transform>, lt=<comparison>, rev=false).
The comparison between the value to be searched for x and the candidate values a_i is done in the following way lt(by(x),by(a_i)) until the first a_i >=x is found.
I’ll try to give some examples to clarify, in the end, Dan’s β€œslightly awkward” use of it (euphemism).

julia> searchsortedfirst(1:10,-3)
1

julia> searchsortedfirst(1:10,-3,by= abs)
3

julia> searchsortedfirst(1:10,0,by=a_i->(a_i==0 ? 3 : sqrt(a_i)))
9

How does this last case work?
You are required to find the first value in 1:10 that is >=0.
But the comparison is done by first transforming the values with by().
therefore the
first comparison lt( by(x), by(a_1)) is 3 <= sqrt(1);
the second comparison lt( by(x), by(a_1)) is 3 <= sqrt(2);
…
the ninth comparison lt( by(x), by(a_1)) is 3 <= sqrt(9); it’s the β€œgood” one.

You might think that 0 is a particular value but it plays no specific role.
See the examples below:

julia> searchsortedfirst(1:10,NaN,by=x->(x===NaN ? 3 : sqrt(x)))
9

julia> searchsortedfirst(1:10,'x',by=x->(x=='x' ? 3 : sqrt(x)))
9

I add an alternative solution that could be useful if you have to do a lot of research.


gdf=groupby(df2,[:dates,:times]) 

d=Dict(zip(Tuple.(keys(gdf)),gdf.starts)) 

d[(Date(2024,2,23),Time(10,00,00))] 
2 Likes

Wow, thanks so much for the detailed explanations. Sorry i’m slow but I think I might still be missing something.

Based on the following

and

my understanding is that

searchsortedfirst(1:nrow(df4),0;  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) : (df4.dates[i],df4.times[i])))

Applies the function in by to both the array and the search term. 1:nrow(df4) will always return df4.dates, df4.times and 0 will always return the Date and Time searched for because the ternary operator is applying to both terms.

If this understanding is correct and given this example here

why does?

searchsortedfirst( -1:-10,-3,by= abs)
1

Shouldn’t this be equivalent to

abs(3) .<= collect(-1:-10)
0-element BitVector

because

isempty(collect(-1:-10))
true

Are empty sets the only exception to this rule? I just want to make sure i’m not missing something thanks because sometimes its hard to find this information with the ?.

Since it is an empty empty collection, it returns the last index (i.e. 0) +1
this one instead


julia> searchsortedfirst( -1:-1:-10,-3,by= abs)
3

I take this opportunity to give a more readable version of the function applied to the case, which is not much less efficient than the slightly less readable one

searchsortedfirst(eachrow(df2), df2[8331846,:], by=r->(r.dates,r.times))
1 Like

The problem, of course, is that 8331846 is unknown when coding the searchsortedfirst :wink:

1 Like

Got it thanks. That clears up alot. Also thanks @rocco_sprmnt21 for this additional method. I’ve included the benchmark for this method just for reference.

for sorted DataFrame in RAM

 @benchmark research(df3, Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  378.150 ms … 435.834 ms  β”Š GC (min … max): 0.00% … 5.96%
 Time  (median):     383.107 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   392.793 ms Β±  16.393 ms  β”Š GC (mean Β± Οƒ):  2.44% Β± 2.75%

      β–ƒβ–ˆ                β–ƒ                                        
  β–‡β–β–‡β–β–ˆβ–ˆβ–β–β–β–β–β–‡β–β–β–β–β–β–β–β–β–β–‡β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–‡β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‡ ▁
  378 ms           Histogram: frequency by time          436 ms <

Memory estimate: 441.24 MiB, allocs estimate: 3367422.

For sorted DataFrame out of Ram

julia> @benchmark research(df4, Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  428.512 ms … 451.603 ms  β”Š GC (min … max): 0.00% … 5.30%
 Time  (median):     433.931 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   437.729 ms Β±   8.690 ms  β”Š GC (mean Β± Οƒ):  1.95% Β± 2.44%

  ▁ ▁  β–ˆ  ▁  ▁     ▁          ▁               ▁    ▁        ▁ ▁  
  β–ˆβ–β–ˆβ–β–β–ˆβ–β–β–ˆβ–β–β–ˆβ–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–ˆβ–β–ˆ ▁
  429 ms           Histogram: frequency by time          452 ms <

 Memory estimate: 441.25 MiB, allocs estimate: 3367429.

For future reference, of the methods described, the fastest method to search multiple columns for multiple values for a DataFrame in RAM is

For a DataFrame out of RAM

$df[findall(==(val1), df.col1),:][findfirst(==(val2),df[findall(==(val1), df.col1),:].val2),:]

appears to be slightly faster. The most memory efficient method is

But if the dataframe is presorted than

Is faster and more efficient for DataFrames both in and out of RAM. Thanks again for all your help!