Using findfirst with for multiple values and columns in a Dataframe?

phantom · March 21, 2024, 9:35pm

Is there a way to extend the findfirst(==(val1), df.col1) for more than one value over more than column in a dataframe. Something like

findfirst(==([ val1, val2]) , [df.col1, df.col2])

I tried

findfirst(row -> (row[1] == val1 && row[2] == val2), eachrow(select(df, [:col1, :col2])))

but that seems to cause a significant slowdown

@benchmark findfirst(row -> (row[1] == Date(2024,2,23) && row[2] == Time(10,00,00)), eachrow(select(df1, [:date, :time])))
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  602.970 ms … 658.486 ms  ┊ GC (min … max): 0.00% … 6.52%
 Time  (median):     608.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   626.526 ms ±  25.573 ms  ┊ GC (mean ± σ):  2.97% ± 3.38%

  █▁  ▁ ▁                                          ▁▁        ▁▁  
  ██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁██ ▁
  603 ms           Histogram: frequency by time          658 ms <

 Memory estimate: 284.43 MiB, allocs estimate: 14908879.

as compared to creating a subset with findall on val1 then findfirst on the subset with val2 e.g.

@benchmark df1[findall(==(Date(2024,2,23)), df1.date),:][findfirst(==(Time(10,00,00)),df1[findall(==(Date(2024,2,23)), df1.date),:].time),:]
BenchmarkTools.Trial: 762 samples with 1 evaluation.
 Range (min … max):  5.964 ms … 22.915 ms  ┊ GC (min … max): 0.00% … 72.43%
 Time  (median):     6.324 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.562 ms ±  1.695 ms  ┊ GC (mean ± σ):  3.08% ±  8.16%

   █▄▁                                                        
  ████▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▄ ▇
  5.96 ms      Histogram: log(frequency) by time     19.1 ms <

 Memory estimate: 7.27 MiB, allocs estimate: 2008.

or masking directly

df[ df.col1 .== val1 .&& df.col2 .== val2, :][1,:]

julia> @benchmark  df1[ df1.date .== Date(2024,2,23) .&& df1.time .== Time(10,00,00), :][1,:]
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min … max):  26.848 ms …  27.793 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.971 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.008 ms ± 136.690 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃█▇▂▃▄                                                  
  ▃▆▅▇▆███████▆▅▆▃▄▆▆▆▃▅▃▁▄▃▃▃▃▄▁▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▃ ▃
  26.8 ms         Histogram: frequency by time         27.6 ms <

 Memory estimate: 461.77 KiB, allocs estimate: 36.

I checked the documentation but it did not include anything on it so just wondering if there was a more direct approach using findfirst for more multiple values over multiple arrays.

nilshg · March 21, 2024, 9:43pm

findfirst(df.col1 .== val1 .&& df.col2 .== val2)

phantom · March 21, 2024, 10:24pm

awesome thanks! Just curious why is subsetting with findall on val1 first meaningfully faster?

 @benchmark df1[findfirst(df1.date .==(Date(2024,2,23)) .&& df1.time .==(Time(10,00,00))),:]
BenchmarkTools.Trial: 186 samples with 1 evaluation.
 Range (min … max):  26.772 ms … 27.584 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.872 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.891 ms ± 96.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▄▇▅▂▆▃▄█ ▃▂▁▁                                          
  ▃▃▃▆▆▇█████████████▇█▅▃▃▃▁▄▆▁▁▃▁▃▁▁▁▁▁▁▃▃▃▃▁▃▁▁▁▁▁▁▃▁▁▃▁▃▁▃ ▃
  26.8 ms         Histogram: frequency by time        27.2 ms <

 Memory estimate: 460.20 KiB, allocs estimate: 12.

rocco_sprmnt21 · March 21, 2024, 10:46pm

findfirst(==((v1,v2)), tuple.(df.c1,df.c2))

rocco_sprmnt21 · March 22, 2024, 8:08am

What expression are you referring to?
Can you post a data set to test on?

I’ve tried different ways to check times and allocations and something like the following seems the most efficient (although I don’t understand why it has even a small allocation):


function ffr(c1,c2,f,l)
    i=1
    for (fr,lr) in zip(c1,c2)
        fr == f && lr == l && return i
        i+=1
    end
end

when compared to expressions like these


findfirst(==((5,5)), tuple.(df.c1,df.c2))
findfirst(==((5,5)), collect(zip(df.c1,df.c2)))

 findfirst(t->t[1]==5 &&t[2]==5, tuple.(df.c1,df.c2))

 findfirst(r->(r.c1,r.c2)==(5,5), eachrow(df))
 findfirst(r->r.c1==5 && r.c2==5, eachrow(df))

 findfirst(df.c1 .== 5 .&& df.c2 .== 5)

I used this df for testing

df=DataFrame(c1=rand(1:4,10^3),c2=rand(1:4,10^3),c3=rand(1:4,10^3))
push!(df,(5,5,5))

phantom · March 22, 2024, 9:55pm

I think the main speed difference comes from the fact that the data was read from an .arrow file and therefore not stored in memory. Also I’m probably not handling interpolation correctly? Having the dataset in chronological order also makes a difference but not as much.

Accounting for size and different data types did not make much difference.

df2=DataFrame(c1=rand(1:4,10^7),c2=rand(1:4,10^7),c3=rand(1:4,10^7)); 
df2.dates = rand(Date(2023,10,1):today(),nrow(df2));
df2.times = rand(Time(0,0):Hour(1):Time(23,59), nrow(df2));

The times for the methods from fastest to slowest I got were

@benchmark $df2[findfirst(df2.dates .==Date(2024,2,23) .&& df2.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1569 samples with 1 evaluation.
 Range (min … max):  3.027 ms …  26.248 ms  ┊ GC (min … max): 0.00% … 87.62%
 Time  (median):     3.144 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.180 ms ± 762.015 μs  ┊ GC (mean ± σ):  0.85% ±  3.09%

               ▇█▄▅▄▂                                          
  ▂▂▂▃▂▂▁▂▁▁▂▂▄███████▇▆▅▄▃▃▃▂▂▂▂▂▂▁▂▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▁▂▁▁▁▁▂▁▂ ▃
  3.03 ms         Histogram: frequency by time        3.44 ms <
Memory estimate: 1.20 MiB, allocs estimate: 12.

@benchmark $df2[findall(==(Date(2024,2,23)), df2.dates),:][findfirst(==(Time(10,00,00)),df2[findall(==(Date(2024,2,23)), df2.dates),:].times),:]
BenchmarkTools.Trial: 1128 samples with 1 evaluation.
 Range (min … max):  4.144 ms … 32.142 ms  ┊ GC (min … max): 0.00% … 86.06%
 Time  (median):     4.339 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.428 ms ±  1.329 ms  ┊ GC (mean ± σ):  1.73% ±  4.89%

              ▄▅▇██▇▇▆▆▄▃▂▂                                  ▁
  ▆▆▆▅▅▆▅▄▅▅▇██████████████▇▆▇▄▆▅▄▄▆▅▁▁▁▄▁▄▁▄▁▄▁▅▁▁▁▅▁▁▁▁▁▁▅ █
  4.14 ms      Histogram: log(frequency) by time     4.82 ms <
Memory estimate: 2.58 MiB, allocs estimate: 50.

@benchmark $df2[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df2.dates,df2.times)),:]
BenchmarkTools.Trial: 350 samples with 1 evaluation.
 Range (min … max):  11.626 ms … 25.382 ms  ┊ GC (min … max):  0.00% … 44.17%
 Time  (median):     12.023 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.319 ms ±  4.501 ms  ┊ GC (mean ± σ):  15.44% ± 19.05%

  █▆                                                           
  ██▇▄▄▃▃▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▃▅▄▄▃▃▃▂ ▂
  11.6 ms         Histogram: frequency by time        24.3 ms <
Memory estimate: 152.59 MiB, allocs estimate: 6

Having the data in chronological order accounted for some small difference in time but not much. As again nilshg’s method is the fastest.

df3 = sort(df2, [:dates, :times])

@benchmark $df3[findfirst(df3.dates .==Date(2024,2,23) .&& df3.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 1548 samples with 1 evaluation.
 Range (min … max):  3.070 ms … 39.209 ms  ┊ GC (min … max): 0.00% … 91.70%
 Time  (median):     3.176 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.222 ms ±  1.028 ms  ┊ GC (mean ± σ):  1.08% ±  3.16%

                  ▄█▃▁ ▂▁                                     
  ▂▂▂▂▂▂▂▂▂▁▂▁▁▂▃▆███████▇▆▄▄▄▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▁▂▂ ▃
  3.07 ms        Histogram: frequency by time        3.37 ms <

 Memory estimate: 1.20 MiB, allocs estimate: 12.

@benchmark $df3[findall(==(Date(2024,2,23)), df3.dates),:][findfirst(==(Time(10,00,00)),df3[findall(==(Date(2024,2,23)), df3.dates),:].times),:]
BenchmarkTools.Trial: 1214 samples with 1 evaluation.
 Range (min … max):  3.959 ms … 29.343 ms  ┊ GC (min … max): 0.00% … 85.85%
 Time  (median):     4.039 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.113 ms ±  1.132 ms  ┊ GC (mean ± σ):  1.34% ±  4.19%

       █▇▃▁                                                   
  ▂▁▁▃▇████▆▄▃▃▃▃▂▂▂▂▁▁▁▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂ ▃
  3.96 ms        Histogram: frequency by time        4.64 ms <
Memory estimate: 2.58 MiB, allocs estimate: 51.

@benchmark $df3[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df3.dates,df3.times)),:]
BenchmarkTools.Trial: 243 samples with 1 evaluation.
 Range (min … max):  17.869 ms … 35.774 ms  ┊ GC (min … max):  0.00% … 44.32%
 Time  (median):     18.219 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   20.613 ms ±  5.339 ms  ┊ GC (mean ± σ):  10.51% ± 15.80%

  █▆▃▂▁                                               ▂▁       
  █████▆██▅▄▁▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▁▁▁▁▁▅██▆▅▁▄▄ ▅

Memory estimate: 152.59 MiB, allocs estimate: 6.

Majority of the difference came from reading the data from a .arrow that was not brought into RAM.

Arrow.write("filepath/datatest.arrow", df3)
df4 = DataFrame(Arrow.Table("/filepath/datatest.arrow")

@benchmark $df4[findall(==(Date(2024,2,23)), df4.dates),:][findfirst(==(Time(10,00,00)),df4[findall(==(Date(2024,2,23)), df4.dates),:].times),:]
BenchmarkTools.Trial: 1177 samples with 1 evaluation.
 Range (min … max):  3.996 ms … 46.623 ms  ┊ GC (min … max): 0.00% … 91.01%
 Time  (median):     4.103 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.244 ms ±  2.093 ms  ┊ GC (mean ± σ):  2.82% ±  5.19%

           ▂▇██▇▆▆▄▃▃▂   ▁                                    
  ▄▁▁▁▄▄▄▄▆████████████▇██▇█▇▅▇▁▄▆▁▅▁▅▄▆▁▄▄▅▄▅▆▁▁▅▄▆▁▁▁▅▅▁▆▆ █
  4 ms         Histogram: log(frequency) by time     4.47 ms <

Memory estimate: 2.58 MiB, allocs estimate: 51.

julia> @benchmark $df4[findfirst(df4.dates .==Date(2024,2,23) .&& df4.times .== Time(10,00,00)),:]
BenchmarkTools.Trial: 555 samples with 1 evaluation.
 Range (min … max):  8.882 ms …  9.872 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.988 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.001 ms ± 65.966 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▆█▆▄ ▁                                         
  ▂▂▁▁▁▁▁▁▁▁▁▃███████▆▇▄▄▃▄▃▃▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▂ ▃
  8.88 ms        Histogram: frequency by time        9.27 ms <

 Memory estimate: 1.20 MiB, allocs estimate: 12.

julia> @benchmark $df4[findfirst(==((Date(2024,2,23),Time(10,00,00))), tuple.(df4.dates,df4.times)),:]
BenchmarkTools.Trial: 218 samples with 1 evaluation.
 Range (min … max):  20.146 ms … 35.459 ms  ┊ GC (min … max): 0.00% … 36.70%
 Time  (median):     20.695 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.958 ms ±  4.874 ms  ┊ GC (mean ± σ):  9.31% ± 14.25%

  ▄█▆▂▂▁                                              ▂▁       
  ██████▆▇▆▆▄▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁██▇▄▆▆▄ ▆
  20.1 ms      Histogram: log(frequency) by time      35.2 ms <

 Memory estimate: 152.59 MiB, allocs estimate: 6.

There’s a similar but smaller slowdown with the ffr() function when reading an arrow file. I’m not sure why this is or is it just because i’m handling interpolation incorrectly?

rocco_sprmnt21 · March 22, 2024, 10:09pm

ffr seems much more efficient or am I wrong?

julia> function ffr(c1,c2,f,l)
           i=1
           for (fr,lr) in zip(c1,c2)
               fr == f && lr == l && return i
               i+=1
           end
       end
ffr (generic function with 1 method)

julia> @btime df2[ffr(df2.dates,df2.times,Date(2024,2,23),Time(10,00,00)),:]
  3.587 μs (3 allocations: 80 bytes)
DataFrameRow
  Row │ c1     c2     c3     dates       times    
      │ Int64  Int64  Int64  Date        Time
──────┼───────────────────────────────────────────
 7482 │     2      1      1  2024-02-23  10:00:00

julia> @btime df2[findfirst(df2.dates .==Date(2024,2,23) .&& df2.times .== Time(10,00,00)),:]
  15.419 ms (14 allocations: 1.20 MiB)
DataFrameRow
  Row │ c1     c2     c3     dates       times    
      │ Int64  Int64  Int64  Date        Time
──────┼───────────────────────────────────────────
 7482 │     2      1      1  2024-02-23  10:00:00

phantom · March 22, 2024, 10:34pm

yeah you’re right in terms of efficiency. I was just looking at execution times. These are the results I got.

ffr() from RAM

julia> @benchmark $ffr(df3.dates,df3.times,Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 800 samples with 1 evaluation.
 Range (min … max):  6.179 ms …  6.603 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.225 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.239 ms ± 57.197 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇▂▆▆▁▁▂▁▁  ▁ ▁▃                                           
  ████████████▇█▇██▆▅▅▃▄▄▄▄▃▄▃▄▂▃▄▁▃▂▂▂▂▃▂▁▃▁▂▁▂▁▂▂▂▂▂▂▂▁▁▂▂ ▄
  6.18 ms        Histogram: frequency by time        6.47 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

ffr() From Arrow.Table


julia> @benchmark $ffr(df4.dates,df4.times,Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 535 samples with 1 evaluation.
 Range (min … max):  9.295 ms …  9.428 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.330 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.332 ms ± 22.676 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂█▁                ▃▁  ▂                                
  ▄▅▃▃███▇▅▆█▇█▇▇█▇█▇██▇██████▅▄▅▅▆▄▄▄▅▂▃▃▂▁▃▃▂▂▂▂▄▁▃▃▂▁▁▂▁▂ ▄
  9.29 ms        Histogram: frequency by time        9.41 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

Dan · March 22, 2024, 10:47pm

If the data is sorted, then you should use searchsortedfirst instead of findfirst and it should make a significant difference.

phantom · March 22, 2024, 11:00pm

awesome thanks! What would be the appropriate syntax to apply searchsortedfirst to multiple values from multiple columns in a data frame? The docs only include examples with a single array.

Dan · March 22, 2024, 11:07pm

Are the rows sorted by df.dates and df.times jointly?

phantom · March 23, 2024, 2:49am

yes, they are in chronological order by both date and time.

Dan · March 23, 2024, 3:31am

So try to benchmark this:

df3[searchsortedfirst(1:nrow(df3),0; 
  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
  (df3.dates[i],df3.times[i]))
), :]

phantom · March 23, 2024, 4:57am

Oh Wow! that’s meaningfully faster - even with the DataFrame out of RAM

Sorted DataFrame in RAM

julia> @benchmark $df3[searchsortedfirst(1:nrow(df3),0; 
         by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
         (df3.dates[i],df3.times[i]))), :]
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.075 μs …  21.725 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.317 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.350 μs ± 274.666 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▆▄▂▄▆▇▆▆▅▂▁                                        
  ▁▁▂▄▇█████████████████▆▆▄▅▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁ ▃
  6.08 μs         Histogram: frequency by time        7.05 μs <

 Memory estimate: 2.34 KiB, allocs estimate: 125.

Sorted DataFrame out of RAM

julia> @benchmark $df4[searchsortedfirst(1:nrow(df4),0; by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) :
         (df4.dates[i],df4.times[i]))), :]
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  5.975 μs …  12.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.350 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.381 μs ± 230.581 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▅▃▄▄▃▂▆▁▃▄▄▅█▃▁▁                                  
  ▂▁▁▂▂▃▃▄▄▆█████████████████▇▇▇▅▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▂▃▃▃▃▃▃▃▃▃▂▂▂ ▄
  5.98 μs         Histogram: frequency by time        7.04 μs <

 Memory estimate: 2.34 KiB, allocs estimate: 125.

Sorry I’m still having trouble understanding the syntax though. Would you mind clarifying how this works?

searchsortedfirst(1:nrow(df4),0;  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) : (df4.dates[i],df4.times[i])))

The examples in the docs show searchsortedfirst(array, value). I’m assuming that 1:nrow(df4) provides a range to search through but i don’t quite understand where the actual array is being input into the function and how the by clause works with the ternary operator here? Any clarification would be much appreciated thanks!

Dan · March 23, 2024, 7:48am

Yes, the 1:nrow(df4) is the range. Because I’m using the by named parameter which operates on the items searched and the item to search for, I’ve had to use this slightly awkward function which for input i=0 returns the item to search for, and for other i returns the item searched against (the df enters the calculation here).
Additionally, Tuples are sorted by lexicographic order by default, which means an equal Tuple will also be the first Tuple >= the desired value.

rocco_sprmnt21 · March 23, 2024, 8:29pm

The function has this fingerprint searchsortedfirst(a, x; by=<transform>, lt=<comparison>, rev=false).
The comparison between the value to be searched for x and the candidate values a_i is done in the following way lt(by(x),by(a_i)) until the first a_i >=x is found.
I’ll try to give some examples to clarify, in the end, Dan’s “slightly awkward” use of it (euphemism).

julia> searchsortedfirst(1:10,-3)
1

julia> searchsortedfirst(1:10,-3,by= abs)
3

julia> searchsortedfirst(1:10,0,by=a_i->(a_i==0 ? 3 : sqrt(a_i)))
9

How does this last case work?
You are required to find the first value in 1:10 that is >=0.
But the comparison is done by first transforming the values with by().
therefore the
first comparison lt( by(x), by(a_1)) is 3 <= sqrt(1);
the second comparison lt( by(x), by(a_1)) is 3 <= sqrt(2);
…
the ninth comparison lt( by(x), by(a_1)) is 3 <= sqrt(9); it’s the “good” one.

You might think that 0 is a particular value but it plays no specific role.
See the examples below:

julia> searchsortedfirst(1:10,NaN,by=x->(x===NaN ? 3 : sqrt(x)))
9

julia> searchsortedfirst(1:10,'x',by=x->(x=='x' ? 3 : sqrt(x)))
9

I add an alternative solution that could be useful if you have to do a lot of research.


gdf=groupby(df2,[:dates,:times]) 

d=Dict(zip(Tuple.(keys(gdf)),gdf.starts)) 

d[(Date(2024,2,23),Time(10,00,00))]

phantom · March 23, 2024, 10:31pm

Wow, thanks so much for the detailed explanations. Sorry i’m slow but I think I might still be missing something.

Based on the following

and

my understanding is that

searchsortedfirst(1:nrow(df4),0;  by=i->(i==0 ? (Date(2024,2,23),Time(10,00,00)) : (df4.dates[i],df4.times[i])))

Applies the function in by to both the array and the search term. 1:nrow(df4) will always return df4.dates, df4.times and 0 will always return the Date and Time searched for because the ternary operator is applying to both terms.

If this understanding is correct and given this example here

why does?

searchsortedfirst( -1:-10,-3,by= abs)
1

Shouldn’t this be equivalent to

abs(3) .<= collect(-1:-10)
0-element BitVector

because

isempty(collect(-1:-10))
true

Are empty sets the only exception to this rule? I just want to make sure i’m not missing something thanks because sometimes its hard to find this information with the ?.

rocco_sprmnt21 · March 23, 2024, 10:39pm

Since it is an empty empty collection, it returns the last index (i.e. 0) +1
this one instead


julia> searchsortedfirst( -1:-1:-10,-3,by= abs)
3

I take this opportunity to give a more readable version of the function applied to the case, which is not much less efficient than the slightly less readable one

searchsortedfirst(eachrow(df2), df2[8331846,:], by=r->(r.dates,r.times))

Dan · March 23, 2024, 10:54pm

The problem, of course, is that 8331846 is unknown when coding the searchsortedfirst

phantom · March 24, 2024, 2:44am

Got it thanks. That clears up alot. Also thanks @rocco_sprmnt21 for this additional method. I’ve included the benchmark for this method just for reference.

for sorted DataFrame in RAM

 @benchmark research(df3, Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  378.150 ms … 435.834 ms  ┊ GC (min … max): 0.00% … 5.96%
 Time  (median):     383.107 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   392.793 ms ±  16.393 ms  ┊ GC (mean ± σ):  2.44% ± 2.75%

      ▃█                ▃                                        
  ▇▁▇▁██▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▇█▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  378 ms           Histogram: frequency by time          436 ms <

Memory estimate: 441.24 MiB, allocs estimate: 3367422.

For sorted DataFrame out of Ram

julia> @benchmark research(df4, Date(2024,2,23),Time(10,00,00))
BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  428.512 ms … 451.603 ms  ┊ GC (min … max): 0.00% … 5.30%
 Time  (median):     433.931 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   437.729 ms ±   8.690 ms  ┊ GC (mean ± σ):  1.95% ± 2.44%

  ▁ ▁  █  ▁  ▁     ▁          ▁               ▁    ▁        ▁ ▁  
  █▁█▁▁█▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁█▁█ ▁
  429 ms           Histogram: frequency by time          452 ms <

 Memory estimate: 441.25 MiB, allocs estimate: 3367429.

For future reference, of the methods described, the fastest method to search multiple columns for multiple values for a DataFrame in RAM is

For a DataFrame out of RAM

$df[findall(==(val1), df.col1),:][findfirst(==(val2),df[findall(==(val1), df.col1),:].val2),:]

appears to be slightly faster. The most memory efficient method is

rocco_sprmnt21:

julia> function ffr(c1,c2,f,l)
           i=1
           for (fr,lr) in zip(c1,c2)
               fr == f && lr == l && return i
               i+=1
           end
       end
ffr (generic function with 1 method)

But if the dataframe is presorted than

Is faster and more efficient for DataFrames both in and out of RAM. Thanks again for all your help!

Topic		Replies	Views
Findall/Findfirst across multiple columns in dataframe New to Julia	15	2345	March 26, 2021
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24001	April 27, 2024
Learning to benchmark and find the best function to select a subset of a dataframe New to Julia question	20	444	December 16, 2022
Lighter alternatives to a dataframe when you just want to run a few queries on a vector of data structs General Usage	31	961	August 12, 2023
Who does "better" than DataFrames? Performance dataframes	43	2015	April 6, 2023

Using findfirst with for multiple values and columns in a Dataframe?

Related topics