[Nerdsnipe warning] Speed up short vector comparisons to beat R

I’d like to seize the opportunity to mention SmallCollections.jl. It provides allocation-free set and (variable-size) vector types. There are no new ideas in the end, but the package lets you write rather straightforward code that is fairly efficient:

using SmallCollections, Chairmarks

function count_matches(selection, q, p = length(q))
    v = zeros(SmallVector{8,Int}, p+1)   # works for p <= 7
    for s in selection
        i = length(intersect(q, s))+1
        @inbounds w = setindex(zero(v), 1, i)
        v += w
    end
    v
end

function all_matches(P, p, selection)
    map(q -> count_matches(selection, q, p), Subsets(P, p))
end

P = 22
p = 5
Q = length(Subsets(P, p))
all = collect(Subsets(P, p))   # only needed to generate the selection
selection = unique(rand(all) for _ in 1:Q÷5);

On my not very fancy laptop I get:

julia> @b all_matches($P, $p, $selection)
92.456 ms (2 allocs: 1.808 MiB)

I didn’t try multi-threading or other tricks. My point is that with very little effort one can already get quite a bit of mileage.
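
One detail that may be worth spelling out: SmallVector is immutable, and the setindex used above is the non-mutating one, returning a new SmallVector rather than writing in place, which is how count_matches stays allocation-free. A minimal sketch mirroring the calls above (the commented values are what I would expect, not verified output):

v = zeros(SmallVector{8,Int}, 6)   # capacity 8, length 6, all zeros
w = setindex(v, 1, 3)              # new SmallVector with a 1 at index 3; v itself is unchanged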

That is really interesting; it runs in ~50ms on my machine (compared to the ~450ms I got with my initial StaticArrays attempt). Definitely worth looking into!

And using OhMyThreads I’ve managed to thread it like this:

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a)
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
14.127 ms (80 allocs: 5.350 MiB)

The speedup doesn’t scale linearly with the number of threads, but it might be worth sacrificing a bit of speed for conciseness and clarity of code, and some flexibility.

This way you could tell your colleague that he can rewrite it in a few seconds, even. :smile:

using StatsBase: countmap   # countmap lives in StatsBase

acs = [mapreduce(b -> 2^(b-1), +, e) for e in actual_selections]        # bitmask of each actual selection
als = [mapreduce(b -> 2^(b-1), +, e) for e in all_selections_iterator]  # bitmask of each possible selection
[countmap(count_ones.(acs .& c)) for c ∈ als]
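
To spell out the encoding with a tiny made-up example: each selection becomes an integer bitmask with one bit per item, and count_ones of the bitwise AND gives the size of the intersection.

s1 = [1, 4, 7]                       # one selection
m1 = mapreduce(b -> 2^(b-1), +, s1)  # 0b1001001 == 73
s2 = [1, 2, 7]                       # another selection
m2 = mapreduce(b -> 2^(b-1), +, s2)  # 0b1000011 == 67
count_ones(m1 & m2)                  # 2 == length(intersect(s1, s2))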

Probably just because there is some load imbalance, which can be alleviated by increasing the number of chunks:

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a; nchunks=2*Threads.nthreads()) # this is the default
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
10.296 ms (177 allocs: 5.517 MiB)

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a; nchunks=10*Threads.nthreads())
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
8.612 ms (818 allocs: 4.182 MiB)

I wanted to share an observation, probably of no use, but it could be an interesting tidbit.
Since the sum of the values per row is constant, equal to n = length(actual_selections), only 5 values per row need to be calculated, without losing any information.
This also appears to bring a small gain in execution time.

julia> [countmap(count_ones.(acs.& c)) for c ∈ als][1:5]
5-element Vector{Dict{Int64, Int64}}:
 Dict(0 => 1140, 4 => 11, 2 => 1239, 3 => 276, 1 => 2125)
 Dict(0 => 1127, 4 => 9, 2 => 1266, 3 => 275, 1 => 2114)
 Dict(0 => 1143, 4 => 18, 2 => 1239, 3 => 270, 1 => 2121)
 Dict(0 => 1151, 4 => 13, 2 => 1239, 3 => 277, 1 => 2111)
 Dict(0 => 1128, 4 => 17, 5 => 1, 2 => 1269, 3 => 253, 1 => 2123)

julia> [countmap(filter(!=(1),count_ones.(acs.& c))) for c ∈ als][1:5]       
5-element Vector{Dict{Int64, Int64}}:
 Dict(0 => 1140, 4 => 11, 2 => 1239, 3 => 276)
 Dict(0 => 1127, 4 => 9, 2 => 1266, 3 => 275)
 Dict(0 => 1143, 4 => 18, 2 => 1239, 3 => 270)
 Dict(0 => 1151, 4 => 13, 2 => 1239, 3 => 277)
 Dict(0 => 1128, 4 => 17, 5 => 1, 2 => 1269, 3 => 253)

julia> @btime [countmap(count_ones.(acs.& c)) for c ∈ als];
  742.389 ms (394889 allocations: 1.90 GiB)

julia> @btime [countmap(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als];   
  448.061 ms (421223 allocations: 1.48 GiB)

julia> function cm(rowsc)
           d=zeros(Int,6)
           for r in rowsc
               d[r+1]+=1
           end
           d
       end
cm (generic function with 1 method)


julia> @btime [cm(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als];
  299.383 ms (263217 allocations: 971.41 MiB)


julia> [cm(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als][1:5]
5-element Vector{Vector{Int64}}:
 [1140, 0, 1239, 276, 11, 0]  # the 0 in position 2 is the hidden info: it equals n - sum(others)
 [1127, 0, 1266, 275, 9, 0]   # a 0 in position 6 means that no selection matches all 5 numbers
 [1143, 0, 1239, 270, 18, 0]
 [1151, 0, 1239, 277, 13, 0]
 [1128, 0, 1269, 253, 17, 1]
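
For completeness, a sketch of how the dropped count of single matches can be recovered afterwards from the row-sum invariant (using the names defined above):

n = length(acs)                                        # == length(actual_selections)
row = cm(filter!(!=(1), count_ones.(acs .& als[1])))   # counts with the 1-matches omitted
row[2] = n - sum(row)                                  # each row sums to n, so the gap fills itself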

This version is much less efficient than the commented-out one (the SmallVector version left in the comments below).
What is the main reason?
Do the conditional expressions prevent multiple “simultaneous” operations on chunks of data, i.e. SIMD (I can’t be more precise, but I hope you understand what I mean)?

This version does more comparisons but fewer indexing and summing operations.
I don’t know the relative cost of these machine-level operations, but I would have thought that, in principle, ifs plus a single increment beat indexing and many additions. Or not?

function count_matches(selection, q, p = length(q))
    #v = zeros(SmallVector{8,Int}, p+1)  # works for p <= 7
    a, b, c, d, e, f = zeros(Int, 6)
    for s in selection
        i = length(intersect(q, s)) + 1
        i==2 ? b+=1 :
        i==3 ? c+=1 :
        i==1 ? a+=1 :
        i==4 ? d+=1 :
        i==5 ? e+=1 :
        i==6 ? f+=1 :
        error("out of range")
        # @inbounds w = setindex(zero(v), 1, i)
        # v += w
    end
    a, b, c, d, e, f
end

I don’t know if this applies in your situation, but when it comes to vectorizing operations, ifelse(a, b, c) is often better than if a; b else c end or a ? b : c.
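
As a hedged sketch of what that might look like here (not benchmarked; intersect may well dominate the cost anyway), the chain of ternaries could be replaced by unconditional ifelse updates:

function count_matches_ifelse(selection, q)
    a = b = c = d = e = f = 0
    for s in selection
        i = length(intersect(q, s))    # number of matches, 0..5 (assuming p == 5 as above)
        a += ifelse(i == 0, 1, 0)      # every counter is updated on every iteration,
        b += ifelse(i == 1, 1, 0)      # but only one of them gets the 1
        c += ifelse(i == 2, 1, 0)
        d += ifelse(i == 3, 1, 0)
        e += ifelse(i == 4, 1, 0)
        f += ifelse(i == 5, 1, 0)
    end
    a, b, c, d, e, f
end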

Well done. I have done a fair bit of taking code from R to Julia, and I have never seen 400x. A lot of 5x.

But R can also be fast. I am managing an R project that takes days to run on a large AWS instance; in a codebase of ~10K lines, 95% of the time is spent in 3 lines of R, and the rest is spent in a few lines of data.table. I tried those 3 lines in Julia, and R was faster! data.table was much faster than Julia’s DataFrames. What happens in those 3 lines is linear algebra with small vectors (~20 elements).

I would love to maintain that in Julia, but my boss would laugh at me if I asked.

js

Did you try the “Why is R faster than Julia?” approach? :smiley:

Never heard of that! Do elaborate.

This is admittedly a few Julia versions back, but I really wanted Julia to be faster and tried everything I could think of.

It is a joke. Start a forum thread with that title and post an MWE of where that happens. Always works.

(Realistically speaking, linear algebra with small vectors may be faster in R, but normally that is solved by using static arrays, or by switching to MKL, or sometimes by restructuring the problem in a way not possible in R, fusing many sequential vector operations into a loop, sometimes with big gains.)
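
(As a hypothetical illustration of the static-arrays point, not of the actual 3 lines in question: with ~20-element vectors, SVector keeps the data on the stack, and the sequential operations can be fused into one loop without temporaries.)

using LinearAlgebra, StaticArrays

function score(xs::Vector{SVector{20,Float64}}, w::SVector{20,Float64})
    s = 0.0
    for x in xs
        s += dot(w, x)    # small-vector linear algebra, no heap allocations
    end
    s
end

xs = [rand(SVector{20,Float64}) for _ in 1:1000]  # made-up data
w = rand(SVector{20,Float64})
score(xs, w)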

I know!

I always find it funny to see how people react when jokes are taken seriously.

Yes, I have seen and done that (well, I can’t use MKL). The downside is that converting code that runs correctly in 3 lines of R, but needs something more complicated in Julia to be faster, may not be justifiable. I like what Knuth said about premature optimization being the root of all evil.

If three lines of code account for 95% of days of runtime, it sounds like anything but premature to try to optimize those.

As the others say, make a thread with an MWE and we’ll see. Incidentally, the R code I was working off of was also using data.table; it just wasn’t the right tool for the job. The h2o benchmarks suggest that DataFrames.jl is very competitive with data.table, so I’d be surprised to see data.table being “much faster” than DataFrames.

Yes, in a lot of cases yes, and I have ported several R projects to Julia.

But it will depend on the length of the job and the amount of code around it. For this project, my employer would rather rent the 96-core AWS instance running it for several more hours each month than try to save that money by rewriting. The cost of the rewrite would exceed the AWS savings.

Yes, I am guilty of exaggeration. When I did the comparison a few versions back it was perhaps a 30% time difference, and since I know data.table much better than DataFrames, that is probably just down to my ignorance.

For me, the biggest problem is how different the data.table syntax is from DataFrames, which makes it very time-consuming and risky to port R code that heavily uses data.table to Julia.