[Nerdsnipe warning] Speed up short vector comparisons to beat R

I’d like to seize the opportunity to mention SmallCollections.jl. It provides allocation-free set and (variable-size) vector types. There are no new ideas in the end, but the package lets you write rather straightforward code that is fairly efficient:

using SmallCollections, Chairmarks

function count_matches(selection, q, p = length(q))
    v = zeros(SmallVector{8,Int}, p+1)   # works for p <= 7
    for s in selection
        i = length(intersect(q, s))+1
        @inbounds w = setindex(zero(v), 1, i)
        v += w
    end
    v
end

function all_matches(P, p, selection)
    map(q -> count_matches(selection, q, p), Subsets(P, p))
end

P = 22
p = 5
Q = length(Subsets(P, p))
all = collect(Subsets(P, p))   # only needed to generate the selection
selection = unique(rand(all) for _ in 1:Q÷5);

On my not very fancy laptop I get:

julia> @b all_matches($P, $p, $selection)
92.456 ms (2 allocs: 1.808 MiB)

I didn’t try multi-threading or other tricks. My point is that with very little effort one can already get quite a bit of mileage.
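
One detail that may be worth spelling out: SmallVector is immutable, and the setindex used above is the non-mutating one, returning a new SmallVector rather than writing in place, which is how count_matches stays allocation-free. A minimal sketch mirroring the calls above (the commented values are what I would expect, not verified output):

v = zeros(SmallVector{8,Int}, 6)   # capacity 8, length 6, all zeros
w = setindex(v, 1, 3)              # new SmallVector with a 1 at index 3; v itself is unchanged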

That is really interesting; it runs in ~50ms on my machine (compared to the ~450ms I got with my initial StaticArrays attempt). Definitely worth looking into!

And using OhMyThreads I’ve managed to thread it like this:

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a)
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
14.127 ms (80 allocs: 5.350 MiB)

The speedup doesn’t scale linearly with the number of threads, but it might be worth sacrificing a bit of speed for conciseness and clarity of code, and some flexibility.

This way you could tell your colleague that he can rewrite it in a few seconds, even. :smile:

using StatsBase: countmap   # countmap lives in StatsBase

acs = [mapreduce(b -> 2^(b-1), +, e) for e in actual_selections]        # bitmask of each actual selection
als = [mapreduce(b -> 2^(b-1), +, e) for e in all_selections_iterator]  # bitmask of each possible selection
[countmap(count_ones.(acs .& c)) for c ∈ als]
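
To spell out the encoding with a tiny made-up example: each selection becomes an integer bitmask with one bit per item, and count_ones of the bitwise AND gives the size of the intersection.

s1 = [1, 4, 7]                       # one selection
m1 = mapreduce(b -> 2^(b-1), +, s1)  # 0b1001001 == 73
s2 = [1, 2, 7]                       # another selection
m2 = mapreduce(b -> 2^(b-1), +, s2)  # 0b1000011 == 67
count_ones(m1 & m2)                  # 2 == length(intersect(s1, s2))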

Probably just because there is some load imbalance, which can be alleviated by increasing the number of chunks:

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a; nchunks=2*Threads.nthreads()) # this is the default
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
10.296 ms (177 allocs: 5.517 MiB)

julia> function all_matchesT(P, p, selection)
           a = collect(Subsets(P, p))
           tmap(q -> count_matches(selection, q, p), a; nchunks=10*Threads.nthreads())
       end
all_matchesT (generic function with 1 method)

julia> @b all_matchesT($P, $p, $selection)
8.612 ms (818 allocs: 4.182 MiB)

I wanted to share an observation, probably of no use, but it could be an interesting tidbit.
Since the sum of the values per row is constant, equal to n = length(actual_selections), only 5 values per row need to be calculated, without losing any information.
This also appears to bring a small gain in execution time.

julia> [countmap(count_ones.(acs.& c)) for c ∈ als][1:5]
5-element Vector{Dict{Int64, Int64}}:
 Dict(0 => 1140, 4 => 11, 2 => 1239, 3 => 276, 1 => 2125)
 Dict(0 => 1127, 4 => 9, 2 => 1266, 3 => 275, 1 => 2114)
 Dict(0 => 1143, 4 => 18, 2 => 1239, 3 => 270, 1 => 2121)
 Dict(0 => 1151, 4 => 13, 2 => 1239, 3 => 277, 1 => 2111)
 Dict(0 => 1128, 4 => 17, 5 => 1, 2 => 1269, 3 => 253, 1 => 2123)

julia> [countmap(filter(!=(1),count_ones.(acs.& c))) for c ∈ als][1:5]       
5-element Vector{Dict{Int64, Int64}}:
 Dict(0 => 1140, 4 => 11, 2 => 1239, 3 => 276)
 Dict(0 => 1127, 4 => 9, 2 => 1266, 3 => 275)
 Dict(0 => 1143, 4 => 18, 2 => 1239, 3 => 270)
 Dict(0 => 1151, 4 => 13, 2 => 1239, 3 => 277)
 Dict(0 => 1128, 4 => 17, 5 => 1, 2 => 1269, 3 => 253)

julia> @btime [countmap(count_ones.(acs.& c)) for c ∈ als];
  742.389 ms (394889 allocations: 1.90 GiB)

julia> @btime [countmap(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als];   
  448.061 ms (421223 allocations: 1.48 GiB)

julia> function cm(rowsc)
           d=zeros(Int,6)
           for r in rowsc
               d[r+1]+=1
           end
           d
       end
cm (generic function with 1 method)


julia> @btime [cm(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als];
  299.383 ms (263217 allocations: 971.41 MiB)


julia> [cm(filter!(!=(1),count_ones.(acs.& c))) for c ∈ als][1:5]
5-element Vector{Vector{Int64}}:
 [1140, 0, 1239, 276, 11, 0]  # the 0 in position 2 is the hidden info: it equals n - sum(others)
 [1127, 0, 1266, 275, 9, 0]   # a 0 in position 6 means that no selection matches all 5 numbers
 [1143, 0, 1239, 270, 18, 0]
 [1151, 0, 1239, 277, 13, 0]
 [1128, 0, 1269, 253, 17, 1]
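
For completeness, a sketch of how the dropped count of single matches can be recovered afterwards from the row-sum invariant (using the names defined above):

n = length(acs)                                        # == length(actual_selections)
row = cm(filter!(!=(1), count_ones.(acs .& als[1])))   # counts with the 1-matches omitted
row[2] = n - sum(row)                                  # each row sums to n, so the gap fills itself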

This version is much less efficient than the commented-out one (the SmallVector version left in the comments below).
What is the main reason?
Do the conditional expressions prevent multiple “simultaneous” operations on chunks of data, i.e. SIMD (I can’t be more precise, but I hope you understand what I mean)?

This version does more comparisons but fewer indexing and summing operations.
I don’t know the relative cost of these machine-level operations, but I would have thought that, in principle, ifs plus a single increment beat indexing and many additions. Or not?

function count_matches(selection, q, p = length(q))
    #v = zeros(SmallVector{8,Int}, p+1)  # works for p <= 7
    a, b, c, d, e, f = zeros(Int, 6)
    for s in selection
        i = length(intersect(q, s)) + 1
        i==2 ? b+=1 :
        i==3 ? c+=1 :
        i==1 ? a+=1 :
        i==4 ? d+=1 :
        i==5 ? e+=1 :
        i==6 ? f+=1 :
        error("out of range")
        # @inbounds w = setindex(zero(v), 1, i)
        # v += w
    end
    a, b, c, d, e, f
end

I don’t know if this applies in your situation, but when it comes to vectorizing operations, ifelse(a, b, c) is often better than if a; b else c end or a ? b : c.
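
As a hedged sketch of what that might look like here (not benchmarked; intersect may well dominate the cost anyway), the chain of ternaries could be replaced by unconditional ifelse updates:

function count_matches_ifelse(selection, q)
    a = b = c = d = e = f = 0
    for s in selection
        i = length(intersect(q, s))    # number of matches, 0..5 (assuming p == 5 as above)
        a += ifelse(i == 0, 1, 0)      # every counter is updated on every iteration,
        b += ifelse(i == 1, 1, 0)      # but only one of them gets the 1
        c += ifelse(i == 2, 1, 0)
        d += ifelse(i == 3, 1, 0)
        e += ifelse(i == 4, 1, 0)
        f += ifelse(i == 5, 1, 0)
    end
    a, b, c, d, e, f
end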

Well done. I have done a fair bit of taking code from R to Julia, and I have never seen 400x. A lot of 5x.

But R can also be fast. I am managing an R project that takes days to run on a large AWS instance; in a codebase of ~10K lines, 95% of the time is spent in 3 lines of R, and the rest is spent in a few lines of data.table. I tried those 3 lines in Julia, and R was faster! data.table was much faster than Julia’s DataFrames. What happens in those 3 lines is linear algebra with small vectors (~20 elements).

I would love to maintain that in Julia, but my boss would laugh at me if I asked.

js

Did you try the “Why is R faster than Julia?” approach? :smiley:

Never heard of that! Do elaborate.

This is admittedly a few Julia versions back, but I really wanted Julia to be faster and tried everything I could think of.

It is a joke. Start a forum thread with that title and post an MWE of where that happens. Always works.

(Realistically speaking, linear algebra with small vectors may be faster in R, but normally that is solved by using static arrays, or by switching to MKL, or sometimes by restructuring the problem in a way not possible in R, fusing many sequential vector operations into a loop, sometimes with big gains.)
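
(As a hypothetical illustration of the static-arrays point, not of the actual 3 lines in question: with ~20-element vectors, SVector keeps the data on the stack, and the sequential operations can be fused into one loop without temporaries.)

using LinearAlgebra, StaticArrays

function score(xs::Vector{SVector{20,Float64}}, w::SVector{20,Float64})
    s = 0.0
    for x in xs
        s += dot(w, x)    # small-vector linear algebra, no heap allocations
    end
    s
end

xs = [rand(SVector{20,Float64}) for _ in 1:1000]  # made-up data
w = rand(SVector{20,Float64})
score(xs, w)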

I know!

I always find it funny to see how people react when jokes are taken seriously.

Yes, I have seen and done that (well, I can’t use MKL). The downside is that converting code that runs correctly in 3 lines of R, but needs something more complicated in Julia to be faster, may not be justifiable. I like what Knuth said about premature optimization being the root of all evil.

If three lines of code account for 95% of days of runtime, it sounds like anything but premature to try to optimize those.

As the others say, make a thread with an MWE and we’ll see. Incidentally, the R code I was working off of was also using data.table; it just wasn’t the right tool for the job. The h2o benchmarks suggest that DataFrames.jl is very competitive with data.table, so I’d be surprised to see data.table being “much faster” than DataFrames.

Yes, in a lot of cases yes, and I have ported several R projects to Julia.

But it will depend on the length of the job and the amount of code around it. For this project, my employer would rather rent the 96-core AWS instance running it for several more hours each month than try to save that money by rewriting. The cost of the rewrite would exceed the AWS savings.

Yes, I am guilty of exaggeration. When I did the comparison a few versions back it was perhaps a 30% time difference, and since I know data.table much better than DataFrames, that is probably just down to my ignorance.

For me, the biggest problem is how different the data.table syntax is from DataFrames, which makes it very time-consuming and risky to port R code that heavily uses data.table to Julia.