# Advice for efficient code with many lookup-table accesses

Dear Julians,

I want to speed up the following code, which iterates over a matrix containing `n_examples` vectors of size `n_features`. These vectors, stored in `PQ`, have an integer `eltype` with values in the range 1 to `n_clusters` (here `n_clusters` is always small, at most 64). The values are used to index into a lookup table `T` containing real numbers.

The code seems to be bottlenecked by the array lookups `T[PQ_trans[n,j], j]` or `T[PQ[j,n], j]`.
Any advice on how to make this as fast as possible?

Could this be improved using SIMD gather instructions to fetch several values from `T` and add them up with a single instruction? Maybe a special data structure for `T` with faster access times? Maybe quantizing `T` so that it fits better into the cache?

Any help would be really appreciated.

• We don’t care much about the precision of the elements in `T` (they could be `Float16`).
• We don’t care much about the decimal precision of the results in the returned array `d`.
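Given those two points, one low-effort idea (a sketch I have not benchmarked) is to store `T` in a narrower float type so the whole table sits more comfortably in cache:

```julia
n_clusters, n_features = 32, 128

T = rand(n_clusters, n_features)  # Float64: 32 * 128 * 8 bytes = 32 KiB

# Narrower element types shrink the table's cache footprint:
T32 = Float32.(T)  # 16 KiB
T16 = Float16.(T)  # 8 KiB; beware that many CPUs have no native Float16
                   # arithmetic, so it can end up slower despite the smaller size
```

On most x86 chips `Float16` is converted to `Float32` for arithmetic, so `Float32` may be the sweet spot.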

Here is an MWE:

``````julia
using BenchmarkTools
using LoopVectorization  # provides @avx, used in lsh below

n_clusters = 32
n_examples = 1_000_000
n_features = 128

# Look up table with real numbers
T = rand(n_clusters, n_features);
# Big matrix with indices
PQ = rand(1:n_clusters, n_features, n_examples)
# Transposed copy of the index matrix
PQ_trans = Matrix(PQ');

function lsh(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)

    @avx for n in 1:n_examples
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end

function lsh_transposed(PQ_trans, T)
    n_examples, n_features = size(PQ_trans)
    d = zeros(eltype(T), n_examples)
    @inbounds @fastmath for j in 1:n_features
        for n in 1:n_examples
            d[n] += T[PQ_trans[n, j], j]
        end
    end
    return d
end

res_lsh_transposed = lsh_transposed(PQ_trans, T);
res_lsh = lsh(PQ, T);
@assert isapprox(res_lsh_transposed, res_lsh)

@benchmark lsh_transposed($PQ_trans, $T)
@benchmark lsh($PQ, $T)
``````

Here I can see that `lsh` is faster than `lsh_transposed`:

``````
@benchmark lsh_transposed(PQ_trans, T)
BenchmarkTools.Trial: 64 samples with 1 evaluation.
Range (min … max):  77.671 ms …  82.273 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     78.273 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   78.419 ms ± 725.813 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
``````
``````
@benchmark lsh($PQ, $T)
BenchmarkTools.Trial: 116 samples with 1 evaluation.
Range (min … max):  42.417 ms … 50.309 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     43.132 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   43.412 ms ±  1.002 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
``````

This is a tricky one; your code already looks very clean.

This was a bit faster on my computer

``````julia
using StaticArrays

PQ = [MVector{n_features,UInt8}(rand(1:n_clusters, n_features)) for i in 1:n_examples]

function lsh(PQ, T)
    n_features, n_examples = length(PQ[1]), length(PQ)
    d = Array{eltype(T)}(undef, n_examples)
    @inbounds for n in 1:n_examples
        res = zero(eltype(T))
        p = PQ[n]
        @simd for j in 1:n_features
            res += T[p[j], j]
        end
        d[n] = res
    end
    return d
end
``````

It also looks like the kind of setup where GPUs or multi-threading could shine…
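For multi-threading specifically, here is a minimal sketch (the function name is mine; it assumes Julia was started with several threads, e.g. `julia -t auto`). Each iteration writes to a distinct `d[n]`, so the outer loop is safe to parallelize:

```julia
# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
PQ = rand(1:n_clusters, n_features, n_examples)

function lsh_threaded(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    Threads.@threads for n in 1:n_examples  # one chunk of examples per thread
        res = zero(eltype(T))
        @inbounds for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end
```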

(Side note: with BenchmarkTools it is good to interpolate the inputs to keep timings consistent, e.g. `@btime lsh($PQ, $T)`.)

Thanks for the tip. I am running on an Apple silicon chip, and your version seems a bit slower than the `lsh` provided above. I will try on an x86 machine.

How about preallocating the result array? Would that make sense for you?

I am not sure what you mean, but here the arrays can be considered immutable or constant.

I mean that instead of creating a new `d` each time and returning it from the function, you could pass `d` as a parameter and mutate it. This would presumably ease the load on the GC.
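Concretely, a mutating variant (hypothetical name `lsh!`, following Julia's `!` convention for mutating functions) could look like this:

```julia
# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
PQ = rand(1:n_clusters, n_features, n_examples)

function lsh!(d, PQ, T)
    n_features, n_examples = size(PQ)
    @inbounds for n in 1:n_examples
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res  # write into the caller-supplied buffer
    end
    return d
end

# Allocate once, reuse across calls:
d = zeros(eltype(T), n_examples)
lsh!(d, PQ, T)
```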

That could also be part of the solution, yes. But the impact of allocating `d` seems quite small relative to the overall computation.

Oh, I see you added `@avx`; I was comparing against the previous version. `LoopVectorization` does essentially the same thing as the StaticArrays approach, maybe even faster. Did you try `@tturbo` as well? On my machine (x86 with 6 cores) it was worth it. Also try using `UInt8` instead of `Int64` for the indices.
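Something like the following (an untested sketch; `@tturbo` is LoopVectorization's threaded variant of `@turbo`/`@avx`, and I am not certain the `UInt8` gather vectorizes equally well on every architecture):

```julia
using LoopVectorization  # assumed installed; provides @turbo and @tturbo

# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
# UInt8 indices: n_clusters ≤ 64 fits easily, and PQ then uses 8x less
# memory than with the default Int64, easing cache pressure.
PQ8 = rand(UInt8(1):UInt8(n_clusters), n_features, n_examples)

function lsh_tturbo(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    @tturbo for n in 1:n_examples  # threaded + SIMD-vectorized
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end
```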