# Advice for efficient code with many lookup-table accesses

Dear Julians,

I want to speed up the following code, which iterates over a matrix containing `n_examples` vectors of size `n_features`. These vectors, stored in `PQ`, have an integer `eltype` with values in the range 1 to `n_clusters` (here `n_clusters` is always small, at most 64). The values are used to index into a lookup table `T` containing real numbers.

The code seems to be bottlenecked by the array lookups `T[PQ_trans[n,j], j]` or `T[PQ[j,n], j]`.
Any advice on how to make this as fast as possible?

Could this be improved using SIMD gather instructions to fetch several values from `T` and add them up with a single instruction? Maybe a special data structure for `T` with faster access times? Maybe quantizing `T` so that it fits better into the cache?

Any help would be really appreciated.

• We don’t care much about the precision of the elements in `T` (they could be `Float16`).
• We don’t care much about the decimal precision of the results in the returned array `d`.
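Given those two points, one low-effort idea (a sketch I have not benchmarked) is to store `T` in a narrower float type so the whole table sits more comfortably in cache:

```julia
n_clusters, n_features = 32, 128

T = rand(n_clusters, n_features)  # Float64: 32 * 128 * 8 bytes = 32 KiB

# Narrower element types shrink the table's cache footprint:
T32 = Float32.(T)  # 16 KiB
T16 = Float16.(T)  # 8 KiB; beware that many CPUs have no native Float16
                   # arithmetic, so it can end up slower despite the smaller size
```

On most x86 chips `Float16` is converted to `Float32` for arithmetic, so `Float32` may be the sweet spot.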

Here is an MWE:

``````julia
using BenchmarkTools
using LoopVectorization  # provides @avx, used in lsh below

n_clusters = 32
n_examples = 1_000_000
n_features = 128

# Look up table with real numbers
T = rand(n_clusters, n_features);
# Big matrix with indices
PQ = rand(1:n_clusters, n_features, n_examples)
# Transposed copy of the index matrix
PQ_trans = Matrix(PQ');

function lsh(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)

    @avx for n in 1:n_examples
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end

function lsh_transposed(PQ_trans, T)
    n_examples, n_features = size(PQ_trans)
    d = zeros(eltype(T), n_examples)
    @inbounds @fastmath for j in 1:n_features
        for n in 1:n_examples
            d[n] += T[PQ_trans[n, j], j]
        end
    end
    return d
end

res_lsh_transposed = lsh_transposed(PQ_trans, T);
res_lsh = lsh(PQ, T);
@assert isapprox(res_lsh_transposed, res_lsh)

@benchmark lsh_transposed($PQ_trans, $T)
@benchmark lsh($PQ, $T)
``````

Here I can see that `lsh` is faster than `lsh_transposed`:

``````
@benchmark lsh_transposed(PQ_trans, T)
BenchmarkTools.Trial: 64 samples with 1 evaluation.
Range (min … max):  77.671 ms …  82.273 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     78.273 ms               ┊ GC (median):    0.00%
Time  (mean ± σ):   78.419 ms ± 725.813 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
``````
``````
@benchmark lsh($PQ, $T)
BenchmarkTools.Trial: 116 samples with 1 evaluation.
Range (min … max):  42.417 ms … 50.309 ms  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     43.132 ms              ┊ GC (median):    0.00%
Time  (mean ± σ):   43.412 ms ±  1.002 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
``````

This is a tricky one; your code already looks very clean.

This was a bit faster on my computer

``````julia
using StaticArrays

PQ = [MVector{n_features,UInt8}(rand(1:n_clusters, n_features)) for i in 1:n_examples]

function lsh(PQ, T)
    n_features, n_examples = length(PQ[1]), length(PQ)
    d = Array{eltype(T)}(undef, n_examples)
    @inbounds for n in 1:n_examples
        res = zero(eltype(T))
        p = PQ[n]
        @simd for j in 1:n_features
            res += T[p[j], j]
        end
        d[n] = res
    end
    return d
end
``````

It also looks like the kind of setup where GPUs or multi-threading could shine…
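For multi-threading specifically, here is a minimal sketch (the function name is mine; it assumes Julia was started with several threads, e.g. `julia -t auto`). Each iteration writes to a distinct `d[n]`, so the outer loop is safe to parallelize:

```julia
# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
PQ = rand(1:n_clusters, n_features, n_examples)

function lsh_threaded(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    Threads.@threads for n in 1:n_examples  # one chunk of examples per thread
        res = zero(eltype(T))
        @inbounds for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end
```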

(Side note: with BenchmarkTools it is good to interpolate the inputs to keep timings consistent, e.g. `@btime lsh($PQ, $T)`.)

Thanks for the tip. I am running on an Apple silicon chip, and your version seems a bit slower than the `lsh` provided above. I will try on an x86 machine.

How about preallocating the result array? Would that make sense for you?

I am not sure what you mean, but here the arrays can be considered immutable or constant.

I mean that instead of creating a new `d` each time and returning it from the function, you could pass `d` as a parameter and mutate it. This would presumably ease the load on the GC.
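Concretely, a mutating variant (hypothetical name `lsh!`, following Julia's `!` convention for mutating functions) could look like this:

```julia
# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
PQ = rand(1:n_clusters, n_features, n_examples)

function lsh!(d, PQ, T)
    n_features, n_examples = size(PQ)
    @inbounds for n in 1:n_examples
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res  # write into the caller-supplied buffer
    end
    return d
end

# Allocate once, reuse across calls:
d = zeros(eltype(T), n_examples)
lsh!(d, PQ, T)
```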

That could also be part of the solution, yes. But the impact of allocating `d` seems quite small relative to the overall computation.

Oh, I see you added `@avx`; I was comparing against the previous version. `LoopVectorization` does essentially the same thing as the StaticArrays approach, maybe even faster. Did you try `@tturbo` as well? On my machine (x86 with 6 cores) it was worth it. Also try using `UInt8` instead of `Int64` for the indices.
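Something like the following (an untested sketch; `@tturbo` is LoopVectorization's threaded variant of `@turbo`/`@avx`, and I am not certain the `UInt8` gather vectorizes equally well on every architecture):

```julia
using LoopVectorization  # assumed installed; provides @turbo and @tturbo

# Small stand-in data for the MWE's larger arrays
n_clusters, n_features, n_examples = 32, 16, 1_000
T = rand(n_clusters, n_features)
# UInt8 indices: n_clusters ≤ 64 fits easily, and PQ then uses 8x less
# memory than with the default Int64, easing cache pressure.
PQ8 = rand(UInt8(1):UInt8(n_clusters), n_features, n_examples)

function lsh_tturbo(PQ, T)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    @tturbo for n in 1:n_examples  # threaded + SIMD-vectorized
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j, n], j]
        end
        d[n] = res
    end
    return d
end
```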