Dear Julians,
I want to speed up the following code that iterates over a matrix containing n_examples
vectors of size n_features
. These vectors, stored in PQ
, have eltype
Integer
with values in the range 1 to n_clusters
(here n_clusters
is always a small value, at most 64). These values are used to access a look up table T
containing real numbers.
It seems the code is bottleneck by the array lookups T[ PQ_trans[n,j], j ]
or T[PQ[j,n],j]
.
Any advice on how to make this as fast as possible?
Could this be improved using SIMD Gather instructions to get several values from T
and add them up together with a single instructions ? Maybe a special data structure for T
that has faster access times? Maybe making T
(quantizing it) so that it might fit better into the Cache?
Any help would be really appreciated.
Some important assumptions about this setting:
- We don’t care much about the presicion of elements in T (they coulde be Float16)
- We don’t care about decimal presision much in re results form re returned arrays
d
Here you have a MWE:
using BenchmarkTools
n_clusters = 32
n_examples = 1_000_000
n_features = 128
# Look up table with real numbers
T = rand(n_clusters, n_features);
# Big matrix with indicies
PQ = rand(1:n_clusters, n_features, n_examples)
# Big matrix with indicies
PQ_trans = Matrix(PQ');
function lsh(PQ, T)
n_features, n_examples = size(PQ)
d = zeros(eltype(T), n_examples)
@avx for n in 1:n_examples
res = zero(eltype(T))
for j in 1:n_features
res += T[PQ[j,n],j]
end
d[n] = res
end
return d
end
function lsh_transposed(PQ_trans, T)
n_examples, n_features = size(PQ_trans)
d = zeros(eltype(T), n_examples)
@inbounds @fastmath for j in 1:n_features
for n in 1:n_examples
d[n] += T[ PQ_trans[n,j], j ]
end
end
return d
end
res_lsh_transposed = lsh_transposed(PQ_trans, T);
res_lsh = lsh(PQ, T);
@assert isapprox(res_lsh_transposed, res_lsh)
@benchmark lsh_transposed($PQ_trans, $T)
@benchmark lsh($PQ, $T)
Here I can see lsh
is faster than lsh_transposed
@benchmark lsh_transposed(PQ_trans, T)
BenchmarkTools.Trial: 64 samples with 1 evaluation.
Range (min … max): 77.671 ms … 82.273 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 78.273 ms ┊ GC (median): 0.00%
Time (mean ± σ): 78.419 ms ± 725.813 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark lsh($PQ, $T)
BenchmarkTools.Trial: 116 samples with 1 evaluation.
Range (min … max): 42.417 ms … 50.309 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.132 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.412 ms ± 1.002 ms ┊ GC (mean ± σ): 0.00% ± 0.00%