I want to speed up the following code that iterates over a matrix containing
n_examples vectors of size
n_features. These vectors, stored in
Integer with values in the range 1 to
n_clusters is always a small value, at most 64). These values are used to access a look up table
T containing real numbers.
It seems the code is bottleneck by the array lookups
T[ PQ_trans[n,j], j ] or
Any advice on how to make this as fast as possible?
Could this be improved using SIMD Gather instructions to get several values from
T and add them up together with a single instructions ? Maybe a special data structure for
T that has faster access times? Maybe making
T (quantizing it) so that it might fit better into the Cache?
Any help would be really appreciated.
Some important assumptions about this setting:
- We don’t care much about the presicion of elements in T (they coulde be Float16)
- We don’t care about decimal presision much in re results form re returned arrays
Here you have a MWE:
using BenchmarkTools n_clusters = 32 n_examples = 1_000_000 n_features = 128 # Look up table with real numbers T = rand(n_clusters, n_features); # Big matrix with indicies PQ = rand(1:n_clusters, n_features, n_examples) # Big matrix with indicies PQ_trans = Matrix(PQ'); function lsh(PQ, T) n_features, n_examples = size(PQ) d = zeros(eltype(T), n_examples) @avx for n in 1:n_examples res = zero(eltype(T)) for j in 1:n_features res += T[PQ[j,n],j] end d[n] = res end return d end function lsh_transposed(PQ_trans, T) n_examples, n_features = size(PQ_trans) d = zeros(eltype(T), n_examples) @inbounds @fastmath for j in 1:n_features for n in 1:n_examples d[n] += T[ PQ_trans[n,j], j ] end end return d end res_lsh_transposed = lsh_transposed(PQ_trans, T); res_lsh = lsh(PQ, T); @assert isapprox(res_lsh_transposed, res_lsh) @benchmark lsh_transposed($PQ_trans, $T) @benchmark lsh($PQ, $T)
Here I can see
lsh is faster than
@benchmark lsh_transposed(PQ_trans, T) BenchmarkTools.Trial: 64 samples with 1 evaluation. Range (min … max): 77.671 ms … 82.273 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 78.273 ms ┊ GC (median): 0.00% Time (mean ± σ): 78.419 ms ± 725.813 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
@benchmark lsh($PQ, $T) BenchmarkTools.Trial: 116 samples with 1 evaluation. Range (min … max): 42.417 ms … 50.309 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 43.132 ms ┊ GC (median): 0.00% Time (mean ± σ): 43.412 ms ± 1.002 ms ┊ GC (mean ± σ): 0.00% ± 0.00%