Dear Julians,

I want to speed up the following code that iterates over a matrix containing `n_examples`

vectors of size `n_features`

. These vectors, stored in `PQ`

, have `eltype`

`Integer`

with values in the range 1 to `n_clusters`

(here `n_clusters`

is always a small value, at most 64). These values are used to access a look up table `T`

containing real numbers.

It seems the code is bottleneck by the array lookups `T[ PQ_trans[n,j], j ] `

or `T[PQ[j,n],j]`

.

Any advice on how to make this as fast as possible?

Could this be improved using SIMD Gather instructions to get several values from `T`

and add them up together with a single instructions ? Maybe a special data structure for `T`

that has faster access times? Maybe making `T`

(quantizing it) so that it might fit better into the Cache?

Any help would be really appreciated.

Some important assumptions about this setting:

- We don’t care much about the presicion of elements in T (they coulde be Float16)
- We don’t care about decimal presision much in re results form re returned arrays
`d`

Here you have a MWE:

```
using BenchmarkTools
n_clusters = 32
n_examples = 1_000_000
n_features = 128
# Look up table with real numbers
T = rand(n_clusters, n_features);
# Big matrix with indicies
PQ = rand(1:n_clusters, n_features, n_examples)
# Big matrix with indicies
PQ_trans = Matrix(PQ');
function lsh(PQ, T)
n_features, n_examples = size(PQ)
d = zeros(eltype(T), n_examples)
@avx for n in 1:n_examples
res = zero(eltype(T))
for j in 1:n_features
res += T[PQ[j,n],j]
end
d[n] = res
end
return d
end
function lsh_transposed(PQ_trans, T)
n_examples, n_features = size(PQ_trans)
d = zeros(eltype(T), n_examples)
@inbounds @fastmath for j in 1:n_features
for n in 1:n_examples
d[n] += T[ PQ_trans[n,j], j ]
end
end
return d
end
res_lsh_transposed = lsh_transposed(PQ_trans, T);
res_lsh = lsh(PQ, T);
@assert isapprox(res_lsh_transposed, res_lsh)
@benchmark lsh_transposed($PQ_trans, $T)
@benchmark lsh($PQ, $T)
```

Here I can see `lsh`

is faster than `lsh_transposed`

```
@benchmark lsh_transposed(PQ_trans, T)
BenchmarkTools.Trial: 64 samples with 1 evaluation.
Range (min … max): 77.671 ms … 82.273 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 78.273 ms ┊ GC (median): 0.00%
Time (mean ± σ): 78.419 ms ± 725.813 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
```

```
@benchmark lsh($PQ, $T)
BenchmarkTools.Trial: 116 samples with 1 evaluation.
Range (min … max): 42.417 ms … 50.309 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 43.132 ms ┊ GC (median): 0.00%
Time (mean ± σ): 43.412 ms ± 1.002 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
```