# How to make LoopVectorization work with computing Euclidean distance for every column?

Consider the code below. It works, but when I add `@avx` it gives this error:

``````
true && (view)(a, :, i)
ERROR: "Expression not recognized."
Stacktrace:
 add_operation!(::LoopVectorization.LoopSet, ::Symbol, ::Expr, ::Int64, ::Int64) at C:\Users\RTX2080\.julia\packages\LoopVectorization\OZUlx\src\graphs.jl:607
``````

How do I make LoopVectorization work for me in this case?

``````
a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

function comp(a, c)
    dist = zeros(Float32, size(a, 2))

    @inbounds @fastmath for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp(a, c)

using LoopVectorization
function comp_avx(a, c)
    dist = zeros(Float32, size(a, 2))

    @avx for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp_avx(a, c)
``````
``````
using Tullio
@tullio dist[i] = c[j] * a[j,i]
``````

It uses `@avx` internally.
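`@tullio` sums over any index that appears only on the right-hand side, so `j` above is reduced automatically. A minimal sketch of what that means, with made-up small sizes (using `:=` so the output array is created for us):

```julia
using Tullio, LinearAlgebra

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

# j appears only on the right, so Tullio sums over it;
# := allocates `dist` with the right size (here, 10).
@tullio dist[i] := c[j] * a[j,i]

dist ≈ vec(c' * a)   # same result as a matrix-vector product
```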


Do I need?

`@tullio dist[i] = sum((c[j] * a[j,i]) ^2)`

I think I just need

`@tullio dist[i] = (c[j] - a[j,i])^2`

Wow! It’s nearly 5 times faster! Thanks!

``````
using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2

    dist
end
@time comp_tullio(a, c)

@benchmark comp_tullio(a, c)

all(comp_tullio(a, c) .≈ comp(a, c))
``````

For interest, I have posted on SO for a faster Python solution I can compare to. I am sure R doesn’t have anything like it.

Does anyone have matlab and can do a benchmark?

The cost of your first function is all those slices: while the view is cheap, `c .* view` allocates a new array for every column. Writing it out, something like this, avoids them:

``````
acc = 0f0
for k in axes(a, 1)
    acc += (c[k] * a[k,i])^2
end
dist[i] = acc
``````

This is what `@tullio` writes, and also the form `@avx` is happiest to act on. This one is simple enough that it gets optimised well by Julia alone, I don’t actually see a speedup from `@avx`.
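The allocation is easy to see directly; a small check (the helper `slicesum` is just for illustration, and the first call warms up compilation so `@allocated` measures only the runtime allocation):

```julia
a = rand(Float32, 128, 100)
c = rand(Float32, 128)

slicesum(c, v) = sum(c .* v .^ 2)   # the broadcast materialises a temporary vector

v = @view a[:, 1]           # the view itself does not copy
slicesum(c, v)              # warm-up call
@allocated slicesum(c, v)   # > 0 bytes: one temporary per column
```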

Note that this is still squared Euclidean distance. And that Distances.jl is pretty fast at these things. I thought these should agree but perhaps I need another coffee:

``````
julia> d2 = Distances.colwise(Euclidean(), c, a);

julia> d2 ≈ sqrt.(comp(a, c))
false

julia> d2 ≈ @tullio d[i] := sqrt <| (c[k] * a[k,i])^2
false
``````
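For what it’s worth, the disagreement above seems to come from multiplying `c` by the column instead of subtracting it; with subtraction the two do line up in a quick check (small, made-up sizes):

```julia
using Tullio, Distances

a = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = Distances.colwise(Euclidean(), c, a)

# subtract instead of multiply, then take sqrt of the reduction
d3 = @tullio d[i] := sqrt <| (c[k] - a[k,i])^2

d2 ≈ d3
```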

Finally note that you can use `:=` with `@tullio` to make a new array the right size internally.
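A minimal sketch of the `=` vs `:=` distinction, with made-up sizes:

```julia
using Tullio

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

dist = zeros(Float32, size(a, 2))
@tullio dist[i] = (c[j] - a[j,i])^2        # = writes into the existing array

dist2 = @tullio d[i] := (c[j] - a[j,i])^2  # := allocates a new array internally

dist ≈ dist2
```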


Does anyone know how I can get this working on the GPU? It seems like Tullio.jl generates some index-accessing code.

I might write a proper GPU kernel for this, but the code below doesn’t work:

``````
using Tullio

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)
function comp_tullio(a, c)
    dist = CuArray(zeros(Float32, size(a, 2)))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)

using CUDA
CUDA.allowscalar(false)

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time comp_tullio(gpu_a, gpu_c)
``````

You will need KernelAbstractions.jl for this to work, after which it should write a GPU version.

Otherwise the usual matlab-esque way of doing this is something like this (should agree with `Distances.pairwise(SqEuclidean(), x, y; dims=2)`):

``````
pairwise2(x::AbstractMatrix, y::AbstractMatrix) = sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))
``````
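A quick check of the one-liner against a direct loop over columns (squared distances; the definitions are repeated so the snippet stands alone):

```julia
using LinearAlgebra

pairwise2(x::AbstractMatrix, y::AbstractMatrix) =
    sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))

x = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = pairwise2(c, x)
d2 ≈ [sum((c .- @view x[:, i]).^2) for i in axes(x, 2)]
```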

Performance is also a function of the sizes of the arrays you’re benchmarking.
Functions:

``````
using LoopVectorization

function comp!(dist, a, c)
    @avx for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end

function compllvm!(dist, a, c)
    @inbounds @fastmath for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end
``````
``````
julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     162.401 ms (0.00% GC)
median time:      167.070 ms (0.00% GC)
mean time:        166.839 ms (0.00% GC)
maximum time:     167.810 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     145.021 ms (0.00% GC)
median time:      145.275 ms (0.00% GC)
mean time:        146.247 ms (0.00% GC)
maximum time:     151.646 ms (0.00% GC)
--------------
samples:          35
evals/sample:     1
``````

The LLVM version is faster here, because the `@avx` version moves across `size(a,2)` more quickly, which costs it more cache misses when that dimension is this large.
However, note that LLVM is very reliant on multiples of powers of 2. See what happens if we shrink `size(a,1)` by 1.

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     169.176 ms (0.00% GC)
median time:      169.942 ms (0.00% GC)
mean time:        171.480 ms (0.00% GC)
maximum time:     176.977 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     273.203 ms (0.00% GC)
median time:      284.456 ms (0.00% GC)
mean time:        280.057 ms (0.00% GC)
maximum time:     285.689 ms (0.00% GC)
--------------
samples:          18
evals/sample:     1
``````

Back to the earlier point on quickly moving across `size(a,2)`, performance is also better when this is small:

``````
julia> a = rand(Float32, M, 2M); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.174 μs (0.00% GC)
median time:      1.181 μs (0.00% GC)
mean time:        1.187 μs (0.00% GC)
maximum time:     2.734 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.689 μs (0.00% GC)
median time:      1.703 μs (0.00% GC)
mean time:        1.706 μs (0.00% GC)
maximum time:     4.101 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10
``````

Of course, what matters is what performs best at the size range you’re actually running on.

Some interesting stats to look at (the first table is with LoopVectorization, the second is without):

``````
julia> using LinuxPerf

julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

4.842500 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.36e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.71e+08   60.0%  #  3.9% of instructions
└ branch-misses            7.44e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.84e+09  100.0%  #  4.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

4.568975 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.86e+10   60.0%  #  4.1 cycles per ns
┌ instructions             8.22e+09   60.0%  #  0.4 insns per cycle
│ branch-instructions      6.23e+08   60.0%  #  7.6% of instructions
└ branch-misses            5.36e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.57e+09  100.0%  #  4.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

This was the original size where the llvm version was faster. 30 repetitions took 4.8 seconds with LoopVectorization vs 4.6 without.
But LoopVectorization calculated the answer with `4.36e+09` total instructions, while LLVM required nearly twice as many at `8.22e+09`.
Of course, what we really care about is runtime, not total instructions used.

This CPU can issue multiple instructions per clock cycle, but this code ran at only 0.2 and 0.4 instructions per cycle (IPC) with and without LV, respectively. That’s because performance was dominated by `L1-dcache-load-misses`, where LoopVectorization does a lot worse at the moment.

I plan on eventually having LoopVectorization optimize cache performance as well, but it doesn’t yet. So in problems dominated by cache misses, LoopVectorization might fare poorly even if it does well in reducing the number of instructions required to calculate the answer.

EDIT:
Just for fun, the `size(a,1) == 127` example (first table w/ lv, second w/out):

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> size(a)
(127, 4153344)

5.164595 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.10e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.42e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.87e+08   60.0%  #  4.2% of instructions
└ branch-misses            7.48e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               5.17e+09  100.0%  #  5.2 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

8.513785 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.64e+10   60.0%  #  4.3 cycles per ns
┌ instructions             5.38e+10   60.0%  #  1.5 insns per cycle
│ branch-instructions      8.47e+09   60.0%  # 15.7% of instructions
└ branch-misses            1.25e+08   60.0%  #  1.5% of branch instructions
┌ task-clock               8.51e+09  100.0%  #  8.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

Here LoopVectorization performs much better (5.2 s vs 8.5 s), probably in large part because it uses over an order of magnitude fewer instructions…


The GPU version is only slightly faster, at `48ms`:

``````
using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for k in 1:128   # renamed from i to avoid shadowing the thread index
            buffer[j] += (a[k,j] - b[k])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a, b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)
``````

Using Distances.jl I get similar speed to the others, at about 100ms:

``````
using Distances

@benchmark colwise(Euclidean(), a, c)
``````

Out of curiosity, what GPU do you have?
With my CPU:

``````
julia> comptullio!(d, a, c) = @tullio d[i] = (c[k] * a[k,i])^2;

julia> @benchmark comptullio!($d2, $a, $c)
BenchmarkTools.Trial:
memory estimate:  17.05 KiB
allocs estimate:  245
--------------
minimum time:     25.386 ms (0.00% GC)
median time:      25.565 ms (0.00% GC)
mean time:        25.573 ms (0.00% GC)
maximum time:     25.808 ms (0.00% GC)
--------------
samples:          196
evals/sample:     1

julia> size(a)
(128, 4153344)

julia> versioninfo()
Julia Version 1.5.1-pre.29
Commit 21e0f044b5* (2020-08-19 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
``````