How to make LoopVectorization work with computing Euclidean distance for every column?

Consider the code below. It works, but when I add @avx it gives this error:

true && (view)(a, :, i)
ERROR: "Expression not recognized."
Stacktrace:
 [1] add_operation!(::LoopVectorization.LoopSet, ::Symbol, ::Expr, ::Int64, ::Int64) at C:\Users\RTX2080\.julia\packages\LoopVectorization\OZUlx\src\graphs.jl:607
 [2] add_parent!(::Array{LoopVectorization.Operation,1}, ::Array{Symbol,1}, ::Array{Symbol,1}, ::LoopVectorization.LoopSet, ::Expr, ::Int64, ::Int64) at C:\Users\RTX2080\.julia\packages\LoopVectorization\OZUlx\src\add_compute.jl:68

How do I make LoopVectorization work for me in this case?


a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)


function comp(a, c)
    dist = zeros(Float32, size(a, 2))

    @inbounds @fastmath for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp(a, c)

using LoopVectorization
function comp_avx(a, c)
    dist = zeros(Float32, size(a, 2))

    @avx for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp_avx(a, c)
using Tullio
@tullio dist[i] = c[j] * a[j,i]

It uses @avx internally.
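
For clarity, @tullio reduces over any index that appears only on the right-hand side, so the sum over j here is implicit; a minimal sketch, using := to allocate the result:

using Tullio
@tullio dist[i] := c[j] * a[j,i]  # sums over j automatically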

Do I need this?

@tullio dist[i] = sum((c[j] * a[j,i]) ^2)

I think I just need:

@tullio dist[i] = (c[j] - a[j,i])^2

Wow! It’s nearly 5 times faster! Thanks!


using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2

    dist
end
@time comp_tullio(a, c)

using BenchmarkTools
@benchmark comp_tullio(a, c)

all(comp_tullio(a, c) .≈ comp(a, c))

For interest, I have posted on SO for a faster Python solution I can compare to. I am sure R doesn’t have anything like it.

Does anyone have Matlab and could run a benchmark?

The cost of your first function is all those slices: while the view is cheap, c .* view allocates a new array on every iteration. Writing it out, something like this, avoids them:

acc = 0f0
for k in axes(a, 1)
    acc += (c[k] * a[k, i])^2
end
dist[i] = acc

This is what @tullio writes, and it is also the form @avx is happiest to act on. This one is simple enough that it gets optimised well by Julia alone; I don’t actually see a speedup from @avx.
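
Written out in full, that looks something like this (a sketch; comp2 is just an illustrative name):

function comp2(a, c)
    dist = zeros(Float32, size(a, 2))
    @inbounds @fastmath for i in axes(a, 2)
        acc = 0f0
        for k in axes(a, 1)
            acc += (c[k] * a[k, i])^2  # scalar ops only, no temporary arrays
        end
        dist[i] = acc
    end
    dist
end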

Note that this is still the squared Euclidean distance, and that Distances.jl is pretty fast at these things. I thought these should agree, but perhaps I need another coffee:

julia> d2 = Distances.colwise(Euclidean(), c, a);

julia> d2 ≈ sqrt.(comp(a, c))
false

julia> d2 ≈ @tullio d[i] := sqrt <| (c[k] * a[k,i])^2
false

Finally, note that you can use := with @tullio to make a new array of the right size internally.
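
For example (a small sketch):

@tullio d[i] := (c[k] * a[k,i])^2  # := allocates d with length size(a, 2)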

Does anyone know how I can get it working on the GPU? It seems like Tullio.jl generates scalar index-accessing code.

I might write a proper GPU kernel for the below, but this doesn’t work:

using Tullio

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)
function comp_tullio(a, c)
    dist = CuArray(zeros(Float32, size(a, 2)))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)

using CUDA
CUDA.allowscalar(false)

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time comp_tullio(gpu_a, gpu_c)

You will need KernelAbstractions.jl for this to work; once it is loaded, @tullio should write a GPU version.
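
A sketch of what that might look like (assuming the Tullio version current at the time, where loading KernelAbstractions alongside CUDA is what lets @tullio emit a GPU kernel):

using Tullio, KernelAbstractions, CUDA
CUDA.allowscalar(false)

function comp_tullio_gpu(a, c)
    @tullio dist[i] := (c[j] - a[j,i])^2  # := allocates dist matching the input array type
end

gpu_a = CuArray(a); gpu_c = CuArray(c);
CUDA.@sync comp_tullio_gpu(gpu_a, gpu_c)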

Otherwise, the usual Matlab-esque way of doing this is something like the following (it should agree with Distances.pairwise(SqEuclidean(), x, y; dims=2)):

pairwise2(x::AbstractMatrix, y::AbstractMatrix) = sum(x.*x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x,:,1), y))
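
A quick sanity check of that claim, on hypothetical small sizes:

using Distances
x = rand(Float32, 128, 10); y = rand(Float32, 128, 20);
pairwise2(x, y) ≈ pairwise(SqEuclidean(), x, y; dims=2)  # expect true, up to roundoff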

Performance is also a function of the sizes of the arrays you’re benchmarking.
Functions:

using LoopVectorization

function comp!(dist, a, c)
    @avx for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a,1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end

function compllvm!(dist, a, c)
    @inbounds @fastmath for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a,1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end
julia> M, N = 128, 4153344;  # matching the sizes in the original example

julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     162.401 ms (0.00% GC)
  median time:      167.070 ms (0.00% GC)
  mean time:        166.839 ms (0.00% GC)
  maximum time:     167.810 ms (0.00% GC)
  --------------
  samples:          30
  evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     145.021 ms (0.00% GC)
  median time:      145.275 ms (0.00% GC)
  mean time:        146.247 ms (0.00% GC)
  maximum time:     151.646 ms (0.00% GC)
  --------------
  samples:          35
  evals/sample:     1

The LLVM version is faster at this size; the @avx version’s strength is iterating across size(a,2) quickly.
However, note that LLVM is very reliant on multiples of powers of 2. See what happens if we shrink size(a,1) by 1.
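
To make the power-of-2 point concrete (assuming the 16 Float32 lanes of an AVX-512 vector, as on the cascadelake CPU in the versioninfo below):

julia> rem(128, 16), rem(127, 16)  # 128 splits into whole vectors; 127 leaves a remainder
(0, 15)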

julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     169.176 ms (0.00% GC)
  median time:      169.942 ms (0.00% GC)
  mean time:        171.480 ms (0.00% GC)
  maximum time:     176.977 ms (0.00% GC)
  --------------
  samples:          30
  evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     273.203 ms (0.00% GC)
  median time:      284.456 ms (0.00% GC)
  mean time:        280.057 ms (0.00% GC)
  maximum time:     285.689 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1

Back to the earlier point on quickly moving across size(a,2), performance is also better when this is small:

julia> a = rand(Float32, M, 2M); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.174 μs (0.00% GC)
  median time:      1.181 μs (0.00% GC)
  mean time:        1.187 μs (0.00% GC)
  maximum time:     2.734 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.689 μs (0.00% GC)
  median time:      1.703 μs (0.00% GC)
  mean time:        1.706 μs (0.00% GC)
  maximum time:     4.101 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Of course, what matters is what performs best at the size range you’re actually running on.

Some interesting stats to look at (the first table is with LoopVectorization, the second is without):

julia> using LinuxPerf

julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))  # run f(args...) N times
foreachf! (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(comp!, 30, d, a, c))
  4.842500 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.36e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.71e+08   60.0%  #  3.9% of instructions
└ branch-misses            7.44e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.84e+09  100.0%  #  4.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.01e+09   20.0%  # 88.4% of dcache loads
│ L1-dcache-loads          1.14e+09   20.0%
└ L1-icache-load-misses    1.68e+04   20.0%
┌ dTLB-load-misses         1.45e+03   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.14e+09   20.0%
┌ iTLB-load-misses         6.95e+02   40.0%  # 646.5% of iTLB loads
└ iTLB-loads               1.08e+02   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(compllvm!, 30, d, a, c))
  4.568975 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.86e+10   60.0%  #  4.1 cycles per ns
┌ instructions             8.22e+09   60.0%  #  0.4 insns per cycle
│ branch-instructions      6.23e+08   60.0%  #  7.6% of instructions
└ branch-misses            5.36e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.57e+09  100.0%  #  4.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    1.01e+09   20.0%  # 50.4% of dcache loads
│ L1-dcache-loads          1.99e+09   20.0%
└ L1-icache-load-misses    1.72e+04   20.0%
┌ dTLB-load-misses         4.20e+02   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.99e+09   20.0%
┌ iTLB-load-misses         6.55e+02   40.0%  # 708.1% of iTLB loads
└ iTLB-loads               9.25e+01   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This was the original size, where the LLVM version was faster: 30 repetitions took 4.8 seconds with LoopVectorization vs 4.6 seconds without.
But LoopVectorization calculated the answer with 4.36e+09 total instructions, while LLVM required nearly twice as many at 8.22e+09.
Of course, what we really care about is runtime, not total instructions used.

This CPU can issue multiple instructions per clock, but this code ran at only 0.2 instructions per cycle (IPC) with LV and 0.4 without. That’s because performance was dominated by L1-dcache-load-misses, where LoopVectorization does a lot worse at the moment.
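
Spelling that arithmetic out from the tables above (instructions divided by cpu-cycles):

4.36e9 / 1.97e10  # ≈ 0.22 insns per cycle with LoopVectorization
8.22e9 / 1.86e10  # ≈ 0.44 insns per cycle without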

I plan on eventually having LoopVectorization optimize cache performance as well, but it doesn’t yet. So in problems dominated by cache misses, LoopVectorization might fare poorly even if it does well in reducing the number of instructions required to calculate the answer.

EDIT:
Just for fun, the size(a,1) == 127 example (first table w/ lv, second w/out):

julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> size(a)
(127, 4153344)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(comp!, 30, d, a, c))
  5.164595 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.10e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.42e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.87e+08   60.0%  #  4.2% of instructions
└ branch-misses            7.48e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               5.17e+09  100.0%  #  5.2 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    9.97e+08   20.0%  # 87.7% of dcache loads
│ L1-dcache-loads          1.14e+09   20.0%
└ L1-icache-load-misses    3.28e+04   20.0%
┌ dTLB-load-misses         3.06e+04   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.14e+09   20.0%
┌ iTLB-load-misses         4.60e+02   40.0%  # 142.6% of iTLB loads
└ iTLB-loads               3.23e+02   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" (@time foreachf!(compllvm!, 30, d, a, c))
  8.513785 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.64e+10   60.0%  #  4.3 cycles per ns
┌ instructions             5.38e+10   60.0%  #  1.5 insns per cycle
│ branch-instructions      8.47e+09   60.0%  # 15.7% of instructions
└ branch-misses            1.25e+08   60.0%  #  1.5% of branch instructions
┌ task-clock               8.51e+09  100.0%  #  8.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    9.98e+08   20.0%  #  6.0% of dcache loads
│ L1-dcache-loads          1.67e+10   20.0%
└ L1-icache-load-misses    3.63e+04   20.0%
┌ dTLB-load-misses         1.67e+04   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.67e+10   20.0%
┌ iTLB-load-misses         1.07e+03   40.0%  # 276.1% of iTLB loads
└ iTLB-loads               3.88e+02   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Here LoopVectorization performs substantially better (5.2 vs 8.5 seconds), probably in large part because it uses over an order of magnitude fewer instructions…

The GPU version is only slightly faster, at 48 ms:

using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    # grid-stride loop: each thread starts at its global index and steps by the total thread count
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for k in 1:128  # hardcodes size(a, 1) == 128
            buffer[j] += (a[k, j] - b[k])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a,  b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)

Using Distances.jl I get similar speed to the others, at 100 ms:

using Distances

@benchmark colwise(Euclidean(), a, c)

Out of curiosity, what GPU do you have?
With my CPU:

julia> comptullio!(d, a, c) = @tullio d[i] = (c[k] * a[k,i])^2;

julia> @benchmark comptullio!($d2, $a, $c)
BenchmarkTools.Trial:
  memory estimate:  17.05 KiB
  allocs estimate:  245
  --------------
  minimum time:     25.386 ms (0.00% GC)
  median time:      25.565 ms (0.00% GC)
  mean time:        25.573 ms (0.00% GC)
  maximum time:     25.808 ms (0.00% GC)
  --------------
  samples:          196
  evals/sample:     1

julia> size(a)
(128, 4153344)

julia> versioninfo()
Julia Version 1.5.1-pre.29
Commit 21e0f044b5* (2020-08-19 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 18

I got an RTX 2080 (no Ti).

You have an i9 too.