# How to make LoopVectorization work with computing Euclidean distance for every column?

Consider the code below. It works, but when I add `@avx` it gives this error:

``````
true && (view)(a, :, i)
ERROR: "Expression not recognized."
Stacktrace:
 add_operation!(::LoopVectorization.LoopSet, ::Symbol, ::Expr, ::Int64, ::Int64) at C:\Users\RTX2080\.julia\packages\LoopVectorization\OZUlx\src\graphs.jl:607
``````

How do I make LoopVectorization work for me in this case?

``````
a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

function comp(a, c)
    dist = zeros(Float32, size(a, 2))

    @inbounds @fastmath for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp(a, c)

using LoopVectorization
function comp_avx(a, c)
    dist = zeros(Float32, size(a, 2))

    @avx for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp_avx(a, c)
``````
``````
using Tullio
@tullio dist[i] = c[j] * a[j,i]
``````

It uses `@avx` internally.
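`@tullio` sums over any index that appears only on the right-hand side, so `j` above is reduced automatically. A minimal sketch of what that means, with made-up small sizes (using `:=` so the output array is created for us):

```julia
using Tullio, LinearAlgebra

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

# j appears only on the right, so Tullio sums over it;
# := allocates `dist` with the right size (here, 10).
@tullio dist[i] := c[j] * a[j,i]

dist ≈ vec(c' * a)   # same result as a matrix-vector product
```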


Do I need?

`@tullio dist[i] = sum((c[j] * a[j,i]) ^2)`

I think I just need

`@tullio dist[i] = (c[j] - a[j,i])^2`

Wow! It’s nearly 5 times faster! Thanks!

``````
using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2

    dist
end
@time comp_tullio(a, c)

@benchmark comp_tullio(a, c)

all(comp_tullio(a, c) .≈ comp(a, c))
``````

For interest, I have posted on SO for a faster Python solution I can compare to. I am sure R doesn’t have anything like it.

Does anyone have matlab and can do a benchmark?

The cost of your first function is all those slices: while the view is cheap, `c .* view` allocates a new array for every column. Writing it out, something like this, avoids them:

``````
acc = 0f0
for k in axes(a, 1)
    acc += (c[k] * a[k,i])^2
end
dist[i] = acc
``````

This is what `@tullio` writes, and also the form `@avx` is happiest to act on. This one is simple enough that it gets optimised well by Julia alone, I don’t actually see a speedup from `@avx`.
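The allocation is easy to see directly; a small check (the helper `slicesum` is just for illustration, and the first call warms up compilation so `@allocated` measures only the runtime allocation):

```julia
a = rand(Float32, 128, 100)
c = rand(Float32, 128)

slicesum(c, v) = sum(c .* v .^ 2)   # the broadcast materialises a temporary vector

v = @view a[:, 1]           # the view itself does not copy
slicesum(c, v)              # warm-up call
@allocated slicesum(c, v)   # > 0 bytes: one temporary per column
```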

Note that this is still squared Euclidean distance. And that Distances.jl is pretty fast at these things. I thought these should agree but perhaps I need another coffee:

``````
julia> d2 = Distances.colwise(Euclidean(), c, a);

julia> d2 ≈ sqrt.(comp(a, c))
false

julia> d2 ≈ @tullio d[i] := sqrt <| (c[k] * a[k,i])^2
false
``````
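For what it’s worth, the disagreement above seems to come from multiplying `c` by the column instead of subtracting it; with subtraction the two do line up in a quick check (small, made-up sizes):

```julia
using Tullio, Distances

a = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = Distances.colwise(Euclidean(), c, a)

# subtract instead of multiply, then take sqrt of the reduction
d3 = @tullio d[i] := sqrt <| (c[k] - a[k,i])^2

d2 ≈ d3
```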

Finally note that you can use `:=` with `@tullio` to make a new array the right size internally.
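A minimal sketch of the `=` vs `:=` distinction, with made-up sizes:

```julia
using Tullio

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

dist = zeros(Float32, size(a, 2))
@tullio dist[i] = (c[j] - a[j,i])^2        # = writes into the existing array

dist2 = @tullio d[i] := (c[j] - a[j,i])^2  # := allocates a new array internally

dist ≈ dist2
```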


Does anyone know how I can get this working on the GPU? It seems like Tullio.jl generates some index-accessing code.

I might write a proper GPU kernel for this, but the code below doesn’t work:

``````
using Tullio

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)
function comp_tullio(a, c)
    dist = CuArray(zeros(Float32, size(a, 2)))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)

using CUDA
CUDA.allowscalar(false)

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time comp_tullio(gpu_a, gpu_c)
``````

You will need KernelAbstractions.jl for this to work, after which it should write a GPU version.

Otherwise the usual matlab-esque way of doing this is something like this (should agree with `Distances.pairwise(SqEuclidean(), x, y; dims=2)`):

``````
pairwise2(x::AbstractMatrix, y::AbstractMatrix) = sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))
``````
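A quick check of the one-liner against a direct loop over columns (squared distances; the definitions are repeated so the snippet stands alone):

```julia
using LinearAlgebra

pairwise2(x::AbstractMatrix, y::AbstractMatrix) =
    sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))

x = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = pairwise2(c, x)
d2 ≈ [sum((c .- @view x[:, i]).^2) for i in axes(x, 2)]
```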

Performance is also a function of the sizes of the arrays you’re benchmarking.
Functions:

``````
using LoopVectorization

function comp!(dist, a, c)
    @avx for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end

function compllvm!(dist, a, c)
    @inbounds @fastmath for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end
``````
``````
julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     162.401 ms (0.00% GC)
median time:      167.070 ms (0.00% GC)
mean time:        166.839 ms (0.00% GC)
maximum time:     167.810 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     145.021 ms (0.00% GC)
median time:      145.275 ms (0.00% GC)
mean time:        146.247 ms (0.00% GC)
maximum time:     151.646 ms (0.00% GC)
--------------
samples:          35
evals/sample:     1
``````

The LLVM version is faster here, because the `@avx` version moves across `size(a,2)` more quickly, which costs it more cache misses when that dimension is this large.
However, note that LLVM is very reliant on multiples of powers of 2. See what happens if we shrink `size(a,1)` by 1.

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     169.176 ms (0.00% GC)
median time:      169.942 ms (0.00% GC)
mean time:        171.480 ms (0.00% GC)
maximum time:     176.977 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     273.203 ms (0.00% GC)
median time:      284.456 ms (0.00% GC)
mean time:        280.057 ms (0.00% GC)
maximum time:     285.689 ms (0.00% GC)
--------------
samples:          18
evals/sample:     1
``````

Back to the earlier point on quickly moving across `size(a,2)`, performance is also better when this is small:

``````
julia> a = rand(Float32, M, 2M); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.174 μs (0.00% GC)
median time:      1.181 μs (0.00% GC)
mean time:        1.187 μs (0.00% GC)
maximum time:     2.734 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.689 μs (0.00% GC)
median time:      1.703 μs (0.00% GC)
mean time:        1.706 μs (0.00% GC)
maximum time:     4.101 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10
``````

Of course, what matters is what performs best at the size range you’re actually running on.

Some interesting stats to look at (the first table is with LoopVectorization, the second is without):

``````
julia> using LinuxPerf

julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

4.842500 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.36e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.71e+08   60.0%  #  3.9% of instructions
└ branch-misses            7.44e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.84e+09  100.0%  #  4.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

4.568975 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.86e+10   60.0%  #  4.1 cycles per ns
┌ instructions             8.22e+09   60.0%  #  0.4 insns per cycle
│ branch-instructions      6.23e+08   60.0%  #  7.6% of instructions
└ branch-misses            5.36e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.57e+09  100.0%  #  4.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

This was the original size where the llvm version was faster. 30 repetitions took 4.8 seconds with LoopVectorization vs 4.6 without.
But LoopVectorization calculated the answer with `4.36e+09` total instructions, while LLVM required nearly twice as many at `8.22e+09`.
Of course, what we really care about is runtime, not total instructions used.

This CPU can issue multiple instructions per clock cycle, but this code ran at only 0.2 and 0.4 instructions per cycle (IPC) with and without LV, respectively. That’s because performance was dominated by `L1-dcache-load-misses`, where LoopVectorization does a lot worse at the moment.

I plan on eventually having LoopVectorization optimize cache performance as well, but it doesn’t yet. So in problems dominated by cache misses, LoopVectorization might fare poorly even if it does well in reducing the number of instructions required to calculate the answer.

EDIT:
Just for fun, the `size(a,1) == 127` example (first table w/ lv, second w/out):

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> size(a)
(127, 4153344)

5.164595 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.10e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.42e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.87e+08   60.0%  #  4.2% of instructions
└ branch-misses            7.48e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               5.17e+09  100.0%  #  5.2 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

8.513785 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.64e+10   60.0%  #  4.3 cycles per ns
┌ instructions             5.38e+10   60.0%  #  1.5 insns per cycle
│ branch-instructions      8.47e+09   60.0%  # 15.7% of instructions
└ branch-misses            1.25e+08   60.0%  #  1.5% of branch instructions
┌ task-clock               8.51e+09  100.0%  #  8.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

Here LoopVectorization performs much better (5.2 s vs 8.5 s), probably in large part because it uses over an order of magnitude fewer instructions…


The GPU version is only slightly faster, at `48ms`:

``````
using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for k in 1:128   # renamed from i to avoid shadowing the thread index
            buffer[j] += (a[k,j] - b[k])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a, b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)
``````

Using Distances.jl I get similar speed to the others, at about 100ms:

``````
using Distances

@benchmark colwise(Euclidean(), a, c)
``````

Out of curiosity, what GPU do you have?
With my CPU:

``````
julia> comptullio!(d, a, c) = @tullio d[i] = (c[k] * a[k,i])^2;

julia> @benchmark comptullio!($d2, $a, $c)
BenchmarkTools.Trial:
memory estimate:  17.05 KiB
allocs estimate:  245
--------------
minimum time:     25.386 ms (0.00% GC)
median time:      25.565 ms (0.00% GC)
mean time:        25.573 ms (0.00% GC)
maximum time:     25.808 ms (0.00% GC)
--------------
samples:          196
evals/sample:     1

julia> size(a)
(128, 4153344)

julia> versioninfo()
Julia Version 1.5.1-pre.29
Commit 21e0f044b5* (2020-08-19 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
``````