# Computing the L1 norm between vectors: the Julian way

I need to code a function that computes the L1 distance between two vectors. It doesn’t need to be super precise (it serves as an error estimate in the stopping criterion of an algorithm), but it needs to be fast, since it will be invoked quite frequently. I guess the most straightforward way to write this in Julia would be:

```julia
function L1_vec(a, b)
    sum(abs.(a .- b))
end
```

But of course this allocates a temporary vector. The results of a simple benchmark are:

```julia
julia> N = 1000; a = rand(N); b = rand(N); @benchmark L1_vec($a, $b)

BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     856.400 ns (0.00% GC)
  median time:      1.004 μs (0.00% GC)
  mean time:        1.412 μs (18.29% GC)
  maximum time:     62.230 μs (91.36% GC)
  --------------
  samples:          10000
  evals/sample:     35
```

My second try was to simply iterate over the vectors, adding each difference to an accumulator, and to apply SIMD vectorization. All in all I ended up with the following function:

```julia
function L1_non_alloc(a, b)
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
```

This version gets rid of the unwanted allocations and is around 10x faster than the previous one in the same simple benchmark:

```julia
julia> @benchmark L1_non_alloc($a, $b)

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     103.950 ns (0.00% GC)
  median time:      120.109 ns (0.00% GC)
  mean time:        127.151 ns (0.00% GC)
  maximum time:     301.482 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     900
```

My questions now are:

1. Is there a more concise/Julian way to write something similar to `L1_non_alloc`? I am thinking of something similar to `map`, but that only saves the sum of applying some function `f` (in this case `(x, y) -> abs(x - y)`) to the input arrays.
2. Is there any obvious improvement to be made to `L1_non_alloc` to make it more performant?

Thanks a lot!


You can write it allocation-free and reasonably concisely with a generator,

```julia
sum(abs(a - b) for (a, b) in zip(a, b))
```

but it won’t give you the best performance.


You can use the functional form of `sum`:

```julia
using BenchmarkTools

x1 = rand(100);
x2 = rand(100);

sum(x -> abs(x[1] - x[2]), zip(x1, x2))
```

```julia
julia> @btime sum(x -> abs(x[1] - x[2]), zip($x1, $x2))
  99.180 ns (0 allocations: 0 bytes)
```

You can use loops:

```julia
function l1(x1, x2)
    res = zero(eltype(x1))
    @inbounds @simd for i in eachindex(x1)
        res += abs(x1[i] - x2[i])
    end
    return res
end
```

```julia
julia> @btime l1($x1, $x2)
  16.765 ns (0 allocations: 0 bytes)
```

If you have short enough vectors, you can use immutable structures like `SArray` or just plain tuples:

```julia
y1 = tuple(rand(10)...)
y2 = tuple(rand(10)...)

sum(ntuple(i -> abs(y1[i] - y2[i]), length(y1)))
```

```julia
julia> @btime sum(ntuple(i -> abs($(Ref(y1))[][i] - $(Ref(y2))[][i]), length($y1)))
  4.170 ns (0 allocations: 0 bytes)
```

It is slightly faster than the loop version:

```julia
z1 = collect(y1)
z2 = collect(y2)
```

```julia
julia> @btime l1($z1, $z2)
  6.606 ns (0 allocations: 0 bytes)
```

Not competitive in terms of performance (because it allocates), but arguably concise:

```julia
julia> using LinearAlgebra

julia> norm(a - b, 1)
345.38374238811525
```

Using [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) I get about a 4% speedup vs the `@simd`-loop version. But it sure takes time to compile. Probably not worth it unless you use LoopVectorization anyway.

```julia
function L1_avx(a, b)
    s = 0.0
    @avx for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
```

For me there is a 10x difference between the explicit loop and the generator version. That’s a bit more than I would have hoped for. Is it SIMD vs non-SIMD? It is a bit unfortunate that the difference is so large, as I was under the impression that the more abstract constructs came at little cost.
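As a sketch of the SIMD question above: the two loops below differ only in the `@simd` annotation, which licenses the compiler to reassociate the floating-point accumulation and use vector instructions. Without it, the strict left-to-right summation order has to be preserved, which blocks vectorization. The function names here are mine, for illustration:

```julia
# Identical reductions; only the @simd annotation differs.
function l1_serial(a, b)
    s = 0.0
    for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end

function l1_simd(a, b)
    s = 0.0
    @simd for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
```

The two versions may differ in the last bits of the result, since `@simd` changes the order in which the partial sums are combined.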


What about `mapreduce`?

```julia
L1_mapreduce(a, b) = mapreduce((x, y) -> abs(x - y), +, a, b)
```

```julia
julia> @benchmark L1_vec($a, $b)
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     905.077 ns (0.00% GC)
  median time:      1.060 μs (0.00% GC)
  mean time:        1.443 μs (8.61% GC)
  maximum time:     1.264 ms (98.26% GC)
  --------------
  samples:          10000
  evals/sample:     13

julia> @benchmark L1_mapreduce($a, $b)
BenchmarkTools.Trial:
  memory estimate:  8.05 KiB
  allocs estimate:  5
  --------------
  minimum time:     1.920 μs (0.00% GC)
  median time:      2.183 μs (0.00% GC)
  mean time:        2.802 μs (5.08% GC)
  maximum time:     1.454 ms (97.79% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Why does the `mapreduce` version allocate so much? I thought it would be more performant.

@mauro3, on Win10 with Julia 1.6.0, your LoopVectorization `@avx` solution seems to be the fastest on my machine, about ~30% faster than `@simd`. For the OP's example of vectors with 1000 elements it `@btime`s at 70 ns vs 97 ns.


I am not sure, but `mapfoldl` seems to be OK:

```julia
julia> L1_mapfoldl(a, b) = mapfoldl(((x, y),) -> abs(x - y), +, zip(a, b))
L1_mapfoldl (generic function with 1 method)

julia> @benchmark L1_mapfoldl($a, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.179 μs (0.00% GC)
  median time:      1.295 μs (0.00% GC)
  mean time:        1.317 μs (0.00% GC)
  maximum time:     6.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Sorry if I have missed something above, but could someone shed some light on why the OP’s function `L1_vec` seems to be faster than the plain loop versions (without `@avx` or `@simd`)?

It’s essentially a placeholder implementation; #38558 is the issue tracking this:

```julia
julia> @less mapreduce((x, y) -> abs(x - y), +, x1, x2)
mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)
```

Do you mean something like this?

```julia
function L1_loop(a, b)
    @assert length(a) == length(b)
    s = 0.0
    for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
```

On my computer the simple loop is usually faster than `L1_vec`. If `L1_vec` sometimes runs faster, I guess it may be because the `sum` function has some optimizations going on under the hood.
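One of `sum`'s optimizations can be borrowed by hand: reducing over a lazy broadcast expression avoids the temporary array while still going through `sum`'s tuned reduction machinery. This sketch relies on `Base.Broadcast` internals (`broadcasted` and `instantiate`), which work on recent Julia versions but are not a documented public API, and the name `L1_lazy` is mine:

```julia
# Sketch: sum over a lazy Broadcasted object instead of a materialized array.
# Broadcast.broadcasted builds the unevaluated element-wise expression;
# Broadcast.instantiate attaches axes so sum can reduce it without allocating.
L1_lazy(a, b) =
    sum(Broadcast.instantiate(Broadcast.broadcasted((x, y) -> abs(x - y), a, b)))
```

This gives the same result as `sum(abs.(a .- b))` but with zero allocations, since the intermediate vector is never materialized.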

@ismedina, yes, it runs in 1.0 μs (0 allocations: 0 bytes), while the vectorized version allocates but runs in 586 ns (1 allocation: 7.94 KiB).
Same result for the loop variant below:

```julia
function L1_loop2(a, b)
    s = zero(eltype(a))
    for (a, b) in zip(a, b)
        s += abs(a - b)
    end
    return s
end
```

I also find some speedup with `@avx` (especially with big vectors, up to 30%), thank you a lot! I think it needs benchmarking inside the specific algorithm to see if it can really benefit from this speedup, but if so, including the additional dependency might be justified.
