# Computing the L1 norm between vectors: the Julian way

I need to code a function that computes the L1 distance between two vectors. It doesn’t need to be super precise (it serves as an error estimate in the stopping criterion of an algorithm), but it needs to be fast, since it will be invoked quite frequently. I guess the most straightforward way to write this in Julia would be:

```julia
function L1_vec(a, b)
    sum(abs.(a .- b))
end
```

But of course this allocates a temporary vector. The results of a simple benchmark are:

```julia
julia> N = 1000; a = rand(N); b = rand(N); @benchmark L1_vec($a, $b)

BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     856.400 ns (0.00% GC)
  median time:      1.004 μs (0.00% GC)
  mean time:        1.412 μs (18.29% GC)
  maximum time:     62.230 μs (91.36% GC)
  --------------
  samples:          10000
  evals/sample:     35
```

My second try was to simply iterate over the vectors, adding each difference to an accumulator, and to apply SIMD vectorization. All in all I ended up with the following function:

```julia
function L1_non_alloc(a, b)
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
```

This version gets rid of the unwanted allocations and is around 10x faster than the previous one in the same simple benchmark:

```julia
julia> @benchmark L1_non_alloc($a, $b)

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     103.950 ns (0.00% GC)
  median time:      120.109 ns (0.00% GC)
  mean time:        127.151 ns (0.00% GC)
  maximum time:     301.482 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     900
```

My questions now are:

1. Is there a more concise/Julian way to write something similar to `L1_non_alloc`? I am thinking of something similar to `map`, but that only saves the sum of applying some function `f` (in this case `(x, y) -> abs(x - y)`) to the input arrays.
2. Is there any obvious improvement to be made to `L1_non_alloc` to make it more performant?

Thanks a lot!


You can write it allocation-free and reasonably concisely with a generator,

```julia
sum(abs(a - b) for (a, b) in zip(a, b))
```

but it won’t give you the best performance.


You can use the functional form of `sum`:

```julia
using BenchmarkTools

x1 = rand(100);
x2 = rand(100);

sum(x -> abs(x[1] - x[2]), zip(x1, x2))
```

```julia
julia> @btime sum(x -> abs(x[1] - x[2]), zip($x1, $x2))
  99.180 ns (0 allocations: 0 bytes)
```

You can use loops:

```julia
function l1(x1, x2)
    res = zero(eltype(x1))
    @inbounds @simd for i in eachindex(x1)
        res += abs(x1[i] - x2[i])
    end
    return res
end
```

```julia
julia> @btime l1($x1, $x2)
  16.765 ns (0 allocations: 0 bytes)
```

If you have short enough vectors, you can use immutable structures like `SArray` or just plain tuples:

```julia
y1 = tuple(rand(10)...)
y2 = tuple(rand(10)...)

sum(ntuple(i -> abs(y1[i] - y2[i]), length(y1)))
```

```julia
julia> @btime sum(ntuple(i -> abs($(Ref(y1))[][i] - $(Ref(y2))[][i]), length($y1)))
  4.170 ns (0 allocations: 0 bytes)
```

It is slightly faster than the loop version:

```julia
z1 = collect(y1)
z2 = collect(y2)
```

```julia
julia> @btime l1($z1, $z2)
  6.606 ns (0 allocations: 0 bytes)
```

Not competitive in terms of performance (because it allocates), but arguably concise:

```julia
julia> using LinearAlgebra

julia> norm(a - b, 1)
345.38374238811525
```

Using [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl) I get about a 4% speedup vs the `@simd`-loop version. But it sure takes time to compile. Probably not worth it unless you use LoopVectorization anyway.

```julia
function L1_avx(a, b)
    s = 0.0
    @avx for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
```

For me there is a 10x difference between the explicit loop and the generator version. That’s a bit more than I would have hoped for. Is it SIMD vs non-SIMD? It is a bit unfortunate that the difference is so large, as I was under the impression that the more abstract constructs came at little cost.
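As a sketch of the SIMD question above: the two loops below differ only in the `@simd` annotation, which licenses the compiler to reassociate the floating-point accumulation and use vector instructions. Without it, the strict left-to-right summation order has to be preserved, which blocks vectorization. The function names here are mine, for illustration:

```julia
# Identical reductions; only the @simd annotation differs.
function l1_serial(a, b)
    s = 0.0
    for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end

function l1_simd(a, b)
    s = 0.0
    @simd for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
```

The two versions may differ in the last bits of the result, since `@simd` changes the order in which the partial sums are combined.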


What about `mapreduce`?

```julia
L1_mapreduce(a, b) = mapreduce((x, y) -> abs(x - y), +, a, b)
```

```julia
julia> @benchmark L1_vec($a, $b)
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     905.077 ns (0.00% GC)
  median time:      1.060 μs (0.00% GC)
  mean time:        1.443 μs (8.61% GC)
  maximum time:     1.264 ms (98.26% GC)
  --------------
  samples:          10000
  evals/sample:     13

julia> @benchmark L1_mapreduce($a, $b)
BenchmarkTools.Trial:
  memory estimate:  8.05 KiB
  allocs estimate:  5
  --------------
  minimum time:     1.920 μs (0.00% GC)
  median time:      2.183 μs (0.00% GC)
  mean time:        2.802 μs (5.08% GC)
  maximum time:     1.454 ms (97.79% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Why does the `mapreduce` version allocate so much? I thought it would be more performant.

@mauro3, on Win10 with Julia 1.6.0, your LoopVectorization `@avx` solution seems to be the fastest on my machine, about ~30% faster than `@simd`. For the OP's example of vectors with 1000 elements it `@btime`s at 70 ns vs 97 ns.


I am not sure, but `mapfoldl` seems to be OK:

```julia
julia> L1_mapfoldl(a, b) = mapfoldl(((x, y),) -> abs(x - y), +, zip(a, b))
L1_mapfoldl (generic function with 1 method)

julia> @benchmark L1_mapfoldl($a, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.179 μs (0.00% GC)
  median time:      1.295 μs (0.00% GC)
  mean time:        1.317 μs (0.00% GC)
  maximum time:     6.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Sorry if I have missed something above, but could someone shed some light on why the OP’s function `L1_vec` seems to be faster than the plain loop versions (without `@avx` or `@simd`)?

It’s essentially a placeholder implementation; #38558 is the issue tracking this:

```julia
julia> @less mapreduce((x, y) -> abs(x - y), +, x1, x2)
mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)
```

Do you mean something like this?

```julia
function L1_loop(a, b)
    @assert length(a) == length(b)
    s = 0.0
    for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
```

On my computer the simple loop is usually faster than `L1_vec`. If `L1_vec` sometimes runs faster, I guess it may be because the `sum` function has some optimizations going on under the hood.
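One of `sum`'s optimizations can be borrowed by hand: reducing over a lazy broadcast expression avoids the temporary array while still going through `sum`'s tuned reduction machinery. This sketch relies on `Base.Broadcast` internals (`broadcasted` and `instantiate`), which work on recent Julia versions but are not a documented public API, and the name `L1_lazy` is mine:

```julia
# Sketch: sum over a lazy Broadcasted object instead of a materialized array.
# Broadcast.broadcasted builds the unevaluated element-wise expression;
# Broadcast.instantiate attaches axes so sum can reduce it without allocating.
L1_lazy(a, b) =
    sum(Broadcast.instantiate(Broadcast.broadcasted((x, y) -> abs(x - y), a, b)))
```

This gives the same result as `sum(abs.(a .- b))` but with zero allocations, since the intermediate vector is never materialized.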

@ismedina, yes, it runs in 1.0 μs (0 allocations: 0 bytes), while the vectorized version allocates but runs in 586 ns (1 allocation: 7.94 KiB).
Same result for the loop variant below:

```julia
function L1_loop2(a, b)
    s = zero(eltype(a))
    for (a, b) in zip(a, b)
        s += abs(a - b)
    end
    return s
end
```

I also find some speedup with `@avx` (especially with big vectors, up to 30%), thank you a lot! I think it needs benchmarking inside the specific algorithm to see if it can really benefit from this speedup, but if so, including the additional dependency might be justified.
