# Computing the L1 norm between vectors: the Julian way

I need to write a function that computes the L1 distance between two vectors. It doesn't need to be super precise (it only serves as an error estimate in the stopping criterion of an algorithm), but it needs to be fast, since it will be invoked quite frequently. I guess the most straightforward way to write this in Julia would be:

```julia
function L1_vec(a, b)
    sum(abs.(a .- b))
end
```

But of course this allocates a vector. The results of a simple benchmark are:

```julia
julia> N = 1000; a = rand(N); b = rand(N); @benchmark L1_vec($a, $b)

BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     856.400 ns (0.00% GC)
  median time:      1.004 μs (0.00% GC)
  mean time:        1.412 μs (18.29% GC)
  maximum time:     62.230 μs (91.36% GC)
  --------------
  samples:          10000
  evals/sample:     35
```

My second try was simply to iterate over the vectors, accumulating the absolute differences in a counter, and to apply `@simd`. All in all I ended up with the following function:

```julia
function L1_non_alloc(a, b)
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
```

This version gets rid of the unwanted allocations and is around 10x faster than the previous one in the same simple benchmark:

```julia
julia> @benchmark L1_non_alloc($a, $b)

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     103.950 ns (0.00% GC)
  median time:      120.109 ns (0.00% GC)
  mean time:        127.151 ns (0.00% GC)
  maximum time:     301.482 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     900
```

My questions now are:

1. Is there a more concise/Julian way to write something similar to `L1_non_alloc`? I am thinking of something similar to `map`, but that only keeps the running sum of applying some function `f` (in this case `(x, y) -> abs(x - y)`) to the input arrays.
2. Is there any obvious improvement to be done in `L1_non_alloc` to make it more performant?

Thanks a lot!


You can write it allocation-free and reasonably concisely with a generator,

```julia
sum(abs(x - y) for (x, y) in zip(a, b))
```

but it won’t give you the best performance.
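For reference, here is the generator version wrapped in a function so it can be benchmarked and compared against the broadcast version; the name `L1_gen` is made up for this sketch:

```julia
# Generator version: allocation-free, but `sum` over a zip iterator
# reduces sequentially, so it cannot use SIMD as effectively as a raw loop.
L1_gen(a, b) = sum(abs(x - y) for (x, y) in zip(a, b))

a = rand(1000); b = rand(1000)
L1_gen(a, b) ≈ sum(abs.(a .- b))  # sanity check against the allocating version
```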


You can use the functional form of `sum` (note the argument of the anonymous function must be destructured, since `zip` yields tuples):

```julia
using BenchmarkTools

x1 = rand(100);
x2 = rand(100);

sum(((x, y),) -> abs(x - y), zip(x1, x2))

julia> @btime sum(((x, y),) -> abs(x - y), zip($x1, $x2))
  99.180 ns (0 allocations: 0 bytes)
```

You can use loops:

```julia
function l1(x1, x2)
    res = zero(eltype(x1))
    @inbounds @simd for i in eachindex(x1)
        res += abs(x1[i] - x2[i])
    end
    return res
end

julia> @btime l1($x1, $x2)
  16.765 ns (0 allocations: 0 bytes)
```

If you have short enough vectors, you can use immutable structures like `SArray` or just plain tuples:

```julia
y1 = tuple(rand(10)...)
y2 = tuple(rand(10)...)

sum(ntuple(i -> abs(y1[i] - y2[i]), length(y1)))

julia> @btime sum(ntuple(i -> abs($(Ref(y1))[][i] - $(Ref(y2))[][i]), length($y1)))
  4.170 ns (0 allocations: 0 bytes)
```

It is slightly faster than the loop version:

```julia
z1 = collect(y1)
z2 = collect(y2)

julia> @btime l1($z1, $z2)
  6.606 ns (0 allocations: 0 bytes)
```

Not competitive in terms of performance (because it allocates), but arguably concise:

```julia
julia> using LinearAlgebra

julia> norm(a - b, 1)
345.38374238811525
```
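As a sanity check (not a performance point), the 1-norm of the difference should match the sum of absolute differences computed by any of the loop versions; a minimal sketch:

```julia
using LinearAlgebra

a = rand(1000); b = rand(1000)
# norm(a - b, 1) allocates the intermediate a - b, but it is a handy
# reference implementation to validate the faster hand-written loops against.
norm(a - b, 1) ≈ sum(abs(a[i] - b[i]) for i in eachindex(a, b))
```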

Using LoopVectorization.jl, I get about a 4% speedup vs the `@simd` loop version. But it sure takes time to compile. Probably not worth it unless you use LoopVectorization anyway.

```julia
julia> using LoopVectorization

julia> function L1_avx(a, b)
           s = 0.0
           @avx for i in eachindex(a)
               s += abs(a[i] - b[i])
           end
           return s
       end
```

For me there is a 10x difference between the explicit loop and the generator version. That's a bit more than I would have hoped for. Is it SIMD vs non-SIMD? It's a bit unfortunate that the difference is so large, as I was under the impression that the more abstract constructs carried little cost.
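One way to test that hypothesis is to benchmark the identical loop with and without `@simd` (which licenses reassociation of the floating-point sum, enabling vectorization); a minimal sketch with made-up names:

```julia
function L1_plain(a, b)
    s = zero(eltype(a))
    for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])  # strictly sequential reduction
    end
    return s
end

function L1_simd(a, b)
    s = zero(eltype(a))
    @simd for i in eachindex(a, b)
        @inbounds s += abs(a[i] - b[i])  # compiler may reassociate and vectorize
    end
    return s
end

# Both compute the same value (up to floating-point reassociation);
# running @btime on each isolates the cost of the missing SIMD.
```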


What about `mapreduce`?

```julia
L1_mapreduce(a, b) = mapreduce((x, y) -> abs(x - y), +, a, b)

julia> @benchmark L1_vec($a, $b)
BenchmarkTools.Trial:
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     905.077 ns (0.00% GC)
  median time:      1.060 μs (0.00% GC)
  mean time:        1.443 μs (8.61% GC)
  maximum time:     1.264 ms (98.26% GC)
  --------------
  samples:          10000
  evals/sample:     13

julia> @benchmark L1_mapreduce($a, $b)
BenchmarkTools.Trial:
  memory estimate:  8.05 KiB
  allocs estimate:  5
  --------------
  minimum time:     1.920 μs (0.00% GC)
  median time:      2.183 μs (0.00% GC)
  mean time:        2.802 μs (5.08% GC)
  maximum time:     1.454 ms (97.79% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Why does the `mapreduce` version allocate so much? I thought it was more performant.

@mauro3, on Win10 with Julia 1.6.0, your LoopVectorization `@avx` solution seems to be the fastest, about ~30% faster than `@simd` on my machine. For the OP's example of vectors with 1000 elements it `@btime`s at 70 ns vs 97 ns.


I am not sure, but `mapfoldl` seems to be OK:

```julia
julia> L1_mapfoldl(a, b) = mapfoldl(((x, y),) -> abs(x - y), +, zip(a, b))
L1_mapfoldl (generic function with 1 method)

julia> @benchmark L1_mapfoldl($a, $b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.179 μs (0.00% GC)
  median time:      1.295 μs (0.00% GC)
  mean time:        1.317 μs (0.00% GC)
  maximum time:     6.398 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Sorry if I have missed something above, but could someone shed some light on why the OP's function `L1_vec()` seems to be faster than the plain loop versions (without `@avx` or `@simd`)?

It's essentially a placeholder implementation; #38558 is the issue about this:

```julia
julia> @less mapreduce((x, y) -> abs(x - y), +, x1, x2)
# mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
#     reduce(op, map(f, A...); kw...)
```
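Until that issue is resolved, one workaround is to `mapreduce` over the indices instead of over the two arrays, which avoids the `map(f, A...)` fallback and its temporary array; a sketch (the name `L1_idx` is made up):

```julia
# mapreduce over a single index range takes the ordinary reduction path,
# so no intermediate array of differences is materialized.
L1_idx(a, b) = mapreduce(i -> abs(a[i] - b[i]), +, eachindex(a, b))
```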

Do you mean something like this?

```julia
function L1_loop(a, b)
    @assert length(a) == length(b)
    s = 0.0
    for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
```

On my computer the simple loop is usually faster than `L1_vec`. If `L1_vec` is sometimes faster, I guess it may be because the function `sum` has some optimizations going on under the hood.
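For what it's worth, `sum` on an `Array` dispatches to Base's pairwise `mapreduce` machinery, which both improves accuracy and typically vectorizes, so it can beat a naive sequential loop; a small sketch with a made-up name:

```julia
v = rand(1000)

function sum_naive(v)
    s = zero(eltype(v))
    for x in v
        s += x  # strictly sequential: no reassociation, so no SIMD
    end
    return s
end

sum_naive(v) ≈ sum(v)  # same value up to rounding; Base.sum is usually faster
```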

@ismedina, yes, it runs in 1.0 μs (0 allocations: 0 bytes), while the vectorized version allocates but runs in 586 ns (1 allocation: 7.94 KiB).
Same result for the loop variant below:

```julia
function L1_loop2(a, b)
    s = zero(eltype(a))
    for (x, y) in zip(a, b)
        s += abs(x - y)
    end
    return s
end
```

I also find some speedup with `@avx` (especially with big vectors, up to 30%), thank you a lot! I think it needs benchmarking inside the specific algorithm to see whether it can really benefit from this speedup, but if so, including the additional dependency might be justified.