I need to code a function that computes the L1 distance between two vectors. It doesn't need to be super precise (it serves as an error estimate in the stopping criterion of an algorithm), but it needs to be fast since it will be invoked quite frequently. I guess the most straightforward way to write this in Julia would be:
function L1_vec(a, b)
    sum(abs.(a .- b))
end
But of course this allocates a vector. The results of a simple benchmark are:
julia> N = 1000; a = rand(N); b = rand(N); @benchmark L1_vec($a, $b)
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 1
--------------
minimum time: 856.400 ns (0.00% GC)
median time: 1.004 μs (0.00% GC)
mean time: 1.412 μs (18.29% GC)
maximum time: 62.230 μs (91.36% GC)
--------------
samples: 10000
evals/sample: 35
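To see where that single allocation comes from, here is a minimal check using only Base's @allocated macro (function name as above, everything else my own):

```julia
L1_vec(a, b) = sum(abs.(a .- b))

a = rand(1000)
b = rand(1000)

L1_vec(a, b)             # warm up so compilation is not counted
@allocated L1_vec(a, b)  # ~8 KiB: abs.(a .- b) fuses into one temporary vector
```

The two dots fuse into a single broadcast, so only one temporary of length N is created, which matches the "allocs estimate: 1" above.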
My second try was to just iterate over the vectors, adding each absolute difference to an accumulator and applying @simd vectorization. All in all I ended up with the following function:
function L1_non_alloc(a, b)
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += abs(a[i] - b[i])
    end
    return s
end
This version gets rid of the unwanted allocation and is around 10x faster than the previous one in the same simple benchmark.
Is there a more concise/Julian way to write something similar to L1_non_alloc? I am thinking of something similar to map, but that only keeps the sum of applying some function f (in this case (x, y) -> abs(x - y)) to the input arrays.
Is there any obvious improvement to be made to L1_non_alloc to make it more performant?
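For the map-like pattern asked about, here is a minimal sketch using only Base (function names are my own):

```julia
# Generator version: sums a lazy generator, so no intermediate vector is allocated.
l1_gen(a, b) = sum(abs(x - y) for (x, y) in zip(a, b))

# mapreduce version: maps (x, y) -> abs(x - y) over both arrays and reduces with +.
l1_mapreduce(a, b) = mapreduce((x, y) -> abs(x - y), +, a, b)

a = [1.0, 2.0, 3.0]
b = [0.5, 2.5, 3.0]
l1_gen(a, b)        # 1.0
l1_mapreduce(a, b)  # 1.0
```

Both avoid the temporary vector, though they may not vectorize as aggressively as the explicit @simd loop (a point that comes up further down the thread).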
function l1(x1, x2)
    res = zero(eltype(x1))
    @inbounds @simd for i in eachindex(x1)
        res += abs(x1[i] - x2[i])
    end
    return res
end
julia> @btime l1($x1, $x2)
16.765 ns (0 allocations: 0 bytes)
If you have short enough vectors, you can use immutable structures like an SArray or just plain tuples.
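As a sketch of the plain-tuple suggestion (names mine): broadcasting over tuples yields another tuple, so for short fixed-length data nothing is heap-allocated:

```julia
# For short, fixed-length data, plain tuples avoid heap allocation entirely:
# abs.(a .- b) on tuples produces a tuple, which lives on the stack.
l1_tuple(a::Tuple, b::Tuple) = sum(abs.(a .- b))

a = (1.0, 2.0, 3.0)
b = (0.5, 2.5, 3.0)
l1_tuple(a, b)  # 1.0
```

An SVector from the StaticArrays package works the same way with a more array-like interface, at the cost of an extra dependency.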
For me there is a 10x difference between the explicit loop and the generator version. That's a bit more than I would have hoped for. Is it SIMD vs non-SIMD? It's a bit unfortunate that the difference is so great, as I had the idea that there was little cost to the more abstract constructs.
@mauro3, on Win10 with Julia 1.6.0, your LoopVectorization @avx solution seems to be the fastest on my machine, about 30% faster than @simd. For the OP's example of vectors with 1000 elements it btimes at 70 ns vs 97 ns.
Sorry if I have missed something above, but could someone shed some light on why the OP's function L1_vec() seems to be faster than the plain loop versions (without @avx or @simd)?
function L1_loop(a, b)
    @assert length(a) == length(b)
    s = 0.0
    for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end
On my computer the simple loop is usually faster than L1_vec. If L1_vec sometimes goes faster, I guess it could be because the function sum has some optimizations (such as pairwise summation) going on under the hood.
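As a side note, here is a self-contained check (definitions repeated so it runs standalone) that the sum-based and loop-based variants agree. Since sum reduces pairwise while the loop accumulates left to right, the summation orders differ, so the comparison should use isapprox rather than ==:

```julia
L1_vec(a, b) = sum(abs.(a .- b))

function L1_loop(a, b)
    @assert length(a) == length(b)
    s = 0.0
    for i in eachindex(a)
        s += abs(a[i] - b[i])
    end
    return s
end

a = rand(1000)
b = rand(1000)

# Different summation orders give slightly different roundoff; compare approximately.
isapprox(L1_vec(a, b), L1_loop(a, b))  # true
```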
@ismedina, yes, it runs in 1.0 μs (0 allocations: 0 bytes), while the vectorized version allocates but runs in 586 ns (1 allocation: 7.94 KiB).
Same result for the loop variant below:
function L1_loop2(a, b)
    s = zero(eltype(a))
    for (ai, bi) in zip(a, b)
        s += abs(ai - bi)
    end
    return s
end
I also see some speedup with @avx (especially with big vectors, up to 30%), thanks a lot! I think it needs benchmarking inside the specific algorithm to see whether it really benefits from this speedup, but if so, including the additional dependency might be justified.