Custom for-loop seems 3x faster than LinearAlgebra’s norm…

julia> using LinearAlgebra, BenchmarkTools
julia> A = rand(5,5);
julia> function foo1(A)
x = zero(eltype(A))
for v in A
x += v * v
end
sqrt(x)
end
foo1 (generic function with 1 method)
julia> @btime norm($A)
55.623 ns (0 allocations: 0 bytes)
3.349182349398459
julia> @btime foo1($A)
18.336 ns (0 allocations: 0 bytes)
3.349182349398459

Hmm. That is almost to be expected, seeing that lots of basic math operations are implemented in a very slow way to squeeze out a little more precision (see: min and max for NaN-handling, abs for complex numbers, the crazyness that is complex division). While I personally hate this fact, it has a certain consistency.

More annoyingly f(x)=@fastmath norm(x) refuses to do the right thing.

If one were to fix this, i.e. write a fastmath-norm, how would one do this? (the problem preventing me from writing short PR is that this needs to live in LinearAlgebra and cannot live in fastmath).

It appears that much of the extra time supports numeric-type generic norm()s.
So, it may be that better dispatch direction could remove some of that for the most common numerical types (Float64, Float32 and Int32, Int64 which are converted to floating point when calculating norms).

No, the main cost is that the data is iterated over twice (once to check whether to rescale, and once to compute the sum). Consider the whacky modification:

julia> using BenchmarkTools, LinearAlgebra
julia> A=rand(5,5);
julia> function foo1(A)
x = zero(eltype(A))
for v in A
x += v * v
end
sqrt(x)
end
julia> function generic_norm2_(x)
T = promote_type(Float64, eltype(x))
sum_ = zero(T)
max_ = T(-Inf)
for el in x
sum_ += LinearAlgebra.norm_sqr(el)
max_ = ifelse(max_ > norm(el), max_, norm(el))
end
iszero(max_) && return zero(eltype(x))
( iszero(max_*max_) || !isfinite(max_*max_*length(x)) ) && return norm(x)
return convert(eltype(x), sqrt(sum_))
end
julia> @btime norm($A);
71.419 ns (0 allocations: 0 bytes)
julia> @btime foo1($A);
23.145 ns (0 allocations: 0 bytes)
julia> @btime generic_norm2_($A);
30.524 ns (0 allocations: 0 bytes)
julia> under_=[1e-300, 0.0]; over_=[1e+300, 0.0];
julia> norm(under_), foo1(under_), generic_norm2_(under_)
(1.0e-300, 0.0, 1.0e-300)
julia> norm(over_), foo1(over_), generic_norm2_(over_)
(1.0e300, Inf, 1.0e300)

So the loops should probably be fused, since the hypot-case is very unlikely.

Ahh… It does make sense to test after the calculation to determine if the tight loop version of norm failed. As you say, this is a rare occurrence, usually. Even with an application that overflows the tight loop norm half of the time, the savings will out.

Is the same thing happening with the current implementation of dot?

My above whacky norm variant does not emit good code for large vectors; I’ll see whether I can get a PR that speeds this up.

Nevertheless, I personally hate it and think that stuff like norm should be fast and simple, with predictable quirks like norm([0.0,1e300])==Inf) by default. It should be the user’s job to decide whether to call hypot instead; and the stdlib authors are the ones whose job it should be to figure out a way to coax the compiler into emitting optimal code. I certainly cannot say whether a loop will compile well without reading @code_native.

So no reason to make a PR, since this affects tiny vectors only, and people caring about speed on tiny vectors presumably use SVector already.

That being said, the obvious fast variant still outperforms the BLAS on my system (which presumably is a misconfiguration / build issue).

julia> A=rand(10_000);
julia> function foo2(A)
x = zero(eltype(A))
@inbounds @simd for v in A
@fastmath x += v * v
end
@fastmath sqrt(x)
end
julia> @btime norm($A)
4.643 μs (0 allocations: 0 bytes)
57.51021904090062
julia> @btime foo2($A)
1.376 μs (0 allocations: 0 bytes)
57.51021904090062

I read this and thought “Of course norm(A) is slower than computing the sqrt of the sum of squares of the elements. norm(A) is the matrix 2-norm, which requires an SVD.”

But it isn’t! For norm(A) Julia computes the Frobenius norm, the sqrt sum of squares. Seems like a strange design choice to me, but maybe on the other handit can be counted as a case of choosing a more efficient but arguably less useful algorithm.

This decision was made pretty recently (for 0.7/1.0), as most people use this as a way to check if matrices are approximately equal, in which case it is usually sufficient and much faster. The operator norm is now called opnorm.

So it seems like the first rule of getting a fast norm, is not to use norm. Curiously enough while sqrt(sum(abs2,A)) is normally faster, right at 32 elements the timing for norm plunges and beats the sum for a little while. Thankfully rolling your own takes less macro decoration, and we have two good alternatives to play with: @simd and @avx.

jnorm(A) = norm(A)
jsumnorm(A) = sqrt( sum( abs2, A ))
function jsimdnorm(A)
∑= zero(eltype(A))
@simd for i ∈ A
∑ += i * i
end
sqrt( ∑ )
end
function javxnorm(A)
∑= zero(eltype(A))
@avx for i ∈ eachindex(A)
∑ += A[i] * A[i]
end
sqrt( ∑ )
end

Running a simple benchmark for 128 and 1024 elements on a lower power Ryzen 5 (julia 1.5.3) gets these times:

The reason that norm is slow might be that can be significantly more accurate than sqrt(sum(abs2, x)) when elements of x span many orders of magnitude.

I get that, as it was mentioned before. Just pointing out that the cost of that optional accuracy (even in cases where you don’t need it or want it), can be upwards of 4-6X these days.

If norm is that much slower, it would be interesting to see if we could get better performance with extended precision. All it would require is doing the accumulation (not the squaring) in twice precision, which would be a change from 2 operations per element (technically 1 with fma) to 5 (4 with fma). For 32 bit and smaller datatypes, it could probably done even more cheaply by just using single operations with the wider type.

Good reminder. The code for the norm using this is just jcompnorm(A) = sqrt(dot_oro(A,A)) which should be really fast. (package dependencies prevent me from testing this easily though).