Performance of norm function

norm doesn’t overflow or underflow; it uses a rescaling algorithm to avoid that:

julia> norm([1e-300, 0])
1.0e-300

julia> foo1([1e-300, 0])
0.0

julia> norm([1e300, 0])
1.0e300

julia> foo1([1e300, 0])
Inf

It appears that much of the extra time goes to supporting norm() generically over arbitrary numeric types.
So it may be that better dispatch could remove some of that cost for the most common numeric types (Float64, Float32, and Int32/Int64, which are converted to floating point when computing norms).

No, the main cost is that the data is iterated over twice (once to check whether to rescale, and once to compute the sum). Consider the whacky modification:

julia> using BenchmarkTools, LinearAlgebra
julia> A=rand(5,5);
julia> function foo1(A)
         x = zero(eltype(A))
         for v in A
           x += v * v
         end
         sqrt(x)
       end
foo1 (generic function with 1 method)
julia> function generic_norm2_(x)
         T = promote_type(Float64, eltype(x))
         sum_ = zero(T)
         max_ = T(-Inf)
         for el in x
           sum_ += LinearAlgebra.norm_sqr(el)
           max_ = ifelse(max_ > norm(el), max_, norm(el))
         end
         iszero(max_) && return zero(eltype(x))
         ( iszero(max_*max_) || !isfinite(max_*max_*length(x)) ) && return norm(x)
         return convert(eltype(x), sqrt(sum_))
       end
generic_norm2_ (generic function with 1 method)

julia> @btime norm($A);
  71.419 ns (0 allocations: 0 bytes)

julia> @btime foo1($A);
  23.145 ns (0 allocations: 0 bytes)

julia> @btime generic_norm2_($A);
  30.524 ns (0 allocations: 0 bytes)

julia> under_=[1e-300, 0.0]; over_=[1e+300, 0.0];

julia> norm(under_), foo1(under_), generic_norm2_(under_)
(1.0e-300, 0.0, 1.0e-300)

julia> norm(over_), foo1(over_), generic_norm2_(over_)
(1.0e300, Inf, 1.0e300)

So the loops should probably be fused, since the hypot-case is very unlikely.


Ahh… It does make sense to test after the calculation to determine whether the tight-loop version of norm failed. As you say, this is usually a rare occurrence. Even for an application that overflows the tight-loop norm half of the time, the savings will win out.
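A rough sketch of that fused strategy (hypothetical name `fused_norm`; one fast pass, falling back to the careful stdlib `norm` only when the naive sum overflowed, or underflowed to zero for a nonzero input):

```julia
using LinearAlgebra

# Sketch: fused fast loop with a post-hoc validity check.
function fused_norm(A)
    x = zero(eltype(A))
    @simd for v in A
        x += v * v
    end
    # Fall back to the rescaling stdlib implementation only in the rare
    # case that the fast sum overflowed, or underflowed to exactly zero.
    if !isfinite(x) || (iszero(x) && !iszero(norm(A, Inf)))
        return norm(A)   # slow path: rescales to avoid overflow/underflow
    end
    return sqrt(x)
end
```

Note the fallback check is deliberately cheap; gradual underflow can still cost a little accuracy without triggering it, which is the usual trade-off of this approach.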

Is the same thing happening with the current implementation of dot?

Yeah it’s hitting the generic norm because it’s a small matrix, instead of using BLAS:

julia> mynorm2(A) = sqrt(sum(abs2,A))
mynorm2 (generic function with 1 method)

julia> @btime mynorm2($A)
  16.075 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.generic_norm2($A)
  44.426 ns (0 allocations: 0 bytes)

julia> @btime LinearAlgebra.BLAS.nrm2($A)
  11.106 ns (0 allocations: 0 bytes)

The length cutoff, defined here, is 32. Maybe it should be lowered.

A function dedicated for dense matrices could also be made.



My above whacky norm variant does not emit good code for large vectors; I’ll see whether I can get a PR that speeds this up.

Nevertheless, I personally hate it and think that stuff like norm should be fast and simple, with predictable quirks like norm([0.0, 1e300]) == Inf by default. It should be the user’s job to decide whether to call hypot instead; and the stdlib authors are the ones whose job it should be to figure out a way to coax the compiler into emitting optimal code. I certainly cannot say whether a loop will compile well without reading @code_native.

Hah, thanks!

So no reason to make a PR, since this affects tiny vectors only, and people caring about speed on tiny vectors presumably use SVector already.

That being said, the obvious fast variant still outperforms the BLAS on my system (which presumably is a misconfiguration / build issue).

julia> A = rand(10_000);

julia> function foo2(A)
         x = zero(eltype(A))
         @inbounds @simd for v in A
           @fastmath x += v * v
         end
         @fastmath sqrt(x)
       end
foo2 (generic function with 1 method)
julia> @btime norm($A)
  4.643 μs (0 allocations: 0 bytes)

julia> @btime foo2($A)
  1.376 μs (0 allocations: 0 bytes)

I read this and thought “Of course norm(A) is slower than computing the sqrt of the sum of squares of the elements. norm(A) is the matrix 2-norm, which requires an SVD.”

But it isn’t! For norm(A) Julia computes the Frobenius norm, the sqrt of the sum of squares. Seems like a strange design choice to me, but on the other hand it can be counted as a case of choosing a more efficient but arguably less useful algorithm.

This decision was made pretty recently (for 0.7/1.0), as most people use this as a way to check if matrices are approximately equal, in which case it is usually sufficient and much faster. The operator norm is now called opnorm.
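To illustrate the distinction (a small sketch; `M` is just an example matrix), the Frobenius norm treats the matrix as a flat vector, while `opnorm` computes the operator 2-norm, i.e. the largest singular value:

```julia
using LinearAlgebra

M = [3.0 0.0;
     0.0 4.0]

norm(M)       # Frobenius norm: sqrt(3^2 + 4^2) == 5.0
opnorm(M, 2)  # operator 2-norm: largest singular value == 4.0
```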


Yeah @simd really helps here compared to BLAS (not sure why BLAS isn’t using simd though).

Note that you don’t need this whole function to get that kind of speed, the above definition mynorm2(A) = sqrt(sum(abs2,A)) gets you most of the way:

julia> @btime mynorm2($A)          
  1.812 μs (0 allocations: 0 bytes)
julia> @btime foo2($A)             
  1.432 μs (0 allocations: 0 bytes)

(OK I suppose that’s a 20% difference…)


So it seems like the first rule of getting a fast norm is not to use norm. Curiously enough, while sqrt(sum(abs2, A)) is normally faster, right at 32 elements the timing for norm plunges and beats the sum for a little while. Thankfully rolling your own takes only a little macro decoration, and we have two good alternatives to play with: @simd and @avx.

using LinearAlgebra, LoopVectorization

jnorm(A) = norm(A)
jsumnorm(A) = sqrt( sum( abs2, A ))

function jsimdnorm(A)
	∑ = zero(eltype(A))
	@simd for i ∈ A
		∑ += i * i
	end
	sqrt( ∑ )
end

function javxnorm(A)
	∑ = zero(eltype(A))
	@avx for i ∈ eachindex(A)
		∑ += A[i] * A[i]
	end
	sqrt( ∑ )
end

Running a simple benchmark for 128 and 1024 elements on a lower power Ryzen 5 (julia 1.5.3) gets these times:

             128         1024
jnorm      65.600 ns   513.800 ns
jsumnorm   38.200 ns   125.200 ns
jsimdnorm  22.200 ns   108.200 ns
javxnorm   16.400 ns    77.400 ns

You can see my full notebook here. My apologies that the calculations for the final chart take several minutes to collect.


The reason that norm is slow might be that it can be significantly more accurate than sqrt(sum(abs2, x)) when the elements of x span many orders of magnitude.


I get that, as it was mentioned before. Just pointing out that the cost of that optional accuracy (even in cases where you don’t need it or want it), can be upwards of 4-6X these days.

If norm is that much slower, it would be interesting to see whether we could get better performance with extended precision. All it would require is doing the accumulation (not the squaring) in twice the working precision, which would change the cost from 2 operations per element (technically 1 with fma) to 5 (4 with fma). For 32-bit and smaller datatypes, it could probably be done even more cheaply by just accumulating in the wider type directly.
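As a rough illustration of accumulating with extra effective precision, here is a Kahan-compensated sum of squares (a sketch with a hypothetical name, not the exact fma-based two-sum scheme described above):

```julia
# Sketch: the compensation term c recovers the low-order bits lost when
# adding each square to the running sum, so the accumulation effectively
# carries extra precision without changing the element type.
function comp_norm(A)
    s = zero(eltype(A))   # running sum of squares
    c = zero(eltype(A))   # compensation (accumulated rounding error)
    for v in A
        y = v * v - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    return sqrt(s)
end
```

This guards the accuracy of the accumulation only; as noted downthread, it does nothing about overflow or underflow of the individual squares.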


Good reminder. The code for the norm using this is just jcompnorm(A) = sqrt(dot_oro(A,A)) which should be really fast. (package dependencies prevent me from testing this easily though).


Doing the sum in extended precision doesn’t save sqrt(x^2 + y^2) from overflow or underflow of the squares. E.g.

julia> using LinearAlgebra

julia> x=3e-200; y = 4e-200;

julia> norm([x; y])
5.0e-200

julia> sqrt(x^2 + y^2)
0.0

julia> sqrt(big(x^2) + big(y^2))
0.0

You need to convert to big before squaring.


To calculate that correctly, initialize x and y as BigFloats using strings.

julia> x = BigFloat("3e-200"); y = BigFloat("4e-200");
julia> sqrt(x^2 + y^2) 

otherwise the erosion has already occurred when the conversion to BigFloat happens.


Yes, I was responding to “All it would require is doing the accumulation (not the squaring) in twice precision.” If efficiency is the concern, it is better to use the n-d generalization of the hypot algorithm |x| \sqrt{1 + (y/x)^2} in the original precision, where x is the element of largest magnitude in the vector being summed.
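A sketch of that scaled algorithm (hypothetical `scaled_norm`; divide by the element of largest magnitude before squaring, so the squares can neither overflow nor underflow to zero):

```julia
# Sketch of the n-d hypot-style norm: every (v/m)^2 lies in [0, 1],
# so the accumulation stays in range; the scale m is restored at the end.
function scaled_norm(A)
    m = maximum(abs, A)
    (iszero(m) || !isfinite(m)) && return m   # all-zero input, or Inf/NaN present
    s = zero(typeof(m))
    for v in A
        s += (v / m)^2
    end
    return m * sqrt(s)
end
```

The cost is the division per element (and the second pass to find the maximum), which is exactly the overhead the thread has been measuring.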