Performance of norm function

Custom for-loop seems 3x faster than LinearAlgebra’s norm…

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(5,5);

julia> function foo1(A) 
         x = zero(eltype(A))
         for v in A
           x += v * v
         end
         sqrt(x)
       end
foo1 (generic function with 1 method)

julia> @btime norm($A)
  55.623 ns (0 allocations: 0 bytes)
3.349182349398459

julia> @btime foo1($A)
  18.336 ns (0 allocations: 0 bytes)
3.349182349398459

Hmm. That is almost to be expected, seeing that lots of basic math operations are implemented in a very slow way to squeeze out a little more precision (see: min and max for NaN-handling, abs for complex numbers, the craziness that is complex division). While I personally hate this fact, it has a certain consistency.

More annoyingly, f(x) = @fastmath norm(x) refuses to do the right thing.

If one were to fix this, i.e. write a fastmath norm, how would one do it? (The problem preventing me from writing a short PR is that this needs to live in LinearAlgebra and cannot live in fastmath.)


norm doesn’t overflow/underflow; it uses this algorithm: Pythagorean addition - Wikipedia

julia> norm([1e-300, 0])
1.0e-300

julia> foo1([1e-300, 0])
0.0

julia> norm([1e300, 0])
1.0e300

julia> foo1([1e300, 0])
Inf
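
The rough idea is to factor the largest magnitude out of the sum before squaring, so the squares stay in range. A minimal sketch of the principle (scalednorm is a made-up name here; the actual LinearAlgebra implementation is more careful and only rescales when it has to):

function scalednorm(A)
    m = maximum(abs, A)                  # scale by the largest magnitude
    (iszero(m) || !isfinite(m)) && return m
    s = zero(m)
    for v in A
        t = v / m                        # |t| ≤ 1, so squaring cannot overflow;
        s += t * t                       # tiny terms that underflow are negligible anyway
    end
    return m * sqrt(s)
end

julia> scalednorm([1e-300, 0.0]), scalednorm([1e300, 0.0])
(1.0e-300, 1.0e300)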

It appears that much of the extra time goes to supporting numeric-type-generic norm()s.
So it may be that better dispatch could remove some of that overhead for the most common numeric types (Float64, Float32, and Int32/Int64, which are converted to floating point when computing norms).

No, the main cost is that the data is iterated over twice (once to check whether to rescale, and once to compute the sum). Consider the whacky modification:

julia> using BenchmarkTools, LinearAlgebra
julia> A=rand(5,5);
julia> function foo1(A) 
         x = zero(eltype(A))
         for v in A
           x += v * v
         end
         sqrt(x)
       end

julia> function generic_norm2_(x)
           T = promote_type(Float64, eltype(x))
           sum_ = zero(T)
           max_ = T(-Inf)
           for el in x
               sum_ += LinearAlgebra.norm_sqr(el)
               max_ = ifelse(max_ > norm(el), max_, norm(el))
           end
           iszero(max_) && return zero(eltype(x))
           (iszero(max_*max_) || !isfinite(max_*max_*length(x))) && return norm(x)
           return convert(eltype(x), sqrt(sum_))
       end

julia> @btime norm($A);
  71.419 ns (0 allocations: 0 bytes)

julia> @btime foo1($A);
  23.145 ns (0 allocations: 0 bytes)

julia> @btime generic_norm2_($A);
  30.524 ns (0 allocations: 0 bytes)

julia> under_=[1e-300, 0.0]; over_=[1e+300, 0.0];

julia> norm(under_), foo1(under_), generic_norm2_(under_)
(1.0e-300, 0.0, 1.0e-300)

julia> norm(over_), foo1(over_), generic_norm2_(over_)
(1.0e300, Inf, 1.0e300)

So the loops should probably be fused, since the hypot-case is very unlikely.


Ahh… It does make sense to test after the calculation to determine whether the tight-loop version of norm failed. As you say, this is usually a rare occurrence. Even for an application that overflows the tight-loop norm half of the time, the savings still win out.
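
Something along those lines might look like this sketch (trynorm is just a made-up name, not the stdlib code; the check only catches outright overflow, or complete underflow of a nonzero input, not gradual precision loss):

using LinearAlgebra

function trynorm(A)
    s = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        s += A[i] * A[i]
    end
    # fall back to the careful, rescaling norm only when the naive sum failed
    if !isfinite(s) || (iszero(s) && !iszero(norm(A, Inf)))
        return norm(A)       # rare slow path
    end
    return sqrt(s)
end

A real fix would also need to detect the partial-underflow case, e.g. by tracking the largest element in the same loop, as generic_norm2_ above does.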

Is the same thing happening with the current implementation of dot?

Yeah, it’s hitting the generic norm instead of BLAS because it’s a small matrix:

julia> mynorm2(A) = sqrt(sum(abs2,A))
mynorm2 (generic function with 1 method)

julia> @btime mynorm2($A)
  16.075 ns (0 allocations: 0 bytes)
3.309649744948491

julia> @btime LinearAlgebra.generic_norm2($A)
  44.426 ns (0 allocations: 0 bytes)
3.309649744948491

julia> @btime LinearAlgebra.BLAS.nrm2($A)
  11.106 ns (0 allocations: 0 bytes)
3.309649744948491                  

The length cutoff, defined here, is 32. Maybe it should be lowered.

A function dedicated for dense matrices could also be made.
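
If someone wants to experiment with the cutoff, a rough sweep along these lines shows where the generic path stops paying off on a given machine (crossover_sweep and the sizes are just for illustration; generic_norm2 and the one-argument BLAS.nrm2 are internals and may change between versions):

using BenchmarkTools, LinearAlgebra

function crossover_sweep(lengths = (8, 16, 31, 32, 64, 128))
    for n in lengths
        x = rand(n)
        t_gen  = @belapsed LinearAlgebra.generic_norm2($x)   # generic fallback
        t_blas = @belapsed LinearAlgebra.BLAS.nrm2($x)       # BLAS call
        println(n, ": ", t_gen, " s (generic)  vs  ", t_blas, " s (BLAS)")
    end
end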


No.

My above whacky norm variant does not emit good code for large vectors; I’ll see whether I can get a PR that speeds this up.

Nevertheless, I personally hate it and think that stuff like norm should be fast and simple, with predictable quirks like norm([0.0, 1e300]) == Inf by default. It should be the user’s job to decide whether to call hypot instead; and the stdlib authors are the ones whose job it should be to figure out a way to coax the compiler into emitting optimal code. I certainly cannot say whether a loop will compile well without reading @code_native.

Hah, thanks!

So no reason to make a PR, since this affects tiny vectors only, and people caring about speed on tiny vectors presumably use SVector already.

That being said, the obvious fast variant still outperforms the BLAS on my system (which presumably is a misconfiguration / build issue).

julia> A = rand(10_000);

julia> function foo2(A)
           x = zero(eltype(A))
           @inbounds @simd for v in A
               @fastmath x += v * v
           end
           @fastmath sqrt(x)
       end
julia> @btime norm($A)
  4.643 μs (0 allocations: 0 bytes)
57.51021904090062

julia> @btime foo2($A)
  1.376 μs (0 allocations: 0 bytes)
57.51021904090062

I read this and thought “Of course norm(A) is slower than computing the sqrt of the sum of squares of the elements. norm(A) is the matrix 2-norm, which requires an SVD.”

But it isn’t! For norm(A), Julia computes the Frobenius norm, the sqrt of the sum of squares of the entries. Seems like a strange design choice to me, but maybe on the other hand it can be counted as a case of choosing a more efficient but arguably less useful algorithm.

This decision was made pretty recently (for 0.7/1.0), as most people use this as a way to check if matrices are approximately equal, in which case it is usually sufficient and much faster. The operator norm is now called opnorm.
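
A quick way to see the distinction (the matrix is random, but both identities hold):

julia> using LinearAlgebra

julia> A = rand(4, 4);

julia> norm(A) ≈ sqrt(sum(abs2, A))      # Frobenius norm of the entries
true

julia> opnorm(A) ≈ maximum(svdvals(A))   # operator 2-norm: largest singular value
true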


Yeah @simd really helps here compared to BLAS (not sure why BLAS isn’t using simd though).

Note that you don’t need this whole function to get that kind of speed, the above definition mynorm2(A) = sqrt(sum(abs2,A)) gets you most of the way:

julia> @btime mynorm2($A)          
  1.812 μs (0 allocations: 0 bytes)
57.68886221802906                  
                                   
julia> @btime foo2($A)             
  1.432 μs (0 allocations: 0 bytes)
57.68886221802905                  

(OK I suppose that’s a 20% difference…)


So it seems like the first rule of getting a fast norm is not to use norm. Curiously enough, while sqrt(sum(abs2, A)) is normally faster, right at 32 elements the timing for norm plunges and beats the sum for a little while. Thankfully, rolling your own takes only a little macro decoration, and we have two good alternatives to play with: @simd and @avx.

using LinearAlgebra, LoopVectorization  # norm comes from LinearAlgebra, @avx from LoopVectorization

jnorm(A) = norm(A)
jsumnorm(A) = sqrt(sum(abs2, A))

function jsimdnorm(A)
    ∑ = zero(eltype(A))
    @simd for i ∈ A
        ∑ += i * i
    end
    sqrt(∑)
end

function javxnorm(A)
    ∑ = zero(eltype(A))
    @avx for i ∈ eachindex(A)
        ∑ += A[i] * A[i]
    end
    sqrt(∑)
end

Running a simple benchmark for 128 and 1024 elements on a lower-power Ryzen 5 (Julia 1.5.3) gives these times:

             128         1024
jnorm      65.600 ns   513.800 ns
jsumnorm   38.200 ns   125.200 ns
jsimdnorm  22.200 ns   108.200 ns
javxnorm   16.400 ns    77.400 ns

You can see my full notebook here. My apologies that the calculations for the final chart take several minutes to collect.


The reason that norm is slow might be that it can be significantly more accurate than sqrt(sum(abs2, x)) when the elements of x span many orders of magnitude.


I get that, as it was mentioned before. I’m just pointing out that the cost of that optional accuracy (even in cases where you don’t need or want it) can be upwards of 4-6x these days.

If norm is that much slower, it would be interesting to see if we could get better performance with extended precision. All it would require is doing the accumulation (not the squaring) in twice the precision, which would be a change from 2 operations per element (technically 1 with fma) to 5 (4 with fma). For 32-bit and smaller datatypes, it could probably be done even more cheaply by just using single operations with the wider type.
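
For the 32-bit case, the "single operations with the wider type" version is easy to sketch (widenorm is a made-up name, not an actual proposal):

# Accumulate a Float32 norm in a Float64 accumulator.
function widenorm(A::AbstractArray{Float32})
    s = 0.0                              # Float64 accumulator
    @inbounds @simd for i in eachindex(A)
        v = Float64(A[i])
        s += v * v
    end
    return Float32(sqrt(s))
end

For Float64 inputs there is no cheap wider hardware type, so accumulating in twice the precision would have to go through compensated (error-free transformation) tricks instead.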


Good reminder. The code for the norm using this is just jcompnorm(A) = sqrt(dot_oro(A, A)), which should be really fast (package dependencies prevent me from testing this easily, though).


Doing the sum in extended precision doesn’t save sqrt(x^2 + y^2) from overflow or underflow of the squares. E.g.

julia> using LinearAlgebra

julia> x=3e-200; y = 4e-200;

julia> norm([x; y])
5.0e-200

julia> sqrt(x^2 + y^2)
0.0

julia> sqrt(big(x^2) + big(y^2))
0.0

You need to convert to big before squaring.
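
For example, continuing the session above, promoting before squaring recovers the answer (up to rounding):

julia> Float64(sqrt(big(x)^2 + big(y)^2)) ≈ 5.0e-200
true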
