Performance of norm function

Custom for-loop seems 3x faster than LinearAlgebra’s norm…

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(5,5);

julia> function foo1(A) 
         x = zero(eltype(A))
         for v in A
           x += v * v
         end
         sqrt(x)
       end
foo1 (generic function with 1 method)

julia> @btime norm($A)
  55.623 ns (0 allocations: 0 bytes)
3.349182349398459

julia> @btime foo1($A)
  18.336 ns (0 allocations: 0 bytes)
3.349182349398459

Hmm. That is almost to be expected, seeing that lots of basic math operations are implemented in a very slow way to squeeze out a little more precision (see: min and max for NaN-handling, abs for complex numbers, the craziness that is complex division). While I personally hate this fact, it has a certain consistency.

More annoyingly, f(x) = @fastmath norm(x) refuses to do the right thing.

If one were to fix this, i.e. write a fastmath norm, how would one do it? (The problem preventing me from writing a short PR is that this needs to live in LinearAlgebra and cannot live in fastmath.)


norm doesn’t overflow/underflow; it uses this algorithm: Pythagorean addition - Wikipedia

julia> norm([1e-300, 0])
1.0e-300

julia> foo1([1e-300, 0])
0.0

julia> norm([1e300, 0])
1.0e300

julia> foo1([1e300, 0])
Inf
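
The rough idea is to factor the largest magnitude out of the sum before squaring, so the squares stay in range. A minimal sketch of the principle (scalednorm is a made-up name here; the actual LinearAlgebra implementation is more careful and only rescales when it has to):

function scalednorm(A)
    m = maximum(abs, A)                  # scale by the largest magnitude
    (iszero(m) || !isfinite(m)) && return m
    s = zero(m)
    for v in A
        t = v / m                        # |t| ≤ 1, so squaring cannot overflow;
        s += t * t                       # tiny terms that underflow are negligible anyway
    end
    return m * sqrt(s)
end

julia> scalednorm([1e-300, 0.0]), scalednorm([1e300, 0.0])
(1.0e-300, 1.0e300)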

It appears that much of the extra time goes to supporting numeric-type-generic norm()s.
So it may be that better dispatch could remove some of that overhead for the most common numeric types (Float64, Float32, and Int32/Int64, which are converted to floating point when computing norms).

No, the main cost is that the data is iterated over twice (once to check whether to rescale, and once to compute the sum). Consider the whacky modification:

julia> using BenchmarkTools, LinearAlgebra
julia> A=rand(5,5);
julia> function foo1(A) 
         x = zero(eltype(A))
         for v in A
           x += v * v
         end
         sqrt(x)
       end

julia> function generic_norm2_(x)
           T = promote_type(Float64, eltype(x))
           sum_ = zero(T)
           max_ = T(-Inf)
           for el in x
               sum_ += LinearAlgebra.norm_sqr(el)
               max_ = ifelse(max_ > norm(el), max_, norm(el))
           end
           iszero(max_) && return zero(eltype(x))
           (iszero(max_*max_) || !isfinite(max_*max_*length(x))) && return norm(x)
           return convert(eltype(x), sqrt(sum_))
       end

julia> @btime norm($A);
  71.419 ns (0 allocations: 0 bytes)

julia> @btime foo1($A);
  23.145 ns (0 allocations: 0 bytes)

julia> @btime generic_norm2_($A);
  30.524 ns (0 allocations: 0 bytes)

julia> under_=[1e-300, 0.0]; over_=[1e+300, 0.0];

julia> norm(under_), foo1(under_), generic_norm2_(under_)
(1.0e-300, 0.0, 1.0e-300)

julia> norm(over_), foo1(over_), generic_norm2_(over_)
(1.0e300, Inf, 1.0e300)

So the loops should probably be fused, since the hypot-case is very unlikely.


Ahh… It does make sense to test after the calculation to determine whether the tight-loop version of norm failed. As you say, this is usually a rare occurrence. Even for an application that overflows the tight-loop norm half of the time, the savings still win out.
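
Something along those lines might look like this sketch (trynorm is just a made-up name, not the stdlib code; the check only catches outright overflow, or complete underflow of a nonzero input, not gradual precision loss):

using LinearAlgebra

function trynorm(A)
    s = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        s += A[i] * A[i]
    end
    # fall back to the careful, rescaling norm only when the naive sum failed
    if !isfinite(s) || (iszero(s) && !iszero(norm(A, Inf)))
        return norm(A)       # rare slow path
    end
    return sqrt(s)
end

A real fix would also need to detect the partial-underflow case, e.g. by tracking the largest element in the same loop, as generic_norm2_ above does.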

Is the same thing happening with the current implementation of dot?

Yeah, it’s hitting the generic norm instead of BLAS because it’s a small matrix:

julia> mynorm2(A) = sqrt(sum(abs2,A))
mynorm2 (generic function with 1 method)

julia> @btime mynorm2($A)
  16.075 ns (0 allocations: 0 bytes)
3.309649744948491

julia> @btime LinearAlgebra.generic_norm2($A)
  44.426 ns (0 allocations: 0 bytes)
3.309649744948491

julia> @btime LinearAlgebra.BLAS.nrm2($A)
  11.106 ns (0 allocations: 0 bytes)
3.309649744948491                  

The length cutoff, defined here, is 32. Maybe it should be lowered.

A function dedicated for dense matrices could also be made.
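
If someone wants to experiment with the cutoff, a rough sweep along these lines shows where the generic path stops paying off on a given machine (crossover_sweep and the sizes are just for illustration; generic_norm2 and the one-argument BLAS.nrm2 are internals and may change between versions):

using BenchmarkTools, LinearAlgebra

function crossover_sweep(lengths = (8, 16, 31, 32, 64, 128))
    for n in lengths
        x = rand(n)
        t_gen  = @belapsed LinearAlgebra.generic_norm2($x)   # generic fallback
        t_blas = @belapsed LinearAlgebra.BLAS.nrm2($x)       # BLAS call
        println(n, ": ", t_gen, " s (generic)  vs  ", t_blas, " s (BLAS)")
    end
end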


No.

My above whacky norm variant does not emit good code for large vectors; I’ll see whether I can get a PR that speeds this up.

Nevertheless, I personally hate it and think that stuff like norm should be fast and simple, with predictable quirks like norm([0.0, 1e300]) == Inf by default. It should be the user’s job to decide whether to call hypot instead; and the stdlib authors are the ones whose job it should be to figure out a way to coax the compiler into emitting optimal code. I certainly cannot say whether a loop will compile well without reading @code_native.

Hah, thanks!

So no reason to make a PR, since this affects tiny vectors only, and people caring about speed on tiny vectors presumably use SVector already.

That being said, the obvious fast variant still outperforms the BLAS on my system (which presumably is a misconfiguration / build issue).

julia> A = rand(10_000);

julia> function foo2(A)
           x = zero(eltype(A))
           @inbounds @simd for v in A
               @fastmath x += v * v
           end
           @fastmath sqrt(x)
       end
julia> @btime norm($A)
  4.643 μs (0 allocations: 0 bytes)
57.51021904090062

julia> @btime foo2($A)
  1.376 μs (0 allocations: 0 bytes)
57.51021904090062

I read this and thought “Of course norm(A) is slower than computing the sqrt of the sum of squares of the elements. norm(A) is the matrix 2-norm, which requires an SVD.”

But it isn’t! For norm(A), Julia computes the Frobenius norm, the sqrt of the sum of squares of the entries. Seems like a strange design choice to me, but maybe on the other hand it can be counted as a case of choosing a more efficient but arguably less useful algorithm.

This decision was made pretty recently (for 0.7/1.0), as most people use this as a way to check if matrices are approximately equal, in which case it is usually sufficient and much faster. The operator norm is now called opnorm.
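
A quick way to see the distinction (the matrix is random, but both identities hold):

julia> using LinearAlgebra

julia> A = rand(4, 4);

julia> norm(A) ≈ sqrt(sum(abs2, A))      # Frobenius norm of the entries
true

julia> opnorm(A) ≈ maximum(svdvals(A))   # operator 2-norm: largest singular value
true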


Yeah @simd really helps here compared to BLAS (not sure why BLAS isn’t using simd though).

Note that you don’t need this whole function to get that kind of speed, the above definition mynorm2(A) = sqrt(sum(abs2,A)) gets you most of the way:

julia> @btime mynorm2($A)          
  1.812 μs (0 allocations: 0 bytes)
57.68886221802906                  
                                   
julia> @btime foo2($A)             
  1.432 μs (0 allocations: 0 bytes)
57.68886221802905                  

(OK I suppose that’s a 20% difference…)


So it seems like the first rule of getting a fast norm is not to use norm. Curiously enough, while sqrt(sum(abs2, A)) is normally faster, right at 32 elements the timing for norm plunges and beats the sum for a little while. Thankfully, rolling your own takes only a little macro decoration, and we have two good alternatives to play with: @simd and @avx.

using LinearAlgebra, LoopVectorization  # norm comes from LinearAlgebra, @avx from LoopVectorization

jnorm(A) = norm(A)
jsumnorm(A) = sqrt(sum(abs2, A))

function jsimdnorm(A)
    ∑ = zero(eltype(A))
    @simd for i ∈ A
        ∑ += i * i
    end
    sqrt(∑)
end

function javxnorm(A)
    ∑ = zero(eltype(A))
    @avx for i ∈ eachindex(A)
        ∑ += A[i] * A[i]
    end
    sqrt(∑)
end

Running a simple benchmark for 128 and 1024 elements on a lower-power Ryzen 5 (Julia 1.5.3) gives these times:

             128         1024
jnorm      65.600 ns   513.800 ns
jsumnorm   38.200 ns   125.200 ns
jsimdnorm  22.200 ns   108.200 ns
javxnorm   16.400 ns    77.400 ns

You can see my full notebook here. My apologies that the calculations for the final chart take several minutes to collect.


The reason that norm is slow might be that it can be significantly more accurate than sqrt(sum(abs2, x)) when the elements of x span many orders of magnitude.


I get that, as it was mentioned before. I’m just pointing out that the cost of that optional accuracy (even in cases where you don’t need or want it) can be upwards of 4-6x these days.

If norm is that much slower, it would be interesting to see if we could get better performance with extended precision. All it would require is doing the accumulation (not the squaring) in twice the precision, which would be a change from 2 operations per element (technically 1 with fma) to 5 (4 with fma). For 32-bit and smaller datatypes, it could probably be done even more cheaply by just using single operations with the wider type.
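
For the 32-bit case, the "single operations with the wider type" version is easy to sketch (widenorm is a made-up name, not an actual proposal):

# Accumulate a Float32 norm in a Float64 accumulator.
function widenorm(A::AbstractArray{Float32})
    s = 0.0                              # Float64 accumulator
    @inbounds @simd for i in eachindex(A)
        v = Float64(A[i])
        s += v * v
    end
    return Float32(sqrt(s))
end

For Float64 inputs there is no cheap wider hardware type, so accumulating in twice the precision would have to go through compensated (error-free transformation) tricks instead.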


Good reminder. The code for the norm using this is just jcompnorm(A) = sqrt(dot_oro(A, A)), which should be really fast (package dependencies prevent me from testing this easily, though).


Doing the sum in extended precision doesn’t save sqrt(x^2 + y^2) from overflow or underflow of the squares. E.g.

julia> using LinearAlgebra

julia> x=3e-200; y = 4e-200;

julia> norm([x; y])
5.0e-200

julia> sqrt(x^2 + y^2)
0.0

julia> sqrt(big(x^2) + big(y^2))
0.0

You need to convert to big before squaring.
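
For example, continuing the session above, promoting before squaring recovers the answer (up to rounding):

julia> Float64(sqrt(big(x)^2 + big(y)^2)) ≈ 5.0e-200
true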
