Here is a minimal working example:
using LoopVectorization

# Computes C .= A*B .+ bias (matrix-vector product plus bias)
function mygemmavx!(C, A, B, bias)
    @turbo for m ∈ axes(A,1)
        Cmk = bias[m]
        for k ∈ axes(A,2)
            Cmk += A[m,k] * B[k]
        end
        C[m] = Cmk
    end
    C
end
using BenchmarkTools

A1 = zeros(Float32, 32, 128)
A2 = zeros(Float32, 32, 128) .+ 1f-40  # identical shape, but every entry is 1f-40
B = rand(Float32, 128)
b = rand(Float32, 32)
C = rand(Float32, 32)
@btime mygemmavx!($C, $A1, $B, $b)
@btime mygemmavx!($C, $A2, $B, $b)
which gives:

114.423 ns (0 allocations: 0 bytes)
14.678 μs (0 allocations: 0 bytes)
I don’t understand where this comes from. I ran into it while training weights for a neural network.
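In case it is relevant: 1f-40 is smaller than floatmin(Float32) ≈ 1.1754944f-38, so every entry of A2 is a subnormal (denormal) number, which x86 CPUs typically handle through slow microcode paths. A quick sketch of how one could check this and try the flush-to-zero workaround (set_zero_subnormals only affects the calling thread, and whether flushing is acceptable depends on the application):

issubnormal(1f-40)         # true: 1f-40 is below floatmin(Float32)
set_zero_subnormals(true)  # ask the CPU to flush subnormals to zero
@btime mygemmavx!($C, $A2, $B, $b)  # if subnormals are the cause, this should match the A1 timing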
Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 16 virtual cores