Slowdown due to subnormal float, coming from neural net training

Here is a minimal working example

function mygemavx!(C, A, B, bias)
		@turbo for m ∈ axes(A,1)
			  Cmk = bias[m]
			  for k ∈ axes(A,2)
				  Cmk += A[m,k] * B[k]
			  end
			  C[m] = Cmk
		  end
		  C
	  end

A1=zeros(Float32,32,128)
A2=zeros(Float32,32,128).+1f-40
B=rand(Float32,128)
b=rand(Float32,32)
C=rand(Float32,32)

@btime mygemmavx!($C,$A1,$B,$b)

@btime mygemmavx!($C,$A2,$B,$b)

wich gives: 114.423 ns (0 allocations: 0 bytes)
14.678 μs (0 allocations: 0 bytes)

I don’t understand where this come from. I came across this by training weights for a neural network.

Julia Version 1.8.1
Commit afb6c60d69a (2022-09-06 15:09 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × Intel(R) Core™ i7-10700KF CPU @ 3.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 8 on 16 virtual cores

This is probably unrelated, but you seem to have the slowest axis in the innermost loop. Can you rewrite to change the iteration order?

I don’t think that would explain the time discrepancy, beside to my understanding @turbo reorder loop.

Anyway:

		@turbo for m ∈ axes(A,2)
			  Cmk = bias[m]
			  for k ∈ axes(A,1)
				  Cmk += A[k,m] * B[k]
			  end
			  C[m] = Cmk
		  end
		  C
	  end

A1=zeros(Float32,128,32)
A2=zeros(Float32,128,32).+1f-40
B=rand(Float32,128)
b=rand(Float32,32)
C=rand(Float32,32)

@btime mygemavx!($C,$A1,$B,$b)

@btime mygemavx!($C,$A2,$B,$b)

Gives exactly the same result:

109.105 ns (0 allocations: 0 bytes)
15.518 μs (0 allocations: 0 bytes)

Only if it can make sure it’s equivalent. Not sure how clever it is.

Can’t reproduce on M1 Mac Mini:

julia> @btime mygemmavx!($C,$A1,$B,$b);
  158.794 ns (0 allocations: 0 bytes)

julia> @btime mygemmavx!($C,$A2,$B,$b);
  157.916 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 4 virtual cores
Environment:
  JULIA_EDITOR = subl
  JULIA_NUM_THREADS = 6
  JULIA_PKG_SERVER = eu-central.pkg.julialang.org

(jl_DiqOoj) pkg> st
Status `/private/var/folders/qk/nm34_hqn7_7g8fvxgpg24gm40000gn/T/jl_DiqOoj/Project.toml`
  [bdcacae8] LoopVectorization v0.12.136

(FWIW, this is with the wrong loop order.)

Can reproduce on an Intel (Skylake) cluster:

julia> @btime mygemmavx!($C,$A1,$B,$b);
  157.162 ns (0 allocations: 0 bytes)

julia> @btime mygemmavx!($C,$A2,$B,$b);
  10.422 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 40 × Intel(R) Xeon(R) Gold 6148F CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 1 on 40 virtual cores
Environment:
  JULIA_EDITOR = vim

(jl_Z8sFfb) pkg> st
Status `/tmp/jl_Z8sFfb/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.1
  [bdcacae8] LoopVectorization v0.12.136

Can not reproduce on AMD (Zen 2) cluster. (Also Zen 3 doesn’t show the issue.)

julia> @btime mygemmavx!($C,$A1,$B,$b);
  209.415 ns (0 allocations: 0 bytes)

julia> @btime mygemmavx!($C,$A2,$B,$b);
  209.505 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7742 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 128 virtual cores
Environment:
  JULIA_EDITOR = vim

(jl_ZQZn8m) pkg> st
Status `/tmp/jl_ZQZn8m/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.1
  [bdcacae8] LoopVectorization v0.12.136
2 Likes

I can also reproduce it here in my laptop, Intel i7.

This might be because 1f-40 is subnormal. Changing it to 1f-36 made the difference in timing disappear for me.

5 Likes

For me too, changing to 1f-36 make the time difference disappear. Maybe then Flux should at least warn or prevent this to happen? I don’t really know about subnormal numbers.

the tldr is that the smallest floating point numbers have a slightly different representation and Intel cpus fall back to horribly slow paths when you have them. you can set a flag that will truncate them to zero if you don’t mind the accuracy hit.

2 Likes

So I should change the title to something else, as it has nothing to
do with loopvectorisation.

Is it impossible to reach a subnormal number in a iterative algorithm that starts with “normal” numbers? Or can we get these unpredictable slowdowns when running any calculation?

ps: it seems it is possible (there is no rounding to zero here):

julia> x = 1.f-20
1.0f-20

julia> x^2
1.0f-40

Does the same may happen with 64 floats?

It happened to weights of neural networks that i initialize with 0.001f0*randn, so probably all were “normal” at start. I noticed because suddenly it went very slow and i couldn’t figure out why (i use those nets in alpha beta so the slow down was evident) so it is definitely possible and hard to guess if you don’t already know it could be an issue.

As you’ve seen, subnormals can appear from calculations with normal floats (otherwise they wouldn’t be useful). The same happens with 64 bit floats, but the cutoff for Float64 is 2^-1022 (roughly 2e-308) so you are a lot less likely to hit them accidentally. If this is a problem you can run set_zero_subnormals(true) which will tell your processor to round all subnormals to zero (for the current thread). Note that doing this can slightly break math as for example, with subnormals rounded to zero, you lose the property that x-y==0 iff x==y.

4 Likes

Nice, thanks. For the record, I had to see that for myself, as being something independent on the compiler, and language. This is with Fortran:

julia> @btime subnormal_fortran!($C,$A1,$B,$b);
  8.306 μs (0 allocations: 0 bytes)

julia> @btime subnormal_fortran!($C,$A2,$B,$b);
  117.794 μs (0 allocations: 0 bytes)

Where the corresponding codes are these:

subnormal.f90
subroutine subnormal(C, A, B, bias)
    implicit none
    integer :: m, k
    integer, parameter :: dp = kind(0.e0)
    real(dp) :: A(32,128), B(128), bias(32), C(32)
    real(dp) :: Cmk
    do m = 1, 32
        Cmk = bias(m)
        do k = 1, 128
            Cmk = Cmk + A(m,k) * B(k)
        end do
        C(m) = Cmk
    end do
end subroutine subnormal

Compile with gfortran subnormal.f90 -shared -o subnormal.so

Benchmark within Julia by defining:

julia> subnormal_fortran!(C,A,B,b) = ccall(
           (:subnormal_,"./subnormal.so"),
           Nothing,
           (Ref{Float32}, Ref{Float32}, Ref{Float32}, Ref{Float32}),
           C, A, B, b
       )

https://docs.julialang.org/en/v1/manual/performance-tips/#Treat-Subnormal-Numbers-as-Zeros

4 Likes

I see was answered while I was writing this comment:

Some CPUs aren’t IEEE-conformant and don’t have subnormals (because NOT having them can prevent slowdown, and the hardware for them is more complex, or possibly not there, implemented in software), or at least have it as an option (and GPUs I guess). I’m not sure if x86 has both ways, is there a way in Julia to disable support for subnormals, in case the hardware supports that?

I believe most popular archictectures have an option to change this, via a register or a flag. Which has lead to some very fun issues. benchmarking ("exponent", "subnorm", "Float32") produced the DomainError when ApproxFun package is also used · Issue #253 · JuliaCI/BaseBenchmarks.jl · GitHub

Subnormals show up a lot in some numerical methods. I recall William Kahan, one of the key forces for IEEE-754 arithmetic, mentioning a solution to the performance problem. He gave the example of a heat-flow solver, with positive-temperature boundary conditions. He noted that initially setting the interior to absolute zero was physically unrealistic – there must be some thermal noise! So add some to the numerical model. I took the advice in this code by injecting noise that cleared up subnormal slowdown problems.

3 Likes

That’s equivalent to passing in A' to your original version, which I’d have expected to be much slower. I am surprised that this isn’t what your timings show.

It should be pretty clever, and will reorder loops differently depending on whether or not the input arrays are transposed, for example, so that you only need to write one version of your functions.

Comparing to the gfortran timing above isn’t fair without at least -O3 -march=native, though.
I believe -Ofast would be zero subnormals (which LV does not do; the user controls it via Base.set_zero_subnormals(true)).

2 Likes