50x speed difference in gemv for different values in vector

bastikr · March 19, 2017, 11:57am

I’m getting somewhat surprising benchmark results when multiplying certain matrices with vectors. Changing only the values stored in the vector changes the time needed for the multiplication by a factor of 50.

The test setup is pretty simple:

A is a 101x101 Matrix
v1 and v2 are Vectors of length 101
All entries are Complex128
The test data is stored in data and can be used with the following code

using BenchmarkTools
using JLD

d = load("data.jld")
A = d["A"]
v1 = d["v1"]
v2 = d["v2"]

result1 = @benchmark $A * $v1
result2 = @benchmark $A * $v2

println(result1)
println(result2)

Running this code consistently gives me the following result:

Trial(23.188 μs)
Trial(1.028 ms)

I have the following questions:

Am I doing the benchmark correctly?
Is this result reproducible by anyone else?
Is there a theoretical reason why changing the values should influence the speed of the multiplication?

kristoffer.carlsson · March 19, 2017, 12:26pm

Subnormals are the devil:
http://docs.julialang.org/en/release-0.5/manual/performance-tips/#treat-subnormal-numbers-as-zeros

julia> @btime $A * $v1;
  3.449 μs (2 allocations: 1.78 KiB)

julia> @btime $A * $v2;
  82.263 μs (2 allocations: 1.78 KiB)

julia> set_zero_subnormals(true)
true

julia> @btime $A * $v1;
  3.385 μs (2 allocations: 1.78 KiB)

julia> @btime $A * $v2;
  3.384 μs (2 allocations: 1.78 KiB)
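To check whether a vector is affected, one can count its subnormal entries and flush subnormals only around the hot computation. The following is a minimal sketch, not the code used above: `has_subnormal`, `count_subnormals`, and `with_zero_subnormals` are hypothetical helper names, and `v_tiny` is synthetic stand-in data, since the contents of the original `v2` are not shown in this thread.

```julia
# Hypothetical helper: true if the real or imaginary part is subnormal.
has_subnormal(z::Complex) = issubnormal(real(z)) || issubnormal(imag(z))

# Hypothetical helper: number of entries with at least one subnormal part.
count_subnormals(v::AbstractVector{<:Complex}) = count(has_subnormal, v)

# Synthetic data (an assumption, not the thread's data.jld).
v_normal = fill(complex(1.0, 1.0), 101)
v_tiny   = fill(complex(1e-310, 1e-310), 101)  # 1e-310 is below floatmin(Float64)

println(count_subnormals(v_normal))
println(count_subnormals(v_tiny))

# Hypothetical helper: run `f` with subnormals flushed to zero,
# then restore whatever mode was active before.
function with_zero_subnormals(f)
    old = get_zero_subnormals()
    set_zero_subnormals(true)
    try
        return f()
    finally
        set_zero_subnormals(old)
    end
end

A_demo = fill(complex(1.0, 0.0), 101, 101)
y = with_zero_subnormals(() -> A_demo * v_tiny)
```

Restoring the previous mode keeps the change local, which matters because flushing subnormals alters floating-point results elsewhere on the same thread.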

StefanKarpinski · March 19, 2017, 3:13pm

Just to elaborate on that, subnormals (aka denormals) are floating-point values with large negative exponents – so large that they no longer use all the bits of the value and have less than full precision. This allows something known as “gradual underflow” where you lose precision gradually, instead of immediately getting a zero value. Doing arithmetic with subnormal values does not go through the normal CPU pathways (on Intel hardware) and thus takes considerably longer – i.e. floating-point ops do not take a fixed number of clock cycles, which is what you’re seeing here.
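Gradual underflow can be seen directly at the REPL. This is a small sketch using only Base functions; it assumes the default mode, i.e. `set_zero_subnormals(true)` has not been called, and the variable names are illustrative.

```julia
# Smallest positive normal Float64 (2.0^-1022) and
# smallest positive subnormal Float64 (2.0^-1074).
smallest_normal    = floatmin(Float64)
smallest_subnormal = nextfloat(0.0)

# Halving the smallest normal number gives a subnormal value
# rather than jumping straight to zero.
x = smallest_normal / 2

println(issubnormal(smallest_normal))  # false
println(issubnormal(x))                # true
println(x > 0)                         # true

# In the subnormal range the spacing between neighbouring values is fixed
# at 2.0^-1074, so relative precision drops as values get closer to zero.
println(eps(x) == smallest_subnormal)  # true
```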

yuyichao · March 19, 2017, 3:59pm

A short note that this is architecture/implementation dependent. All Intel CPUs I’ve seen have the problem. Not sure about AMD x86 cores. No ARMv7 or AArch64 cores have this problem.

It’s funny that Intel originally proposed gradual underflow (AFAIK) and now they have the slowest implementation for dealing with it…

bastikr · March 20, 2017, 8:03am

I know that I read the performance tip about subnormal numbers at some point, but I didn’t know what they are and therefore didn’t really associate this problem with it. Great to learn something new, and thank you all very much for your help!

Palli · August 16, 2017, 9:45am

ARMv7 and before never supported subnormal/denormal numbers, so they didn’t have the “[performance] problem” as they didn’t try to “deal” with them; they had a “gradual underflow” problem instead. Since ARMv8 they are fully IEEE compliant but, if I recall, not by default:

"To permit this optimization, ARM floating-point implementations have a special processing mode called Flush-to-zero mode. AArch32 Advanced SIMD floating-point instructions always use Flush-to-zero mode.

Behavior in Flush-to-zero mode differs from normal IEEE 754 arithmetic in the following ways:"

and

"Flush to Zero mode. Indicates whether the VFP hardware implementation supports only the Flush-to-Zero mode of operation. Permitted values are:

[…]

0001 Hardware supports full denormalized number arithmetic."

infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CJAIJAIJ.html

If anybody cares to know I also found this:

github.com/WebAssembly/simd

Allow flushing of subnormals in floating point SIMD operations

opened 04:47PM - 14 Apr 17 UTC

closed 12:42AM - 14 Dec 18 UTC

stoklund

The proposal in #1 includes this text:

> An implementation is allowed to flush subnormals in arithmetic floating-point operations. This means that any subnormal operand is treated as 0, and any subnormal result is rounded to 0.
>
> Note that this differs from WebAssembly scalar floating-point semantics which require correct subnormal handling.

The issue is also mentioned in the [future features](https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#flushing-subnormal-values-to-zero) design document.

The practical issue for SIMD is 32-bit ARM devices: The ARMv7 ISA has two instruction sets for floating point, VFP and NEON. VFP provides scalar floating point instructions with full support for IEEE 754 subnormal values. NEON provides 64-bit and 128-bit SIMD floating point instructions *that only have flush-to-zero semantics for subnormal numbers*. The same is true of the AArch32 mode of ARMv8. Only AArch64 supports subnormal values in SIMD instructions.

In summary, if we want to run floating-point SIMD code on 32-bit ARM devices (and 64-bit ARM devices running in 32-bit mode) we need to allow for subnormal values to be flushed to zero.
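The flush-to-zero semantics quoted in that issue (subnormal operands treated as 0, subnormal results rounded to 0) can be emulated in software to see their effect. This is only an illustrative sketch: `flush_to_zero` and `ftz_mul` are hypothetical names, and keeping the sign of the flushed zero is an assumption of this sketch rather than something the quoted text specifies.

```julia
# Hypothetical helper: subnormal values become a zero of the same sign
# (sign handling is an assumption); everything else passes through unchanged.
flush_to_zero(x::AbstractFloat) = issubnormal(x) ? copysign(zero(x), x) : x

# Emulated flush-to-zero multiply: flush both operands, multiply, flush the result.
ftz_mul(a, b) = flush_to_zero(flush_to_zero(a) * flush_to_zero(b))

a = 1f-20   # normal Float32
b = 1f-20   # normal Float32

# With IEEE 754 semantics (Julia's default mode) the product is a tiny
# nonzero subnormal; with the emulated flush-to-zero it becomes zero.
println(a * b)
println(ftz_mul(a, b))
```

Both operands here are ordinary normal numbers, which shows that flushing affects results and not only inputs.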

Elrod · August 16, 2017, 10:17am

Also strikes me as funny because in general my 2.7 GHz Intel is often about 50% faster than my 3.6 GHz AMD in benchmarks.
Almost surprising to hear they’re actually worse per clock cycle in some operations.

yuyichao · August 16, 2017, 12:09pm

This is wrong. They do. NEON/ASIMD don’t, which is why it’s not the default FPU.

They do have the “correct” trade-off in most cases, since very few calculations actually need subnormal numbers.

Palli · August 16, 2017, 3:07pm

It seems you’re right about e.g. the Cortex-A7; I found it has hardware support for both normal and subnormal numbers. Maybe I’m not reading the manuals right about whether you can rely on that in ARMv7, but certainly if you go to an old enough ARM architecture, you have no FPU at all.

http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf ARMv7 manual
"Indicates whether the Floating-point Extension hardware implementation supports only the Flush-to-Zero mode of operation. Permitted values are:

0b0000 Hardware supports only the Flush-to-Zero mode of operation. If a VFP subarchitecture is implemented its support code might include support for full denormalized number arithmetic.
0b0001 Hardware supports full denormalized number arithmetic."

The FPU situation over the years has been all over the place… In general you can’t rely on denormals, e.g. for Sun:

“IEEE 754 says nothing about a flush-to-zero mode to handle denormalized numbers faster, some architectures offer this mode (e.g. http://docs.sun.com/source/806-3568/ncg_lib.html ).
[…]
There are platforms that support flush-to-zero only, and there are many platforms where flush-to-zero is the default.
[…]
ARM Cortex cores have a flush to zero option, hard to see how you can ignore it. Then again, don’t take business advice from a forum.”

“For instance, the VFP11 coprocessor does not process subnormal input values directly. To provide correct handling of subnormal inputs according to the IEEE 754 standard, a trap is made to support code to process the operation”
