I’m getting somewhat surprising benchmark results when multiplying certain matrices with vectors. Just changing the values stored in the vector changes the time needed for the multiplication by a factor of 50.
The test setup is pretty simple:
A is a 101x101 Matrix
v1 and v2 are Vectors of length 101
All entries are Complex128
The test data is stored in data.jld and can be loaded with the following code:
using JLD, BenchmarkTools
d = load("data.jld")
A = d["A"]
v1 = d["v1"]
v2 = d["v2"]
result1 = @benchmark $A * $v1
result2 = @benchmark $A * $v2
Running this code consistently gives me the following result:
I have the following questions:
Am I doing the benchmark correctly?
Is this result reproducible by anyone else?
Is there a theoretical reason why changing the values should influence the speed of the multiplication?
Just to elaborate on that: subnormals (aka denormals) are floating-point values with exponents so small that the significand no longer uses all of its bits, so they have less than full precision. This allows something known as “gradual underflow”, where you lose precision gradually instead of immediately getting a zero value. Doing arithmetic with subnormal values does not go through the normal CPU pathways (on Intel hardware) and thus takes considerably longer; i.e. floating-point ops do not take a fixed number of clock cycles, which is what you’re seeing here.
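The effect is easy to reproduce without the original data file. Here's a minimal sketch, assuming subnormals are indeed the culprit (the matrix contents and the value 5e-310 are my own choices, not taken from data.jld):

```julia
using BenchmarkTools, LinearAlgebra

n = 101
A  = rand(ComplexF64, n, n)
v1 = rand(ComplexF64, n)                  # normal-range entries
v2 = fill(ComplexF64(5e-310, 5e-310), n)  # real and imaginary parts are subnormal
@assert all(x -> issubnormal(real(x)), v2)

@btime $A * $v1   # fast path
@btime $A * $v2   # typically many times slower on Intel hardware
```

On hardware that takes the slow microcode/trap path for subnormal inputs, the second benchmark should show the same kind of slowdown as the original test.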
I know that I read the performance tip about subnormal numbers at some point, but I didn’t know what they are and therefore didn’t really associate this problem with it. Great to learn something new, and thank you all very much for your help!
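For reference, the performance tip in question is about `set_zero_subnormals`. A small sketch of its effect; note that it applies per thread, sacrifices strict IEEE 754 semantics, and returns `false` on hardware that can't flush:

```julia
x = 5e-310                        # a subnormal Float64
@assert issubnormal(x)

ok = set_zero_subnormals(true)    # returns false if the hardware can't flush
flushed = x + x                   # subnormal inputs are treated as zero when flushing
set_zero_subnormals(false)        # restore strict IEEE 754 semantics
exact = x + x                     # still a subnormal result

ok && @assert flushed == 0.0
@assert issubnormal(exact)
```

With flushing enabled, the slow subnormal path is never taken, at the cost of losing gradual underflow.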
ARMv7 and earlier never supported subnormal/denormal numbers, so they didn’t have the “[performance] problem” because they didn’t try to deal with subnormals at all; instead they had the opposite problem of no gradual underflow (results simply flush to zero). Since ARMv8 they are fully IEEE compliant, but if I recall correctly not by default:
"To permit this optimization, ARM floating-point implementations have a special processing mode called Flush-to-zero mode. AArch32 Advanced SIMD floating-point instructions always use Flush-to-zero mode.
Behavior in Flush-to-zero mode differs from normal IEEE 754 arithmetic in the following ways:"
"Flush to Zero mode. Indicates whether the VFP hardware implementation supports only the Flush-to-Zero mode of operation. Permitted values are:
0001 Hardware supports full denormalized number arithmetic."
Also strikes me as funny, because in general my 2.7 GHz Intel is often about 50% faster than my 3.6 GHz AMD in benchmarks.
Almost surprising to hear Intel is actually worse per clock cycle for some operations.
It seems you’re right about e.g. the Cortex-A7; I found it has hardware support for both normal and subnormal arithmetic. Maybe I’m not reading the manuals right about whether you can rely on it in ARMv7, but certainly if you go back to an old enough ARM architecture you have no FPU at all:
0b0000 Hardware supports only the Flush-to-Zero mode of operation. If a VFP subarchitecture is implemented its support code might include support for full denormalized number arithmetic.
0b0001 Hardware supports full denormalized number arithmetic."
The FPU situation over the years has been all over the place… In general you can’t rely on denormals, e.g. for Sun:
“IEEE 754 says nothing about a flush-to-zero mode to handle denormalized numbers faster, some architectures offer this mode (e.g. http://docs.sun.com/source/806-3568/ncg_lib.html ).
There are platforms that support flush-to-zero only, and there are many platforms where flush-to-zero is the default.
ARM Cortex cores have a flush to zero option, hard to see how you can ignore it. Then again, don’t take business advice from a forum.”
“For instance, the VFP11 coprocessor does not process subnormal input values directly. To provide correct handling of subnormal inputs according to the IEEE 754 standard, a trap is made to support code to process the operation.”
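As a footnote to the platform discussion, you can probe at runtime whether subnormals survive arithmetic on the current machine and thread. A sketch; a thread running in flush-to-zero mode would report false:

```julia
smallest = nextfloat(0.0)     # 5.0e-324, the smallest positive subnormal Float64
@assert issubnormal(smallest)

# In flush-to-zero mode this sum would be flushed to 0.0 instead.
gradual_underflow = (smallest + smallest) != 0.0
println("gradual underflow active: ", gradual_underflow)
```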