50x speed difference in gemv for different values in vector

I’m getting somewhat surprising benchmark results when multiplying certain matrices with vectors. Just changing the values stored in the vector changes the time needed for the multiplication by a factor of 50.

The test setup is pretty simple:

  • A is a 101x101 Matrix
  • v1 and v2 are Vectors of length 101
  • All entries are Complex128
The test data is stored in data.jld and can be loaded with the following code:
using BenchmarkTools
using JLD

d = load("data.jld")
A = d["A"]
v1 = d["v1"]
v2 = d["v2"]

result1 = @benchmark $A * $v1
result2 = @benchmark $A * $v2


Running this code consistently gives me the following result:

Trial(23.188 μs)
Trial(1.028 ms)

I have the following questions:

  • Am I doing the benchmark correctly?
  • Is this result reproducible by anyone else?
  • Is there a theoretical reason why changing the values should influence the speed of the multiplication?

Subnormals are the devil:

julia> @btime $A * $v1;
  3.449 μs (2 allocations: 1.78 KiB)

julia> @btime $A * $v2;
  82.263 μs (2 allocations: 1.78 KiB)

julia> set_zero_subnormals(true)

julia> @btime $A * $v1;
  3.385 μs (2 allocations: 1.78 KiB)

julia> @btime $A * $v2;
  3.384 μs (2 allocations: 1.78 KiB)
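For anyone without the original data file, a minimal sketch along these lines should show the same effect. The matrix and vectors are hypothetical stand-ins (not the contents of data.jld), `has_subnormal` and `time_matvec` are helper names made up for this example, and `ComplexF64` is the current name of `Complex128`. Whether the slowdown appears, and how large it is, depends on the CPU and the BLAS build, so no timings are claimed here.

```julia
using LinearAlgebra

n = 101
A = rand(ComplexF64, n, n)                  # stand-in for the matrix in the thread
v_normal = rand(ComplexF64, n)              # ordinary values, no subnormals
v_tiny = fill(complex(1e-310, 2e-310), n)   # every component is below floatmin(Float64)

# A complex number is affected if either of its components is subnormal.
has_subnormal(z::Complex) = issubnormal(real(z)) || issubnormal(imag(z))

println("subnormal entries: ", count(has_subnormal, v_normal),
        " vs ", count(has_subnormal, v_tiny))

# Average seconds per in-place product, using only Base timing tools.
function time_matvec(A, v; reps = 1_000)
    y = similar(v)
    mul!(y, A, v)                           # warm-up call so compilation is not timed
    t = @elapsed for _ in 1:reps
        mul!(y, A, v)
    end
    return t / reps
end

t_normal = time_matvec(A, v_normal)
t_tiny = time_matvec(A, v_tiny)
println("normal: ", t_normal, " s, subnormal: ", t_tiny, " s")
```

Running `count(has_subnormal, v2)` on the thread's own `v2` would be a quick way to confirm the diagnosis before changing any floating-point mode.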

Just to elaborate on that, subnormals (aka denormals) are floating-point values with large negative exponents – so large that they no longer use all the bits of the value and have less than full precision. This allows something known as “gradual underflow” where you lose precision gradually, instead of immediately getting a zero value. Doing arithmetic with subnormal values does not go through the normal CPU pathways (on Intel hardware) and thus takes considerably longer – i.e. floating-point ops do not take a fixed number of clock cycles, which is what you’re seeing here.
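To make the definition concrete, here is a small sketch using only Base functions. The variable names are just for illustration, and the default IEEE mode is assumed (that is, `set_zero_subnormals(true)` has not been called).

```julia
smallest_normal = floatmin(Float64)     # 2.0^-1022, smallest positive normal Float64
smallest_subnormal = nextfloat(0.0)     # 2.0^-1074, smallest positive Float64 of any kind

println(issubnormal(smallest_normal))        # a normal number
println(issubnormal(smallest_normal / 2))    # below floatmin, so subnormal

# The spacing between neighbouring values is fixed throughout the subnormal range,
# so relative precision shrinks as values approach zero (gradual underflow).
println(eps(smallest_normal) == smallest_subnormal)
println(eps(smallest_normal / 1024) == smallest_subnormal)

# Below the last subnormal the result finally rounds to zero.
println(smallest_subnormal / 2 == 0.0)
```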


A short note that this is architecture/implementation dependent. All Intel CPUs I’ve seen have the problem. Not sure about AMD x86 cores. None of the ARMv7 or AArch64 cores have this problem.

It’s funny that Intel originally proposed gradual underflow (AFAIK) and now they have the slowest implementation for dealing with it…


I know that I read the performance tip about subnormal numbers at some point, but I didn’t know what they were and therefore didn’t really associate this problem with it. Great to learn something new and thank you all very much for your help!

ARMv7 and before never supported subnormal/denormal numbers, so they didn’t have the “[performance] problem” as they didn’t try to “deal”; they had a “gradual underflow” problem instead. Since ARMv8 they are fully IEEE compliant, but if I recall correctly not by default:

"To permit this optimization, ARM floating-point implementations have a special processing mode called Flush-to-zero mode. AArch32 Advanced SIMD floating-point instructions always use Flush-to-zero mode.

Behavior in Flush-to-zero mode differs from normal IEEE 754 arithmetic in the following ways:"


"Flush to Zero mode. Indicates whether the VFP hardware implementation supports only the Flush-to-Zero mode of operation. Permitted values are:


0001 Hardware supports full denormalized number arithmetic."



If anybody cares to know I also found this:

Also strikes me as funny because in general my 2.7 GHz Intel is often about 50% faster than my 3.6 GHz AMD in benchmarks.
Almost surprising to hear they’re actually worse per clock cycle in some operations.

This is wrong. They do. NEON/ASIMD don’t, which is why it’s not the default FPU.

They do have the “correct” trade-off in most cases, since very few calculations actually need subnormal numbers.
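What is given up in that trade-off can be sketched as follows. The Julia documentation for `set_zero_subnormals` warns that flushing can break identities such as `(x - y == 0) == (x == y)`; the values below are chosen for illustration only, and since the mode is permitted but not required to flush, the flushed result may differ between machines.

```julia
x = 3.0e-308
y = 2.5e-308          # both are normal numbers (floatmin(Float64) is about 2.2e-308)

d_ieee = x - y        # about 5e-309: nonzero, but subnormal

supported = set_zero_subnormals(true)   # returns false if the hardware cannot flush to zero
d_ftz = try
    x - y             # may be flushed to 0.0 while the mode is active
finally
    set_zero_subnormals(false)          # restore IEEE behaviour for the current thread
end

println("x == y: ", x == y)
println("IEEE difference: ", d_ieee, " (subnormal: ", issubnormal(d_ieee), ")")
println("flushed difference: ", d_ftz, " (flush-to-zero supported: ", supported, ")")
```

If the flushed difference prints as 0.0 while `x == y` is false, code that tests `x - y == 0` to guard a division would misbehave under flush-to-zero, which is the kind of calculation that still needs subnormals.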

It seems you’re right about e.g. Cortex-A7; I found they have hardware support for both normal and subnormal numbers. Maybe I’m not reading the manuals right about whether you can rely on that in ARMv7, but certainly if you go to an old enough ARM arch, then you have no FPU at all :)

http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf ARMv7 manual
"Indicates whether the Floating-point Extension hardware implementation supports only the Flush-to-Zero mode of operation. Permitted values are:

0b0000 Hardware supports only the Flush-to-Zero mode of operation. If a VFP subarchitecture is implemented its support code might include support for full denormalized number arithmetic.
0b0001 Hardware supports full denormalized number arithmetic."

The FPU situation over the years has been all over the place… In general you can’t rely on denormals, e.g. for Sun:

“IEEE 754 says nothing about a flush-to-zero mode to handle denormalized numbers faster, some architectures offer this mode (e.g. http://docs.sun.com/source/806-3568/ncg_lib.html ).
There are platforms that support flush-to-zero only, and there are many platforms where flush-to-zero is the default.
ARM Cortex cores have a flush to zero option, hard to see how you can ignore it. Then again, don’t take business advice from a forum.”

“For instance, the VFP11 coprocessor does not process subnormal input values directly. To provide correct handling of subnormal inputs according to the IEEE 754 standard, a trap is made to support code to process the operation”