FP16.16 (and FP32.16) float type (in Julia) and stochastic rounding

Palli · July 1, 2022, 5:40pm

Historically we had only Float64 and Float32 in hardware, with deterministic rounding; and Float16 only as a storage format in Julia. I don’t know if stochastic rounding is used in any mainstream (CPU) hardware (or Nvidia’s GPUs), by now but it seems Julia should support, if available, and could emulate otherwise (not just with an external package few know of):

The only known hardware implementation available are Graphcore’s IPUs with stochastic rounding. The software emulation of stochastic rounding in StochasticRounding.jl makes the number format slower
[…]
about x5 performance decrease for Float32 and BFloat16, but only x2 for Float16. For more complicated benchmarks the performance decrease is usually within x10. About 50% of the time is spend on the random number generation with Xoroshiro128+.

The rounding is done on each operation, it seems to me, but there no reason too for FP32.16 (also the package doesn’t use the new RNG in Julia as of 1.7; and 1-bit RNG can be optimized (I’ve done so), I think all we need).

@elrod, with your package it seems we could support Graphcore’s formats on CPUs (and should? Not just rely on different GPU/AI hardware’s floats, that I think just substitute your float types with a global setting). @Oscar_Smith might know if what more we need to change than e.g. sum in Base and/or more in that package.

https://docs.graphcore.ai/projects/ai-float-white-paper/en/latest/ai-float.html

If the accumulated partial sums of products are FP16, these are first expanded to FP32, then added, then rounded back to FP16, then stored. This rounding mode can be either Round to Nearest (ties to even) or Stochastic Rounding, discussed below.

Fig. 7.1 shows the test accuracy of ResNet50 trained on ImageNet for batch size 4 with SGD and different mixed-precision formats:

FP32: FP16.16 AMP and FP32 master weights

FP16RN: FP16.16 AMP and FP16 master weights without stochastic rounding

FP16SR: FP16.16 AMP and FP16 master weights with stochastic rounding

The results show that in this case the IEEE FP16 master weights with stochastic rounding can attain the IEEE FP32 state-of-the art accuracy.

When I interviewed at Graphcore, and I asked about the float format of their chip, they wouldn’t tell me (it and any info on the ISA was proprietary back then, so might, likely have patents, but only covering hardware, not software implementation?). Now we know the format, and at least one instruction, the AMP:

To facilitate the execution of lower precision general matrix multiplications (GEMMs), the Colossus IPU’s Accumulating Matrix Product (AMP) and Slim Convolution (SLIC) instructions deliver high performance multiply-accumulate sequences and are supported for both single-precision and half-precision number formats and operate in the same basic manner. […]

allows for an efficient parallel GEMM implementation.

For each AMP call, the Colossus IPU’s AMP engines perform a parallel summation on blocks of 16 numbers, each block consisting of four sets of a 4-element FP16 dot product. The multiplicands in each dot product are aligned, compressed redundantly, then rounded with partial truncation to yield a result in IEEE FP32 format. Each of the four FP32 results, from each of the four dot products in a block, is summed sequentially according to IEEE FP32 addition. Finally, previously accumulated partial sums of products, in FP32 or FP16, can be fetched from memory, added to the block result, and stored back to memory. If the accumulated partial sums of products are FP16, these are first expanded to FP32, then added, then rounded back to FP16, then stored. This rounding mode can be either Round to Nearest (ties to even) or Stochastic Rounding, discussed below.

giordano · June 27, 2023, 10:32pm

At next JuliaCon I’ll give a talk about Julia on IPU, showing also an application of hardware stochastic rounding

To give a taster, on the CPU with normal rounding you get

julia> sum(Float16(0.9) for _ in 1:1000)
Float16(965.5)

julia> sum(Float16(0.9) for _ in 1:1000) ≈ Float16(900)
false

On the IPU with stochastic rounding the sum of one thousand Float16(0.9) is on average Float16(903), and

julia> Float16(903) ≈ Float16(900)
true

For the record, we don’t need to do anything on the Julia side, it’s all done on hardware with a switch.

giordano · June 28, 2023, 12:45pm

That wasn’t very accurate, I just tried to compute 10k times the sum of one thousand Float16(0.9) and I got this distribution:

julia> mean(Float64.(sums))
899.92595

julia> std(Float64.(sums))
4.691193060057608

julia> median(sums)
Float16(900.0)

julia> extrema(sums)
(Float16(879.5), Float16(918.0))

julia> all(isapprox(900), sums)
true

Topic		Replies	Views
Does float16 run natively on a compatible CPU? General Usage	14	571	July 11, 2024
Apples to apples comparison of A\b with Float64 and Float16 on A64FX Performance question , linearalgebra	12	783	May 2, 2022
Massive performance penalty for Float16 compared to Float32 Performance performance	17	8050	June 20, 2022
Half precision Internals & Design question	2	569	February 1, 2022
Nvidia TF32 format GPU	2	3504	May 16, 2020

FP16.16 (and FP32.16) float type (in Julia) and stochastic rounding

Related topics