Historically we had only Float64 and Float32 in hardware, with deterministic rounding; and Float16 only as a storage format in Julia. I don’t know if stochastic rounding is used in any mainstream (CPU) hardware (or Nvidia’s GPUs), by now but it seems Julia should support, if available, and could emulate otherwise (not just with an external package few know of):
The only known hardware implementation available are Graphcore’s IPUs with stochastic rounding. The software emulation of stochastic rounding in StochasticRounding.jl makes the number format slower
about x5 performance decrease for Float32 and BFloat16, but only x2 for Float16. For more complicated benchmarks the performance decrease is usually within x10. About 50% of the time is spend on the random number generation with Xoroshiro128+.
The rounding is done on each operation, it seems to me, but there no reason too for FP32.16 (also the package doesn’t use the new RNG in Julia as of 1.7; and 1-bit RNG can be optimized (I’ve done so), I think all we need).
@elrod, with your package it seems we could support Graphcore’s formats on CPUs (and should? Not just rely on different GPU/AI hardware’s floats, that I think just substitute your float types with a global setting). @Oscar_Smith might know if what more we need to change than e.g. sum in Base and/or more in that package.
If the accumulated partial sums of products are FP16, these are first expanded to FP32, then added, then rounded back to FP16, then stored. This rounding mode can be either Round to Nearest (ties to even) or Stochastic Rounding, discussed below.
Fig. 7.1 shows the test accuracy of ResNet50 trained on ImageNet for batch size 4 with SGD and different mixed-precision formats:
- FP32: FP16.16 AMP and FP32 master weights
- FP16RN: FP16.16 AMP and FP16 master weights without stochastic rounding
- FP16SR: FP16.16 AMP and FP16 master weights with stochastic rounding
The results show that in this case the IEEE FP16 master weights with stochastic rounding can attain the IEEE FP32 state-of-the art accuracy.
When I interviewed at Graphcore, and I asked about the float format of their chip, they wouldn’t tell me (it and any info on the ISA was proprietary back then, so might, likely have patents, but only covering hardware, not software implementation?). Now we know the format, and at least one instruction, the AMP:
To facilitate the execution of lower precision general matrix multiplications (GEMMs), the Colossus IPU’s Accumulating Matrix Product (AMP) and Slim Convolution (SLIC) instructions deliver high performance multiply-accumulate sequences and are supported for both single-precision and half-precision number formats and operate in the same basic manner. […]
allows for an efficient parallel GEMM implementation.
For each AMP call, the Colossus IPU’s AMP engines perform a parallel summation on blocks of 16 numbers, each block consisting of four sets of a 4-element FP16 dot product. The multiplicands in each dot product are aligned, compressed redundantly, then rounded with partial truncation to yield a result in IEEE FP32 format. Each of the four FP32 results, from each of the four dot products in a block, is summed sequentially according to IEEE FP32 addition. Finally, previously accumulated partial sums of products, in FP32 or FP16, can be fetched from memory, added to the block result, and stored back to memory. If the accumulated partial sums of products are FP16, these are first expanded to FP32, then added, then rounded back to FP16, then stored. This rounding mode can be either Round to Nearest (ties to even) or Stochastic Rounding, discussed below.