I guess we have all seen the hype about the A100. Is this going to be useful for Julia GPU?
In the near term, the BFloat16 format will see broader support, based on this news:
BFloat16 Support About To Land Within LLVM (13 May 2020)
“Arm has been pushing along the BFloat16 support for LLVM with ARMv8.6-A supporting the new format. But this BFloat16 LLVM support is also relevant ultimately for Intel AVX-512 BF16, Intel Nervana, Google Cloud TPUs, and other hardware coming out with BF16 support to bolster their machine learning capabilities.”
Disclosure - I work for Nvidia (specifically TensorRT) and have written unit tests that can distinguish whether TF32 kicked in or not.
The advantage of TF32 is that the format is the same as FP32. When computing inner products with TF32, the input operands have their mantissas rounded from 23 bits to 10 bits. The rounded operands are multiplied exactly, and accumulated in normal FP32.
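To make the rounding concrete, here is a minimal sketch (not Nvidia's actual hardware behavior) that emulates the TF32 operand rounding on the CPU: it keeps FP32's sign and 8-bit exponent and drops the low 13 mantissa bits, leaving the 10 explicit mantissa bits TF32 uses. For simplicity this sketch truncates, whereas real hardware rounds to nearest.

```python
import struct

def tf32_round(x: float) -> float:
    """Emulate TF32 operand rounding: keep FP32's 8-bit exponent,
    truncate the 23-bit mantissa to 10 bits (sketch only; the actual
    hardware rounds to nearest rather than truncating)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~0x1FFF  # clear the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_round(1.0))  # exactly representable, unchanged
print(tf32_round(0.1))  # loses the mantissa bits below bit 10
```

Note that the result is still an ordinary FP32 value, which is exactly why no new data type is needed anywhere else in the code.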
The big advantage of TF32 is that compiler support is required only at the deepest levels, i.e. inside the CUDA compiler. The rest of the code just sees FP32 with less precision, but the same dynamic range. Big linear operations are usually done via libraries anyway, e.g. the BLAS sgemm. So exploiting TF32 will largely be a matter of tweaking callers of these libraries to indicate whether TF32 is okay. E.g., perhaps use it for the initial iterations of a linear solver, and then use slower FP32 to polish the results.
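The solver pattern mentioned above (cheap low-precision iterations, then full-precision polish) is classic iterative refinement. Here is a small self-contained sketch of the idea, with a hypothetical `solve2_low` standing in for a TF32-backed library solve: it truncates operands to roughly TF32 mantissa precision, and the refinement loop computes residuals in full precision to recover an accurate answer.

```python
import struct

def trunc(x: float) -> float:
    """Crude low-precision stand-in: truncate a float32 mantissa to
    10 bits, roughly mimicking TF32 operand rounding (sketch only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & ~0x1FFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def solve2_low(A, b):
    """2x2 solve via Cramer's rule on truncated operands --
    a hypothetical stand-in for a TF32 GEMM-based solver."""
    (a11, a12), (a21, a22) = A
    det = trunc(a11) * trunc(a22) - trunc(a12) * trunc(a21)
    x1 = (trunc(b[0]) * trunc(a22) - trunc(a12) * trunc(b[1])) / det
    x2 = (trunc(a11) * trunc(b[1]) - trunc(b[0]) * trunc(a21)) / det
    return [x1, x2]

def refine(A, b, iters=3):
    x = solve2_low(A, b)  # cheap low-precision initial solve
    for _ in range(iters):
        # residual in full precision, correction in low precision
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        dx = solve2_low(A, r)
        x = [x[i] + dx[i] for i in range(2)]
    return x

A = [[4.1, 1.2], [1.3, 3.7]]
b = [1.0, 2.0]
x = refine(A, b)  # converges toward the exact solution [1.3/13.61, 6.9/13.61]
```

The point of the pattern is that each refinement step shrinks the error by roughly the low-precision unit roundoff times the condition number, so a few cheap iterations recover full-precision accuracy on well-conditioned problems.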
Formats such as FP16 and BFloat16 are more work, since they involve different bit layouts. We still encourage programmers to put effort into using those formats, since they reduce memory bandwidth and consequently permit even faster execution. TF32 exists as something that can be quickly plugged in to exploit Tensor Core speed without much work.
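As a small illustration of why BFloat16 is a different bit layout rather than just "FP32 with fewer bits in place": a BFloat16 value is the top 16 bits of an IEEE float32, so converting changes the storage width and requires a distinct type. This sketch truncates; hardware conversions typically round to nearest even.

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    """BFloat16 is the top 16 bits of an IEEE float32 (truncation
    sketch; hardware typically rounds to nearest even)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def from_bfloat16_bits(h: int) -> float:
    """Widen a BFloat16 bit pattern back to float32 by zero-filling
    the low 16 mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]

print(hex(to_bfloat16_bits(1.0)))        # 0x3f80
print(from_bfloat16_bits(0x4049))        # pi truncated to 3.140625
```

Unlike TF32, every producer and consumer of these 16-bit values has to agree on the layout, which is why FP16/BFloat16 need explicit support up and down the software stack.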