If you want to save space and my proposal for 8-bit fixed point isn’t good enough, there’s also a different number format, dynamic tree quantization. It’s simple, though I’ve not yet seen it implemented in Julia. I also just discovered a newer format that builds on it, dynamic quantization:
Dynamic Tree Quantization
By moving the indicator bit, numbers can have a large exponent 10^-7 or precision as high as 1/63. […]
Dynamic tree quantization is strictly defined to quantize numbers in the range [-1.0, 1.0]
See chapter 1.3 and fig 2 at:
In your case, the sign bit could be sacrificed for better precision, up to 1/127.
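To make the bit layout concrete, here is a minimal Python sketch of how I read the format: one sign bit, then a run of zero bits whose length sets the decimal exponent, then the indicator bit, then whatever bits remain as a linear fraction. The function names, the choice to map the fraction onto [0, 1] in steps of 1/(2^n - 1), and treating an all-zero body as 0.0 are my assumptions for illustration, not the paper's exact code map.

```python
def decode_dynamic_tree(byte: int) -> float:
    """Decode one 8-bit code under an ASSUMED dynamic tree layout:
    [sign][k zero bits][indicator 1][6 - k linear fraction bits]."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    body = byte & 0x7F                      # the 7 bits after the sign
    if body == 0:
        return 0.0                          # assumption: no indicator bit means zero
    zeros = 7 - body.bit_length()           # zero bits before the indicator
    frac_bits = 6 - zeros                   # bits left after the indicator
    frac = body & ((1 << frac_bits) - 1)
    # Assumption: fraction is linear on [0, 1]; with no fraction bits use 1.0.
    mantissa = 1.0 if frac_bits == 0 else frac / ((1 << frac_bits) - 1)
    return sign * mantissa * 10.0 ** (-zeros)


# All 256 decoded values, so encoding is a nearest-neighbour search.
CODEBOOK = [decode_dynamic_tree(b) for b in range(256)]


def quantize(x: float) -> int:
    """Return the code whose decoded value is nearest to x, clamped to [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return min(range(256), key=lambda b: abs(CODEBOOK[b] - x))
```

With the indicator bit right after the sign there are 6 fraction bits, so the step near 1.0 is 1/63; with the indicator in the last position there are no fraction bits left and the value is a bare power of ten. A real implementation would precompute the codebook once and search it in sorted order rather than scanning all 256 entries.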
So 8-bit precision is definitely in use, even for optimizer statistics computed from gradients, and the method in the paper builds nicely on the number format above:
In this paper, we develop the first optimizers that use 8-bit statistics […] As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
For GPUs, this makes 8-bit optimizers faster than regular 32-bit optimizers, as we show in Section 3.
2.2 Dynamic Quantization
In this work, we extend dynamic tree quantization Section 1.3 for non-signed input tensors by re-purposing the sign bit. Since the second Adam state is strictly positive, the sign bit is not needed. Instead of just removing the sign bit, we opt to extend dynamic tree quantization with a fixed bit for the fraction. This extension is motivated by the observation that the second Adam state varies around 3-5 orders of magnitude during the training of a language model. In comparison, dynamic tree quantization already has a range of 7 orders of magnitude. We refer to this quantization as dynamic quantization to distinguish it from dynamic tree quantization in our experiments.
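As a rough illustration of the unsigned variant quoted above, here is a Python sketch where the former sign bit is given to the fraction, so at least one fraction bit always exists. The layout (zero run and indicator within the top 7 bits, lowest bit always fraction), the function name, and the all-zero case are my assumptions; the paper and its open-source code may assign values to codes differently.

```python
def decode_unsigned_dynamic(byte: int) -> float:
    """Decode one 8-bit code under an ASSUMED unsigned layout:
    [k zero bits][indicator 1][7 - k fraction bits], with k in 0..6,
    so the lowest bit is always a fraction bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    top = byte >> 1                         # bits that may hold zeros + indicator
    if top == 0:
        return 0.0                          # assumption: no indicator bit means zero
    zeros = 7 - top.bit_length()            # zero bits before the indicator
    frac_bits = 7 - zeros                   # always at least 1 (the fixed bit)
    frac = byte & ((1 << frac_bits) - 1)
    # Assumption: linear fraction on [0, 1], scaled by a power of ten.
    return frac / ((1 << frac_bits) - 1) * 10.0 ** (-zeros)
```

Under these assumptions the finest step near 1.0 is 1/127 (7 fraction bits), which is where the 1/127 figure mentioned earlier comes from, while the smallest exponent is still 10^-6, so the range comfortably covers the 3-5 orders of magnitude the paper mentions.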
Note also that there is a package for a logarithmic number format, e.g. ULogFloat16 (which could be extended to ULogFloat8), that would be better than Float32. It might be better than Float16 too, though for your case I’m not sure; at least some 8-bit format would be better. That package stores a sign (which could be omitted) and can support 0.0 by special-casing, even though log(0) isn’t defined.
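For the idea behind a logarithmic format, here is a small Python sketch of an unsigned log-domain number. The class `ULog` and its methods are hypothetical names for illustration only and are not the Julia package's API; using `-inf` as the stored logarithm for zero is one way to do the special-casing, and I haven't checked that the package does it the same way.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ULog:
    """Hypothetical unsigned log-domain number: stores log(x) for x >= 0."""
    log: float

    @staticmethod
    def from_float(x: float) -> "ULog":
        if x < 0:
            raise ValueError("unsigned format: x must be non-negative")
        # log(0) is undefined, so zero is special-cased as -inf.
        return ULog(-math.inf if x == 0 else math.log(x))

    def to_float(self) -> float:
        return math.exp(self.log)           # exp(-inf) gives 0.0

    def __mul__(self, other: "ULog") -> "ULog":
        # Multiplication becomes addition of logs; -inf propagates as zero.
        return ULog(self.log + other.log)
```

The point of storing the logarithm is that the relative error is roughly constant over a huge range, so few bits go a long way. An 8-bit version would additionally round the stored logarithm to a small fixed-point grid, which this sketch leaves out.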