Thanks, great to know!
Not only does the float8 (E4M3) format match float16 across the board (with one exception), but where it matters (for the largest models) it slightly beats it!
i.e. if you look at: https://arxiv.org/pdf/2209.05433.pdf
Table 5: […] For F1 metrics higher is better, for perplexity lower is better. Best 8-bit result is bolded
FP8 comes out better for BERT Large and for GPT-3 6.7B (the largest model in that table).
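For reference, here's my quick sketch of how an E4M3 byte decodes, going by the paper's description as I read it (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, only S.1111.111 reserved for NaN):

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (per the FP8 paper's layout, as I understand it)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits, bias 7
    man = byte & 0x7          # 3 mantissa bits
    if exp == 0xF and man == 0x7:
        return float("nan")   # only S.1111.111 is reserved for NaN; there are no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormals
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normals, up to +/-448

# Quick check of the extremes:
print(decode_e4m3(0b0_1111_110))  # 448.0, the largest finite E4M3 value
print(decode_e4m3(0b0_0000_001))  # 2**-9 ~= 0.00195, the smallest subnormal
```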
This gives me some confidence that I might be on a good track with my 8x denser format. Note this part:
8-bit inference deployment is greatly simplified by FP8 training, as inference and training use the same datatypes. This is in contrast to int8 inference with networks trained in 32- or 16-bit floating point, which require post-training quantization (PTQ) calibration and sometimes quantization-aware training (QAT) in order to maintain model accuracy. Furthermore, even with quantization aware training some int8-quantized models may not completely recover the accuracy achieved with floating point [1].
and:
Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models
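To spell out the PTQ-calibration point from the first quote: int8 deployment needs a separate pass over sample data just to pick the scales, roughly like this (a generic symmetric per-tensor scheme, not the paper's exact recipe), and that's the extra machinery FP8 avoids because training and inference already share the datatype:

```python
import numpy as np

def calibrate_scale(activations: np.ndarray) -> float:
    # Post-training calibration: pick a scale from sample data
    # (max-abs here; real PTQ often uses percentiles or KL-based clipping).
    return float(np.abs(activations).max()) / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# The network was trained in float; the int8 scale only exists because
# we ran this extra calibration step on representative data.
calib_batch = np.random.randn(1024).astype(np.float32)
scale = calibrate_scale(calib_batch)
x = np.random.randn(8).astype(np.float32)
print(dequantize(quantize_int8(x, scale), scale) - x)  # per-element quantization error
```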
My worry with binary and ternary networks was that they might not work for Transformers, or only for some very specific neural-network applications.
I would want my format to be a drop-in replacement in neural networks, just a new datatype, without needing any new hyperparameters.
Since int8 is worse than float8, what hope would there be for binary or ternary, which are in effect at best int1 or int2? My format doesn't strictly depend on them working that well, though.
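For concreteness, by "in effect int1 or int2" I mean weight mappings roughly like this (a generic threshold ternarization, just as an illustration, not any particular paper's method):

```python
import numpy as np

def ternarize(w: np.ndarray, threshold_frac: float = 0.7) -> np.ndarray:
    # Map each weight to {-1, 0, +1}: effectively a 2-bit (int2) code per weight.
    # Binary nets drop the 0 and keep only the sign: effectively int1.
    t = threshold_frac * np.abs(w).mean()  # threshold_frac is a common heuristic, just an example value
    return np.sign(w) * (np.abs(w) > t)

w = np.random.randn(6)
print(w)
print(ternarize(w))  # values in {-1.0, 0.0, 1.0}
```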
Since they are designing hardware, it's understandable that they go with a format for a scalar number (even though SIMD vectors are a thing, they are usually just repeated scalars; shared-exponent formats are also a thing in some exotic hardware that I believe didn't catch on, Intel bought that company).
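By shared exponent I mean block-floating-point-style encoding, something like this toy sketch (not any specific vendor's format):

```python
import numpy as np

def block_fp_encode(x: np.ndarray, mantissa_bits: int = 7):
    # One exponent shared by the whole block; each value keeps only a small
    # integer mantissa. (Toy block floating point, illustrative only.)
    shared_exp = int(np.ceil(np.log2(np.abs(x).max() + 1e-30)))
    scale = 2.0 ** (shared_exp - mantissa_bits)
    mantissas = np.clip(np.round(x / scale), -(2**mantissa_bits), 2**mantissa_bits - 1)
    return shared_exp, mantissas.astype(np.int16)

def block_fp_decode(shared_exp: int, mantissas: np.ndarray, mantissa_bits: int = 7):
    return mantissas.astype(np.float32) * 2.0 ** (shared_exp - mantissa_bits)

x = np.random.randn(16).astype(np.float32)
e, m = block_fp_encode(x)
print(np.max(np.abs(block_fp_decode(e, m) - x)))  # small reconstruction error
```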
But my format would only be 1-bit when a network is fully dense (or rather, when 58 of 64 values in a chunk are allowed), 2-bit when you do away with 32 values in a row out of a 64-value chunk, and then "7.25 bits for 8 values" when only 8 values are left in a row of 64.
So my hypothesis, or hope, is that networks would be at least that sparse (on average).
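Just to make the per-value arithmetic concrete, here's a throwaway helper, assuming (purely as an illustration, this isn't necessarily my final layout) a 64-value chunk where a 58-bit payload gets shared among whatever values are kept:

```python
def bits_per_kept_value(payload_bits: int, kept_values: int) -> float:
    # Fixed-size chunk: the payload is split evenly among the values that survive,
    # so the denser the chunk, the fewer bits each surviving value gets.
    return payload_bits / kept_values

# Assuming a 58-bit payload per 64-value chunk (illustrative numbers only):
for kept in (58, 8):
    print(kept, bits_per_kept_value(58, kept))
# 58 kept -> 1.0 bits per value, 8 kept -> 7.25 bits per value
```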