Thanks, great to know!
Not only does the float8 (E4M3) format match float16 across the board (with one exception), but where it matters (for the largest models) it slightly beats it!
i.e. if you look at: https://arxiv.org/pdf/2209.05433.pdf
Table 5: […] For F1 metrics higher is better, for perplexity lower is better. Best 8-bit result is bolded
FP8 comes out better for BERT Large and for GPT-3 6.7B (the largest model in that table).
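For reference, here's my quick sketch of how an E4M3 byte decodes, going by the paper's description as I read it (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, only S.1111.111 reserved for NaN):

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (per the FP8 paper's layout, as I understand it)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits, bias 7
    man = byte & 0x7          # 3 mantissa bits
    if exp == 0xF and man == 0x7:
        return float("nan")   # only S.1111.111 is reserved for NaN; there are no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6         # subnormals
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # normals, up to +/-448

# Quick check of the extremes:
print(decode_e4m3(0b0_1111_110))  # 448.0, the largest finite E4M3 value
print(decode_e4m3(0b0_0000_001))  # 2**-9 ~= 0.00195, the smallest subnormal
```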
This gives me some confidence that I might be on a good track with my 8x denser format. Note this part:
8-bit inference deployment is greatly simplified by FP8 training, as inference and training use the same datatypes. This is in contrast to int8 inference with networks trained in 32- or 16-bit floating point, which require post-training quantization (PTQ) calibration and sometimes quantization-aware training (QAT) in order to maintain model accuracy. Furthermore, even with quantization aware training some int8-quantized models may not completely recover the accuracy achieved with floating point [1].
and:
Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models
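To spell out the PTQ-calibration point from the first quote: int8 deployment needs a separate pass over sample data just to pick the scales, roughly like this (a generic symmetric per-tensor scheme, not the paper's exact recipe), and that's the extra machinery FP8 avoids because training and inference already share the datatype:

```python
import numpy as np

def calibrate_scale(activations: np.ndarray) -> float:
    # Post-training calibration: pick a scale from sample data
    # (max-abs here; real PTQ often uses percentiles or KL-based clipping).
    return float(np.abs(activations).max()) / 127.0

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# The network was trained in float; the int8 scale only exists because
# we ran this extra calibration step on representative data.
calib_batch = np.random.randn(1024).astype(np.float32)
scale = calibrate_scale(calib_batch)
x = np.random.randn(8).astype(np.float32)
print(dequantize(quantize_int8(x, scale), scale) - x)  # per-element quantization error
```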
My worry with binary and ternary networks was that they might not work for Transformers, or only for some very specific neural-network applications.
I would want my format to be a drop-in replacement in neural networks, just a new datatype, without needing any new hyperparameters.
Since int8 is worse than float8, what hope would there be for binary or ternary, which are in effect at best int1 or int2? My format doesn't strictly depend on them working that well, though.
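For concreteness, by "in effect int1 or int2" I mean weight mappings roughly like this (a generic threshold ternarization, just as an illustration, not any particular paper's method):

```python
import numpy as np

def ternarize(w: np.ndarray, threshold_frac: float = 0.7) -> np.ndarray:
    # Map each weight to {-1, 0, +1}: effectively a 2-bit (int2) code per weight.
    # Binary nets drop the 0 and keep only the sign: effectively int1.
    t = threshold_frac * np.abs(w).mean()  # threshold_frac is a common heuristic, just an example value
    return np.sign(w) * (np.abs(w) > t)

w = np.random.randn(6)
print(w)
print(ternarize(w))  # values in {-1.0, 0.0, 1.0}
```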
Since they are designing hardware, it's understandable that they go with a format for a scalar number (even though SIMD vectors are a thing, they are usually just repeated scalars; shared-exponent formats are also a thing in some exotic hardware that I believe didn't catch on, Intel bought that company).
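By shared exponent I mean block-floating-point-style encoding, something like this toy sketch (not any specific vendor's format):

```python
import numpy as np

def block_fp_encode(x: np.ndarray, mantissa_bits: int = 7):
    # One exponent shared by the whole block; each value keeps only a small
    # integer mantissa. (Toy block floating point, illustrative only.)
    shared_exp = int(np.ceil(np.log2(np.abs(x).max() + 1e-30)))
    scale = 2.0 ** (shared_exp - mantissa_bits)
    mantissas = np.clip(np.round(x / scale), -(2**mantissa_bits), 2**mantissa_bits - 1)
    return shared_exp, mantissas.astype(np.int16)

def block_fp_decode(shared_exp: int, mantissas: np.ndarray, mantissa_bits: int = 7):
    return mantissas.astype(np.float32) * 2.0 ** (shared_exp - mantissa_bits)

x = np.random.randn(16).astype(np.float32)
e, m = block_fp_encode(x)
print(np.max(np.abs(block_fp_decode(e, m) - x)))  # small reconstruction error
```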
But my format would only be 1-bit when a network is fully dense (or rather, when 58 of 64 values in a chunk are allowed), 2-bit when you do away with 32 values in a row out of a 64-value chunk, and then "7.25 bits for 8 values" when only 8 values are left in a row of 64.
So my hypothesis, or hope, is that networks would be at least that sparse (on average).
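Just to make the per-value arithmetic concrete, here's a throwaway helper, assuming (purely as an illustration, this isn't necessarily my final layout) a 64-value chunk where a 58-bit payload gets shared among whatever values are kept:

```python
def bits_per_kept_value(payload_bits: int, kept_values: int) -> float:
    # Fixed-size chunk: the payload is split evenly among the values that survive,
    # so the denser the chunk, the fewer bits each surviving value gets.
    return payload_bits / kept_values

# Assuming a 58-bit payload per 64-value chunk (illustrative numbers only):
for kept in (58, 8):
    print(kept, bits_per_kept_value(58, kept))
# 58 kept -> 1.0 bits per value, 8 kept -> 7.25 bits per value
```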