This is very intriguing (though I think the benefit of the method doesn't carry over to 4-bit floating point; that may not matter much, since fp4 is likely useless for training neural networks, and while it is useful for inference, even there 2/3-bit integers are taking over).
The proposed ℒ-Mul method will lead to a significantly reduced energy consumption for both model training and inference. […] multiplying two 32-bit floating point numbers (fp32) costs four times as much energy as adding two fp32 numbers, and 37 times as much as adding two 32-bit integers (int32). The rough energy costs for various operations are shown in Table 1. In PyTorch (Paszke et al., 2019), the default precision for accumulating tensor multiplication results is set to fp32. While I/O and control operations are not considered, approximating fp32 multiplications with int32 additions consumes only 1/37 ≈ 2.7% of the energy.
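For intuition on how a floating-point multiply can be traded for an integer add at all: the bit pattern of a positive float is roughly a scaled, biased logarithm of its value, so adding two bit patterns and subtracting the bias once approximates the product. Here is a minimal sketch of that old Mitchell-style trick (not the paper's exact ℒ-Mul, which adds a correction term), just to show the shape of the saving:

```python
import struct

def approx_mul_fp32(x, y):
    """Approximate x*y for positive, normal floats with a single integer
    addition on their fp32 bit patterns (Mitchell-style log trick; this is
    NOT the paper's exact L-Mul, which adds a small correction term)."""
    xi = struct.unpack("<I", struct.pack("<f", x))[0]
    yi = struct.unpack("<I", struct.pack("<f", y))[0]
    # The bit pattern holds (biased exponent, mantissa): adding two patterns
    # adds the exponents and mantissa fields, so subtract the bias (127 << 23)
    # once to avoid counting it twice.
    zi = (xi + yi - (127 << 23)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", zi))[0]

print(approx_mul_fp32(3.0, 5.0))  # 14.0, vs. the exact 15.0 (about 7% low)
```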
Actually, their 4-bit-mantissa version is more accurate and faster than fp8:
we find that ℒ-Mul is more accurate than fp8_e5m2 with evenly distributed operands. However, the weight distribution is often biased in pretrained LLMs. Based on the combined weight distribution of five popular LLMs, we find that ℒ-Mul can achieve higher precision beyond fp8_e4m3 with 5-bit mantissa operands in practice. We support both claims with estimated errors detailed in Appendix A.
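For reference, this is the approximation itself as I read it from the paper: writing each operand as (1 + x_m) · 2^(x_e), the mantissa product x_m · y_m is dropped and replaced by a constant 2^(−l(m)) that depends only on the mantissa width m; my transcription of l(m) is m for m ≤ 3, 3 for m = 4, and 4 for m > 4. A rough Python sketch, ignoring operand quantization, rounding, and special values:

```python
import math

def l_mul(x, y, mantissa_bits=4):
    """Sketch of L-Mul as I read it: (1 + x_m)(1 + y_m) is replaced by
    1 + x_m + y_m + 2**(-l(m)), i.e. the mantissa product becomes a constant
    and the exponents are simply added. Operand quantization, rounding, and
    zero/inf/nan handling are omitted; treat l(m) below as my transcription."""
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    x, y = abs(x), abs(y)
    ex, ey = math.floor(math.log2(x)), math.floor(math.log2(y))
    xm, ym = x / 2.0 ** ex - 1.0, y / 2.0 ** ey - 1.0  # fractional mantissas in [0, 1)
    m = mantissa_bits
    l = m if m <= 3 else (3 if m == 4 else 4)          # l(m) per the paper
    return sign * (1.0 + xm + ym + 2.0 ** -l) * 2.0 ** (ex + ey)

print(l_mul(3.0, 5.0))  # 15.0 (here the 2**-3 offset happens to equal x_m * y_m)
print(l_mul(3.0, 7.0))  # 19.0, vs. the exact 21.0
```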
2.3.2 Gate Complexity Estimation
[…]
In conclusion, the total amount of gate-level computation needed by fp16 and fp8 Mul can be estimated as
N^×_fp16 ≈ 584, N^×_fp8-e4m3 ≈ 325, N^×_fp8-e5m2 ≈ 296    (6)
ℒ-Mul consumes 1 XOR for sign prediction, 1 half adder, and k−2 full adders. The total gate count needed by 16-bit and 8-bit ℒ-Mul can be estimated as follows,
[…] N^ℒ-mul_fp16 ≈ 256, N^ℒ-mul_fp8 ≈ 157    (7)
ℒ-Mul with fp8_e4m3 and fp8_e5m2 operands has similar complexity, since exponent offsets are typically implemented with 8-bit unsigned integer adders. As estimated, fp16 ℒ-Mul requires fewer gates than fp8 multiplication, and fp8 ℒ-Mul is significantly more efficient.
To summarize the error and complexity analysis, ℒ-Mul is both more efficient and more accurate than fp8 multiplication.
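The excerpt doesn't spell out the gate model behind (6) and (7), so those totals can't be reproduced from what is quoted here. Purely to show how such estimates are assembled, here is a back-of-the-envelope count under textbook assumptions of my own (AND-array mantissa multiplier, 5-gate full adders, 2-gate half adders, no rounding or normalization logic); the absolute numbers will not match the paper's, but the gap between a mantissa multiplier and a single narrow addition is the point:

```python
# Per-primitive gate costs and the decomposition below are my own textbook
# assumptions, not the paper's model; the totals will NOT match (6)-(7).
FULL_ADDER = 5   # 2 XOR + 2 AND + 1 OR
HALF_ADDER = 2   # 1 XOR + 1 AND
XOR = AND = 1

def ripple_adder(n):
    """n-bit ripple-carry adder: 1 half adder + (n - 1) full adders."""
    return HALF_ADDER + (n - 1) * FULL_ADDER

def array_multiplier(n):
    """Unsigned n x n array multiplier: n^2 partial-product ANDs plus
    roughly n half adders and n*(n - 2) full adders to sum them."""
    return n * n * AND + n * HALF_ADDER + n * (n - 2) * FULL_ADDER

def fp_mul_gates(exp_bits, man_bits):
    """fp multiply: sign XOR + (m+1)x(m+1) mantissa multiplier + two small
    adders for the exponent sum and bias subtraction (rounding ignored)."""
    return XOR + array_multiplier(man_bits + 1) + 2 * ripple_adder(exp_bits + 1)

def lmul_core_gates(total_bits):
    """L-Mul core per the quoted description (1 XOR + 1 half adder + (k - 2)
    full adders); k is taken as the width minus the sign bit, which is my
    guess at the paper's k."""
    k = total_bits - 1
    return XOR + HALF_ADDER + (k - 2) * FULL_ADDER

print(fp_mul_gates(4, 3), "vs", lmul_core_gates(8))    # fp8_e4m3: 109 vs 28
print(fp_mul_gates(5, 10), "vs", lmul_core_gates(16))  # fp16:     693 vs 68
```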
Note:
Related Work
[…]
Rounding and quantization. Standard neural network weights are stored as 32-bit or 16-bit FP tensors. However, the full-sized weights take a considerable amount of GPU memory. To improve storage efficiency, both weight storage and computation can be conducted at lower precision, for example, using 16-bit, 8-bit, or 4-bit FP and Int (fp16, bf16 (Kalamkar et al., 2019), fp8-e4m3, fp8-e5m2 (Micikevicius et al., 2023), int8 (Dettmers et al., 2022), fp4, and int4 (Dettmers et al., 2024)) tensors to represent model weights.
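As a concrete picture of what storing and computing at lower precision means, here is a minimal symmetric per-tensor int8 round-to-nearest quantizer (a generic sketch of my own, not the specific scheme of any of the cited papers):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale so that the largest
    |weight| maps to 127, round to nearest, keep the scale for dequantizing.
    A generic sketch, not the scheme of any particular cited paper."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(dequantize(q, s) - w).max())  # worst-case error is about scale / 2
```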
They only compared to fp8, and with fp4 (or int4) being so small, wouldn’t a 32-byte lookup table be faster for any operation?
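For concreteness, here is roughly what I have in mind for int4: a complete pairwise product table has only 16 × 16 = 256 entries (the byte count depends on the output type you store; every signed int4 product fits in an int8), and the "multiply" becomes one indexed load. A sketch:

```python
import numpy as np

# Complete signed int4 x int4 product table: nibble values -8..7 on each axis,
# 16 * 16 = 256 entries; every product fits in an int8, so this table is 256 bytes.
VALS = np.arange(-8, 8, dtype=np.int16)
TABLE = (VALS[:, None] * VALS[None, :]).astype(np.int8)  # shape (16, 16)

def int4_mul(a, b):
    """Multiply two signed int4 values (-8..7) with a single table lookup."""
    return int(TABLE[a + 8, b + 8])

print(int4_mul(-3, 7))  # -21
```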