[Graph]CodeBERT; and e.g. (2- to) 8-bit int networks better than float, for Transformers/BERT


Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as […] However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST) […] achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
Table 2: Results on code clone detection. GraphCodeBERT outperforms other pre-trained methods significantly (p < 0.01)

There seem to be fewer than 170 lines to support each language (also in other files?): 166 for Python, fewer for Ruby (does anyone want to add Julia support?): CodeBERT/DFG.py at master · microsoft/CodeBERT · GitHub

CodeBERT in total seems surprisingly short; GraphCodeBERT is quite a bit longer.


I-BERT paper:

Specifically, we process Embedding and matrix multiplication (MatMul) with INT8 multiplication and INT32 accumulation. The following non-linear operations (GELU, Softmax, and LayerNorm) are then calculated on the INT32 accumulated result and then requantized back to INT8. We represent all parameters and activations in the entire computational graph with integers, and we never cast them into floating point.
We show that INT8 inference achieves up to 4× speedup as compared to FP32 inference.
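The INT8-multiply / INT32-accumulate / requantize pattern described above can be sketched in a few lines of NumPy. This is a hedged illustration, not I-BERT's actual kernels: the function names are mine, and the float multiply in the requantization step is a simplification (real integer-only pipelines express the rescale as an integer multiplier plus a bit shift):

```python
import numpy as np

def quantize(x, scale):
    """Symmetric quantization of a float array to INT8."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def int8_matmul_requant(a_q, b_q, scale_a, scale_b, scale_out):
    """INT8 x INT8 matmul with an INT32 accumulator, requantized to INT8."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # INT32 accumulation
    # Requantize: the combined scale maps the INT32 result onto the INT8 grid.
    # (Shown as a float multiply for clarity; integer-only implementations
    # fold this into a fixed-point multiplier + shift.)
    m = scale_a * scale_b / scale_out
    return np.clip(np.round(acc * m), -127, 127).astype(np.int8)

# usage: compare against the FP32 matmul
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
sa, sb = np.abs(a).max() / 127, np.abs(b).max() / 127
ref = a @ b
so = np.abs(ref).max() / 127
out_q = int8_matmul_requant(quantize(a, sa), quantize(b, sb), sa, sb, so)
err = np.abs(out_q.astype(np.float32) * so - ref).max()  # small quantization error
```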

I-BERT: Integer-only BERT Quantization

Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4–4.0× for INT8 inference on a T4 GPU system as compared to FP32.
It references these papers:


BinaryBERT: Pushing the Limit of BERT Quantization

Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
However, none of them achieves the binarization (1-bit). As the limit of quantization, weight binarization could bring at most 32× reduction in model size and replace most floating-point multiplications with additions. Moreover, quantizing activations to 8-bit or 4-bit further replaces the floating-point addition with int8 and int4 addition, decreasing the energy burden and the area usage on chips.
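The "multiplications become additions" point is easy to see in a sketch: once weights are constrained to {−1, +1} with one float scale α per matrix (the usual binarization setup; BinaryBERT's actual ternary-weight-splitting scheme is more involved), the matmul reduces to signed sums. A hedged NumPy illustration with names of my own choosing:

```python
import numpy as np

def binarize(w):
    """Approximate w by alpha * sign(w): one float scale, ±1 entries."""
    alpha = np.abs(w).mean()
    w_sign = np.where(w >= 0, 1, -1).astype(np.int8)
    return w_sign, alpha

def binary_matmul(x, w_sign, alpha):
    """x @ (alpha * w_sign): every 'multiply' by ±1 is just an add/subtract."""
    return alpha * (x @ w_sign)

rng = np.random.default_rng(1)
w = rng.standard_normal((8, 4)).astype(np.float32)
x = rng.standard_normal((2, 8)).astype(np.float32)
w_sign, alpha = binarize(w)
out = binary_matmul(x, w_sign, alpha)

# Same result written out as signed accumulation: each term is +x[i,k]
# or -x[i,k] depending on the weight's sign -- no multiplications needed.
manual = alpha * np.array([
    sum(x[i, k] * w_sign[k, j] for k in range(8))
    for i in range(2) for j in range(4)
]).reshape(2, 4)
```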

The papers above, at least at the top, mention inference, not training, so I was curious whether 8-bit (or lower) ints can work for training too. However, it references this paper:


The main idea of our method is that the KD technique is leveraged to transfer the knowledge from a “teacher” model to a “student” model when exploiting LSQ to quantize that “student” model during the quantization training process. Extensive experiment results on GLUE benchmark and SQuAD demonstrate that our proposed KDLSQ-BERT not only performs effectively when doing different bit (e.g. 2-bit ∼ 8-bit) quantization, but also outperforms the existing BERT quantization methods, and even achieves comparable performance as the full-precision baseline model while obtaining 14.9x compression ratio.
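The 2-bit ∼ 8-bit range is easy to play with using a symmetric fake-quantizer, i.e. the forward pass that LSQ-style methods wrap. In LSQ the step size is a learned parameter (trained with a straight-through estimator); here I just pick it from the data range, which is a hedged simplification, not the paper's method:

```python
import numpy as np

def fake_quant(x, step, bits):
    """Symmetric uniform fake-quantization: snap x onto a signed
    `bits`-wide integer grid with step size `step`, then dequantize.
    (In LSQ `step` would be learned; here it is fixed for illustration.)"""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / step), -qmax - 1, qmax)
    return q * step

rng = np.random.default_rng(2)
x = rng.standard_normal(1000).astype(np.float32)

errs = {}
for bits in (8, 4, 2):
    step = np.abs(x).max() / (2 ** (bits - 1) - 1)
    errs[bits] = np.abs(fake_quant(x, step, bits) - x).mean()
# fewer bits -> coarser grid -> larger reconstruction error
```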

I thought all of this might not be common knowledge; then I found the Nvidia link I put at the top of the section, and this one is also intriguing:

The NVIDIA DGX SuperPOD with 92 DGX-2H nodes set a new record by training BERT-Large in just 47 minutes.

It’s still a lot of hardware, and it’s unclear whether all the tricks in the book were used at the time, in 2019. Recently, in 2021, I saw that 1-bit Adam took training for a language model from days down to about 2 hours, if I recall correctly, so that may have used a lot less hardware (while still multi-GPU).

[I meant to make my own 8-bit floating point format (different from the one already available in a Julia package; optimized for software implementation), since 8 bits are enough for a lot of things (shown by Microsoft in their 8/9-bit float paper for FPGAs), mostly for ML/ANNs. I guess I’ll abandon that plan…]
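For reference, an E4M3-style 8-bit float (1 sign, 4 exponent, 3 mantissa bits, bias 7) can be decoded and encoded in a few lines. This is a hedged sketch of the general idea, not the format from the Microsoft paper or the Julia package, and it ignores NaN/Inf encodings:

```python
import numpy as np

BIAS = 7  # exponent bias for a 1-4-3 bit layout

def fp8_decode(code):
    """Decode one byte 0..255 as sign(1)/exponent(4)/mantissa(3), bias 7."""
    s = (code >> 7) & 1
    e = (code >> 3) & 0xF
    m = code & 0x7
    if e == 0:                      # subnormal: no implicit leading 1
        val = (m / 8) * 2.0 ** (1 - BIAS)
    else:                           # normal: implicit leading 1
        val = (1 + m / 8) * 2.0 ** (e - BIAS)
    return -val if s else val

# All 256 representable values, so encoding is a brute-force nearest lookup.
TABLE = np.array([fp8_decode(c) for c in range(256)])

def fp8_encode(x):
    """Round x to the nearest representable value; return its byte code."""
    return int(np.argmin(np.abs(TABLE - x)))

# usage: round-trip a few values
half = fp8_decode(fp8_encode(0.5))    # exactly representable
approx = fp8_decode(fp8_encode(0.3))  # rounded to the nearest code
```

The table-lookup encoder is deliberately naive; a real software implementation would extract the exponent and round the mantissa directly, but the lookup makes round-to-nearest behavior trivially correct.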

Is Space-Time Attention All You Need for Video Understanding?