[Graph]CodeBERT; and e.g. 2- to 8-bit integer networks doing better than float, for Transformers/BERT

A.

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as […] However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST) […] achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
[…]
Table 2: Results on code clone detection. GraphCodeBERT outperforms other pre-trained methods significantly (p < 0.01).

It looks like fewer than 170 lines support each language (plus some code in other files?), 166 for Python, fewer for Ruby (anyone want to add Julia support?): https://github.com/microsoft/CodeBERT/blob/master/GraphCodeBERT/translation/parser/DFG.py

CodeBERT in total seems surprisingly short; GraphCodeBERT is quite a bit longer.
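For intuition, the data flow graph just records which variable a value "comes from". As far as I can tell, the real DFG.py walks a tree-sitter parse per language; the sketch below is only toy string handling over simple `lhs = rhs` lines, to show what the edges look like (all names here are mine):

```julia
# Toy "comes-from" edges for straight-line assignments; nothing like
# GraphCodeBERT's tree-sitter based extractor, just for illustration.
function toy_dfg(lines)
    defined = Set{String}()
    edges = Tuple{String,String}[]               # (target, source) pairs
    for line in lines
        lhs, rhs = strip.(split(line, "="; limit=2))
        for tok in eachmatch(r"[A-Za-z_]\w*", rhs)
            tok.match in defined && push!(edges, (lhs, tok.match))
        end
        push!(defined, lhs)
    end
    edges
end

toy_dfg(["a = 1", "b = 2", "x = a + b", "y = x * 2"])
# [("x", "a"), ("x", "b"), ("y", "x")]
```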

B.

I-BERT paper:

Specifically, we process Embedding and matrix multiplication (MatMul) with INT8 multiplication and INT32 accumulation. The following non-linear operations (GELU, Softmax, and LayerNorm) are then calculated on the INT32 accumulated result and then requantized back to INT8. We represent all parameters and activations in the entire computational graph with integers, and we never cast them into floating point.
[…]
We show that INT8 inference achieves up to 4× speedup as compared to FP32 inference.
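Roughly what "INT8 multiplication, INT32 accumulation, requantize to INT8" means, as a toy sketch with per-tensor symmetric scales (names like `quant8`/`int8_matmul` are mine, not the paper's; a real integer-only pipeline would fold the rescaling into an integer multiplier + shift instead of going through float):

```julia
# Toy INT8 GEMM with INT32 accumulation and requantization back to INT8.
scaleof(x) = maximum(abs, x) / 127f0
quant8(x, s) = Int8.(clamp.(round.(x ./ s), -127, 127))

function int8_matmul(X::Matrix{Float32}, W::Matrix{Float32})
    sx, sw = scaleof(X), scaleof(W)
    Xq, Wq = quant8(X, sx), quant8(W, sw)
    acc = Int32.(Xq) * Int32.(Wq)        # INT8 * INT8 products, INT32 accumulation
    Y   = Float32.(acc) .* (sx * sw)     # value the accumulator represents
    sy  = scaleof(Y)
    quant8(Y, sy), sy                    # requantized INT8 output + its scale
end

X, W = randn(Float32, 4, 8), randn(Float32, 8, 3)
Yq, sy = int8_matmul(X, W)
maximum(abs, Float32.(Yq) .* sy .- X * W)   # small quantization error
```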

I-BERT: Integer-only BERT Quantization

Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0× for INT8 inference on a T4 GPU system as compared to FP32 inference.
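The "integer-only" part for the non-linear ops is the interesting bit: e.g. LayerNorm needs a square root, which the paper (if I read it right) computes with an integer-only iteration. A minimal sketch of that idea (Julia already has `isqrt`; this is just to show no float is needed, it's not I-BERT's exact routine):

```julia
# Integer-only floor square root via Newton's iteration; no floating point.
function int_sqrt(n::Int64)
    n <= 0 && return 0
    x = Int64(1) << ((65 - leading_zeros(n)) ÷ 2)   # initial guess ≥ √n
    while true
        y = (x + n ÷ x) ÷ 2
        y >= x && return x
        x = y
    end
end

int_sqrt(10^12)   # 1000000, same as isqrt(10^12)
```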

It references these papers:

and:

BinaryBERT: Pushing the Limit of BERT Quantization
https://arxiv.org/pdf/2012.15701.pdf

Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
[…]
However, none of them achieves the binarization (1-bit). As the limit of quantization, weight binarization could bring at most 32× reduction in model size and replace most floating-point multiplications with additions. Moreover, quantizing activations to 8-bit or 4-bit further replaces the floating-point addition with int8 and int4 addition, decreasing the energy burden and the area usage on chips.
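The point about multiplications turning into additions is easy to see in a toy sketch (per-tensor scale here, and nothing like BinaryBERT's actual training procedure; names are mine):

```julia
# Toy weight binarization: W ≈ α * sign(W), so a matrix-vector product only
# needs signed additions of the activations plus one scale per tensor.
# (Real schemes typically use a scale per row/column rather than per tensor.)
binarize(W) = (ifelse.(W .>= 0, Int8(1), Int8(-1)), sum(abs, W) / length(W))

W, x = randn(Float32, 4, 8), randn(Float32, 8)
Wb, α = binarize(W)
y_bin  = α .* (Float32.(Wb) * x)    # ±x[j] summed, then scaled once
y_full = W * x                      # full-precision reference
```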

The papers above, at least near the top, mention inference, not training, so I was curious whether 8-bit (or lower) ints can't also work for training. However, it references this paper:

and:

The main idea of our method is that the KD technique is leveraged to transfer the knowledge from a “teacher” model to a “student” model when exploiting LSQ to quantize that “student” model during the quantization training process. Extensive experiment results on GLUE benchmark and SQuAD demonstrate that our proposed KDLSQ-BERT not only performs effectively when doing different bit (e.g. 2-bit ∼ 8-bit) quantization, but also outperforms the existing BERT quantization methods, and even achieves comparable performance as the full-precision baseline model while obtaining 14.9x compression ratio.
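The two ingredients are simple to sketch: an LSQ-style fake quantizer with a learnable step size s (training would add a straight-through estimator so gradients flow through the round and to s), and a distillation loss on softened teacher/student logits. Only the forward passes are shown, and the names are mine:

```julia
# LSQ-style fake quantization: quantize with learnable step s, dequantize back.
fakequant(x, s; qmin=-128, qmax=127) = s .* clamp.(round.(x ./ s), qmin, qmax)

# Knowledge-distillation loss: soft cross-entropy between teacher and student
# logits at temperature T (the usual Hinton-style formulation).
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))
kd_loss(zs, zt; T=2.0) = -sum(softmax(zt ./ T) .* log.(softmax(zs ./ T))) * T^2
```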

I thought all of this might not be common knowledge; then I found the Nvidia link I put at the top of the section, and this one is also intriguing:

The NVIDIA DGX SuperPOD with 92 DGX-2H nodes set a new record by training BERT-Large in just 47 minutes.

It’s still a lot of hardware, and it’s unclear if all the tricks in the book were used at the time, in 2019. More recently, in 2021, I saw that 1-bit Adam brought training of a language model down from days to about 2 hours, if I recall correctly, so that may have used a lot less hardware (while still being multi-GPU).

[I meant to make my own 8-bit floating-point format (different from the one already available in a Julia package, optimized for software implementation), since 8 bits are enough for a lot of things (as Microsoft showed in its 8/9-bit float paper for FPGAs), mostly for ML/ANNs. I guess I’ll abandon that plan…]
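For reference, decoding an 8-bit float is only a few lines. Here's a toy decoder for an assumed 1-sign / 4-exponent / 3-mantissa layout with bias 7, just one possible choice, not the format from the Julia package or Microsoft's paper, and ignoring Inf/NaN encodings:

```julia
# Decode a byte as a toy 1/4/3 (sign/exponent/mantissa) float, bias 7,
# with subnormals; Inf/NaN left out for brevity.
function fp8_to_float(b::UInt8)
    s = (b >> 7) & 0x01
    e = Int((b >> 3) & 0x0f)
    m = Int(b & 0x07)
    mag = e == 0 ? (m / 8) * 2.0^(-6) : (1 + m / 8) * 2.0^(e - 7)
    s == 1 ? -mag : mag
end

fp8_to_float(0b0_0111_000)   # exponent = bias, mantissa 0 → 1.0
```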

C.
Is Space-Time Attention All You Need for Video Understanding?
https://arxiv.org/pdf/2102.05095.pdf