[Graph]CodeBERT; and e.g. 2- to 8-bit integer networks doing better than float, for Transformers/BERT

A.

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as […] However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST) […] achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
[…]
Table 2: Results on code clone detection. GraphCodeBERT outperforms other pre-trained methods significantly (p < 0.01).

It looks like fewer than 170 lines support each language (plus some code in other files?), 166 for Python, fewer for Ruby (anyone want to add Julia support?): https://github.com/microsoft/CodeBERT/blob/master/GraphCodeBERT/translation/parser/DFG.py

CodeBERT in total seems surprisingly short; GraphCodeBERT is quite a bit longer.
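For intuition, the data flow graph just records which variable a value "comes from". As far as I can tell, the real DFG.py walks a tree-sitter parse per language; the sketch below is only toy string handling over simple `lhs = rhs` lines, to show what the edges look like (all names here are mine):

```julia
# Toy "comes-from" edges for straight-line assignments; nothing like
# GraphCodeBERT's tree-sitter based extractor, just for illustration.
function toy_dfg(lines)
    defined = Set{String}()
    edges = Tuple{String,String}[]               # (target, source) pairs
    for line in lines
        lhs, rhs = strip.(split(line, "="; limit=2))
        for tok in eachmatch(r"[A-Za-z_]\w*", rhs)
            tok.match in defined && push!(edges, (lhs, tok.match))
        end
        push!(defined, lhs)
    end
    edges
end

toy_dfg(["a = 1", "b = 2", "x = a + b", "y = x * 2"])
# [("x", "a"), ("x", "b"), ("y", "x")]
```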

B.

I-BERT paper:

Specifically, we process Embedding and matrix multiplication (MatMul) with INT8 multiplication and INT32 accumulation. The following non-linear operations (GELU, Softmax, and LayerNorm) are then calculated on the INT32 accumulated result and then requantized back to INT8. We represent all parameters and activations in the entire computational graph with integers, and we never cast them into floating point.
[…]
We show that INT8 inference achieves up to 4× speedup as compared to FP32 inference.
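Roughly what "INT8 multiplication, INT32 accumulation, requantize to INT8" means, as a toy sketch with per-tensor symmetric scales (names like `quant8`/`int8_matmul` are mine, not the paper's; a real integer-only pipeline would fold the rescaling into an integer multiplier + shift instead of going through float):

```julia
# Toy INT8 GEMM with INT32 accumulation and requantization back to INT8.
scaleof(x) = maximum(abs, x) / 127f0
quant8(x, s) = Int8.(clamp.(round.(x ./ s), -127, 127))

function int8_matmul(X::Matrix{Float32}, W::Matrix{Float32})
    sx, sw = scaleof(X), scaleof(W)
    Xq, Wq = quant8(X, sx), quant8(W, sw)
    acc = Int32.(Xq) * Int32.(Wq)        # INT8 * INT8 products, INT32 accumulation
    Y   = Float32.(acc) .* (sx * sw)     # value the accumulator represents
    sy  = scaleof(Y)
    quant8(Y, sy), sy                    # requantized INT8 output + its scale
end

X, W = randn(Float32, 4, 8), randn(Float32, 8, 3)
Yq, sy = int8_matmul(X, W)
maximum(abs, Float32.(Yq) .* sy .- X * W)   # small quantization error
```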

I-BERT: Integer-only BERT Quantization

Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4-4.0× for INT8 inference on a T4 GPU system as compared to FP32 inference.
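The "integer-only" part for the non-linear ops is the interesting bit: e.g. LayerNorm needs a square root, which the paper (if I read it right) computes with an integer-only iteration. A minimal sketch of that idea (Julia already has `isqrt`; this is just to show no float is needed, it's not I-BERT's exact routine):

```julia
# Integer-only floor square root via Newton's iteration; no floating point.
function int_sqrt(n::Int64)
    n <= 0 && return 0
    x = Int64(1) << ((65 - leading_zeros(n)) ÷ 2)   # initial guess ≥ √n
    while true
        y = (x + n ÷ x) ÷ 2
        y >= x && return x
        x = y
    end
end

int_sqrt(10^12)   # 1000000, same as isqrt(10^12)
```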

It references these papers:

and:

BinaryBERT: Pushing the Limit of BERT Quantization
https://arxiv.org/pdf/2012.15701.pdf

Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
[…]
However, none of them achieves the binarization (1-bit). As the limit of quantization, weight binarization could bring at most 32× reduction in model size and replace most floating-point multiplications with additions. Moreover, quantizing activations to 8-bit or 4-bit further replaces the floating-point addition with int8 and int4 addition, decreasing the energy burden and the area usage on chips.
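The point about multiplications turning into additions is easy to see in a toy sketch (per-tensor scale here, and nothing like BinaryBERT's actual training procedure; names are mine):

```julia
# Toy weight binarization: W ≈ α * sign(W), so a matrix-vector product only
# needs signed additions of the activations plus one scale per tensor.
# (Real schemes typically use a scale per row/column rather than per tensor.)
binarize(W) = (ifelse.(W .>= 0, Int8(1), Int8(-1)), sum(abs, W) / length(W))

W, x = randn(Float32, 4, 8), randn(Float32, 8)
Wb, α = binarize(W)
y_bin  = α .* (Float32.(Wb) * x)    # ±x[j] summed, then scaled once
y_full = W * x                      # full-precision reference
```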

The papers above, at least near the top, mention inference, not training, so I was curious whether 8-bit (or lower) ints can't also work for training. However, it references this paper:

and:

The main idea of our method is that the KD technique is leveraged to transfer the knowledge from a “teacher” model to a “student” model when exploiting LSQ to quantize that “student” model during the quantization training process. Extensive experiment results on GLUE benchmark and SQuAD demonstrate that our proposed KDLSQ-BERT not only performs effectively when doing different bit (e.g. 2-bit ∼ 8-bit) quantization, but also outperforms the existing BERT quantization methods, and even achieves comparable performance as the full-precision baseline model while obtaining 14.9x compression ratio.
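The two ingredients are simple to sketch: an LSQ-style fake quantizer with a learnable step size s (training would add a straight-through estimator so gradients flow through the round and to s), and a distillation loss on softened teacher/student logits. Only the forward passes are shown, and the names are mine:

```julia
# LSQ-style fake quantization: quantize with learnable step s, dequantize back.
fakequant(x, s; qmin=-128, qmax=127) = s .* clamp.(round.(x ./ s), qmin, qmax)

# Knowledge-distillation loss: soft cross-entropy between teacher and student
# logits at temperature T (the usual Hinton-style formulation).
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))
kd_loss(zs, zt; T=2.0) = -sum(softmax(zt ./ T) .* log.(softmax(zs ./ T))) * T^2
```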

I thought all of this might not be common knowledge; then I found the Nvidia link I put at the top of the section, and this one is also intriguing:

The NVIDIA DGX SuperPOD with 92 DGX-2H nodes set a new record by training BERT-Large in just 47 minutes.

It’s still a lot of hardware, and it’s unclear if all the tricks in the book were used at the time, in 2019. More recently, in 2021, I saw that 1-bit Adam brought training of a language model down from days to about 2 hours, if I recall correctly, so that may have used a lot less hardware (while still being multi-GPU).

[I meant to make my own 8-bit floating-point format (different from the one already available in a Julia package, optimized for software implementation), since 8 bits are enough for a lot of things (as Microsoft showed in its 8/9-bit float paper for FPGAs), mostly for ML/ANNs. I guess I’ll abandon that plan…]
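For reference, decoding an 8-bit float is only a few lines. Here's a toy decoder for an assumed 1-sign / 4-exponent / 3-mantissa layout with bias 7, just one possible choice, not the format from the Julia package or Microsoft's paper, and ignoring Inf/NaN encodings:

```julia
# Decode a byte as a toy 1/4/3 (sign/exponent/mantissa) float, bias 7,
# with subnormals; Inf/NaN left out for brevity.
function fp8_to_float(b::UInt8)
    s = (b >> 7) & 0x01
    e = Int((b >> 3) & 0x0f)
    m = Int(b & 0x07)
    mag = e == 0 ? (m / 8) * 2.0^(-6) : (1 + m / 8) * 2.0^(e - 7)
    s == 1 ? -mag : mag
end

fp8_to_float(0b0_0111_000)   # exponent = bias, mantissa 0 → 1.0
```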

C.
Is Space-Time Attention All You Need for Video Understanding?
https://arxiv.org/pdf/2102.05095.pdf