I am new to LLMs but thought I would give them a shot using
Transformers.jl. This worked rather well after creating an account at HuggingFace and getting permission for Llama-7B from Meta. Yet I was surprised how much GPU RAM is needed even for this “small” model: running the notebook example works, but it maxes out the memory of a 48 GB RTX A6000.
Answering a question like the example “Can you explain to me briefly what is the Julia programming language?” takes about one minute on that GPU.
I wonder whether one can speed things up by using 16-bit floats for the weights. Is this supported by
Transformers.jl? Can one use the same model and convert it, or does one need another one from HuggingFace?
Or does the system always claim all GPU memory anyway, and is this speed what is to be expected?
Thanks for any help!
I don’t know if this will help:
State of quantization
16-bit with Transformers.jl will work out of the box, but it will hallucinate. I believe the problem is that when we convert weights and embeddings to Float16, the entire computation is carried out in Float16. What HF’s transformers does, I believe, is store the weights in Float16 but perform the computation in Float32, which gives them the best of both worlds. I think we would need to either (i) convert weights to Float32 just before use, or (ii) define a matmul for multiplying Float16 and Float32 matrices.
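A minimal sketch of that idea in plain Julia (an illustration, not Transformers.jl code): store the weights in Float16 to halve memory, but widen them to Float32 just before the matmul, so only the one-time quantization of the weights introduces error.

```julia
W32 = rand(Float32, 4, 8)      # full-precision reference weights
x   = rand(Float32, 8)         # activations kept in Float32

W16 = Float16.(W32)            # stored form: half the memory

y_ref   = W32 * x              # full Float32 reference
y_mixed = Float32.(W16) * x    # widen just before use, compute in Float32
y_naive = W16 * Float16.(x)    # everything in Float16 (the hallucination-prone path)

eltype(y_mixed), eltype(y_naive)   # (Float32, Float16)
```

The only error in `y_mixed` comes from rounding the weights to Float16 once; `y_naive` additionally rounds every intermediate product and partial sum to Float16, which is where the accumulated error comes from.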
You may want to consider Llama2.jl (GitHub: cafaxo/Llama2.jl, “llama2.c but in Julia”); for the same prompt, I got 10 tokens per second on a MacBook Pro.
Thanks for this hint.
I was using the Tiny-Llama example (the download link is in the code), which indeed gave me roughly 200 tokens per second. Nice.
But then I tried downloading
llama-2-7b-chat.ggmlv3.q4_K_S.bin, and I am not sure where to find it. It was listed on HuggingFace, but I could not find a way to download anything, despite trying for about an hour. The description says something about a download button, but I could not find one (while logged in). It also says something about this being an old format.
Is there any link for this?
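For what it is worth, files in HuggingFace model repos can usually be fetched directly through the repo's `resolve` URL, without any download button. A sketch using Julia's stdlib `Downloads` (the repo path below is an assumption; TheBloke's GGML repos hosted files with this naming scheme, so adjust it to wherever the file actually lives):

```julia
using Downloads

repo = "TheBloke/Llama-2-7B-Chat-GGML"   # assumption: TheBloke's GGML mirror
file = "llama-2-7b-chat.ggmlv3.q4_K_S.bin"
url  = "https://huggingface.co/$repo/resolve/main/$file"

# Uncomment to fetch (the file is several GB, so this takes a while):
# Downloads.download(url, file)
```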
Can these (clearly smaller) models also be run using Transformers.jl?
Thanks a lot! This worked. Yet, when running it, I am surprised that there seems to be little or no use of the GPU: the model does not use any GPU RAM, and the speed was also not that great (1.7 tokens per second).
Is the Llama2.jl toolbox a CPU-only Llama implementation?
You can call Llama from Transformers.jl; this is what I did. See the notebook.
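For anyone landing here later, a rough sketch of loading a HuggingFace Llama checkpoint through Transformers.jl's `hgf` interface. The task symbol and the `f16` step are assumptions based on Transformers.jl and Flux.jl conventions, so check the current documentation; note that this gated model requires prior access approval and your HuggingFace token, so the snippet cannot run without them:

```julia
using Transformers, Transformers.HuggingFace
using Flux  # provides the f16 precision adaptor

# Gated model: requires accepted access request + HuggingFace token.
textenc = hgf"meta-llama/Llama-2-7b-chat-hf:tokenizer"
model   = hgf"meta-llama/Llama-2-7b-chat-hf:forcausallm"

# Optionally halve the weight memory (see the Float16 caveats earlier
# in this thread about carrying the whole computation in Float16):
# model = Flux.f16(model)

model = todevice(model)   # move to GPU if CUDA is functional
```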