16-bit float on Transformers.jl

I am new to LLMs but thought to give it a shot using Transformers.jl. This worked rather well, after creating an account at HuggingFace and getting permission for Llama-7B from Meta. Yet, I was surprised how much GPU RAM was needed even for this “small” model. Running the notebook example works, but it maxes out the GPU memory on my 47 GB RTX A6000.

Answering a question like the example “Can you explain to me briefly what is the Julia programming language?” takes about 1 minute on that GPU.

I wonder whether one can speed things up by using 16-bit floats for the weights. Is this supported by Transformers.jl? Can one convert the same model, or does one need a different one from HuggingFace?
Or does the system always claim all GPU memory anyway, and the speed is what is to be expected?

Thanks for any help!

I don’t know if this will help:
State of quantization

16-bit with Transformers.jl will work out of the box, but it will hallucinate. I believe the problem is that when we convert the weights and embeddings to Float16, the entire computation is carried out in Float16. What HF’s transformers does, I believe, is store the weights in Float16 but perform the computation in Float32, which gives the best of both worlds. I think we would need to either (i) convert the weights to Float32 just before use, or (ii) define a matmul for multiplying Float16 and Float32 matrices. A toy illustration of option (i) is sketched below.
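To make the idea concrete, here is a minimal, self-contained Julia sketch of option (i): weights stored in Float16, promoted back to Float32 right before the multiplication. The array names and sizes are invented for the example and are not Transformers.jl internals.

```julia
# Toy illustration of "store in Float16, compute in Float32".
# Names and sizes are made up for the example; this is not Transformers.jl API.
W16 = Float16.(randn(Float32, 4096, 4096))  # weights stored at half precision (half the memory)
x   = randn(Float32, 4096)                  # activations kept in Float32

y_f16 = W16 * Float16.(x)    # all-Float16 path: small and fast, but accumulates in Float16
y_mix = Float32.(W16) * x    # promote weights just before use: compute in Float32

# Compare against a full Float32 reference:
y_ref = Float32.(W16) * x
maximum(abs.(Float32.(y_f16) .- y_ref))  # accumulation error of the all-Float16 path
maximum(abs.(y_mix .- y_ref))            # 0.0 -- identical to the Float32 reference
```

In practice the promotion would happen inside each layer’s forward pass, so only one Float32 copy of a weight matrix exists at a time; the precision loss comes from accumulating in Float16, not from storing in it.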

You may want to consider Llama2.jl (GitHub: cafaxo/Llama2.jl, “llama2.c but in Julia”).

Using llama-2-7b-chat.ggmlv3.q4_K_S.bin,
for the same prompt, I got 10 tokens per second on a MacBook Pro.
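For reference, loading a GGML-quantized checkpoint with Llama2.jl looks roughly like the sketch below. The function names are taken from the package README as I remember it, so treat them as assumptions and check the current documentation.

```julia
# Rough sketch, from memory of the Llama2.jl README -- function names may have
# changed, so verify against the package docs before relying on this.
using Llama2

model = load_ggml_model("llama-2-7b-chat.ggmlv3.q4_K_S.bin")  # 4-bit quantized weights
sample(model, "Can you explain to me briefly what is the Julia programming language?")
```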

Thanks for this hint.
I was using the TinyLlama example (the download link is in the code), which indeed gave me roughly 200 tokens per second. Nice.

But then I tried to download llama-2-7b-chat.ggmlv3.q4_K_S.bin and I am not sure where to find it. It is listed on HuggingFace, but I could not find a way to download anything, despite trying for about an hour. The description mentions a download button, but I could not find one (while logged in). It also says something about this being an old format.
Is there any link for this?

Can these (clearly smaller) models also be run using Transformers.jl?

Thanks a lot! This worked. Yet, when running it, I am now surprised that there seems to be little or no use of the GPU. This model does not use any GPU RAM, and the speed was also not that great (1.7 tokens per second).
Is that Llama2.jl toolbox a CPU-only Llama implementation?

No, you can call Llama from Transformers.jl; this is what I did. See the notebook.
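For completeness, here is a rough sketch of what loading Llama through Transformers.jl’s HuggingFace integration looks like. The checkpoint name and the `:forcausallm` component string are assumptions on my part, and the full sampling loop lives in the notebook mentioned above.

```julia
# Sketch based on the Transformers.jl HuggingFace examples.
# The checkpoint name and the ":forcausallm" component are assumptions;
# see the notebook referenced above for the complete generation loop.
using Transformers
using Transformers.HuggingFace
using Transformers.TextEncoders: encode

enable_gpu(true)  # make `todevice` move arrays to the GPU

textenc = hgf"meta-llama/Llama-2-7b-chat-hf:tokenizer"
model   = todevice(hgf"meta-llama/Llama-2-7b-chat-hf:forcausallm")

input  = encode(textenc, "Can you explain to me briefly what is the Julia programming language?")
output = model(todevice(input))   # single forward pass; `output.logit` holds the next-token scores
```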