I want to present the results of one more test I wanted to run. I have noticed that llama2-7b gives weird (read: incorrect) results when used with Float16. I set up a small test demonstrating the behavior to nail down the source, but I have failed in this regard so far. Here is how it goes.
```julia
using ProfileSummarizer
using Transformers
using Flux
using TextEncodeBase
using NeuralAttentionlib
using HuggingFaceApi
using FiniteDifferences
using Zygote
using CUDA
using StatsBase

CUDA.device!(0)

# `access_token` must hold a valid Hugging Face access token for the gated Llama 2 repo.
textenc = HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token);
model = HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm", auth_token = access_token);

embeddings = model.model.embed.token.embeddings;
decoder = model.model.decoder;

function test_precisions(layer; togpu = gpu, use_hidden = true)
    layer = cpu(layer)
    tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
    xx = embeddings * tokens
    # Run the same layer on the same input, once in Float16 and once in Float32.
    os = map((f16, f32)) do f
        l = togpu(f(layer))
        θ = togpu(f(xx))
        use_hidden ? l((; hidden_state = θ)).hidden_state : l(θ)
    end
    # Largest absolute difference between the Float16 and the Float32 output.
    maximum(abs.(os[1] .- os[2]))
end

test_precisions(decoder; togpu = gpu)
# 80.53846f0
test_precisions(decoder; togpu = cpu)  # be aware that this takes ages to execute
# 80.53849f0

test_precisions(decoder.layers; togpu = gpu)
# 0.023721457f0
test_precisions(decoder.layers; togpu = cpu)
# 0.07958412f0

test_precisions(decoder.layers.attention; togpu = gpu)
# 0.003145814f0
test_precisions(decoder.layers.attention; togpu = cpu)
# 0.010890484f0

test_precisions(decoder.layers.feedforward; togpu = gpu)
# 0.0047082305f0
test_precisions(decoder.layers.feedforward; togpu = cpu)
# 0.055611372f0
```
In other words, for some reason the llama2 computation accumulates errors between Float16 and Float32. I find it surprising that the difference is so large at the end (about 80) when the individual pieces differ by less than 0.1. According to the model's config.json https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/config.json, the model was trained in Float16. It would be great if someone with good Python skills could run a similar test in PyTorch, as I do not have easy access to an environment with working PyTorch on an A100.
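To narrow down where the divergence grows, one option is to record the Float16 vs. Float32 difference after every layer rather than only at the end. The following is a minimal, self-contained sketch of that idea. It uses a hypothetical stack of random dense layers with `tanh` (the name `layerwise_divergence` and all its parameters are made up for illustration), not the actual Llama 2 decoder blocks, so the numbers it prints say nothing about Llama 2 itself.

```julia
using Random

# Hypothetical stand-in for a layer stack: random dense layers followed by tanh.
# These are NOT the Llama 2 decoder blocks; only the measurement pattern matters.
function layerwise_divergence(; width = 64, depth = 8, seed = 1)
    rng = MersenneTwister(seed)
    Ws = [randn(rng, Float32, width, width) ./ sqrt(Float32(width)) for _ in 1:depth]
    x32 = randn(rng, Float32, width)
    x16 = Float16.(x32)
    diffs = Float32[]
    for W in Ws
        x32 = tanh.(W * x32)              # reference path in Float32
        x16 = tanh.(Float16.(W) * x16)    # same computation in Float16
        # Float16 .- Float32 promotes to Float32, as in `test_precisions`.
        push!(diffs, maximum(abs.(x16 .- x32)))
    end
    return diffs
end

diffs = layerwise_divergence()
for (i, d) in enumerate(diffs)
    println("after layer ", i, ": max |f16 - f32| = ", d)
end
```

Applied to the real model, the same pattern would mean feeding the output of each decoder block to the next one in both precisions and logging the difference per block, which should show whether the error grows gradually or jumps at one particular block.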
I found the problem by testing this notebook https://github.com/chengchingwen/Transformers.jl/blob/master/example/Llama2_example.ipynb with the model converted to f16. You can check that the model starts to hallucinate. It might be that llama2-7b requires BFloat16, but I think there is no support for it.
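For context on why BFloat16 could behave differently from Float16: BFloat16 keeps the 8 exponent bits of Float32 and gives up mantissa bits, so it has a much larger range but coarser precision. The sketch below illustrates this in plain Julia. The helper `to_bf16` is a made-up name and only emulates BFloat16 by truncating a Float32 to its upper 16 bits (a real conversion would round to nearest). Whether range is actually what breaks llama2-7b in Float16 is an assumption that the test above does not establish.

```julia
# Emulate BFloat16 by keeping only the upper 16 bits of a Float32.
# This truncates instead of rounding to nearest, so it only approximates
# a real BFloat16 conversion.
to_bf16(x::Float32) = reinterpret(Float32, reinterpret(UInt32, x) & 0xffff0000)

x_large = 70000f0   # above floatmax(Float16) == 65504
x_fine  = 1.001f0   # needs more significant bits than BFloat16 has

println("floatmax(Float16)  = ", floatmax(Float16))
println("Float16(x_large)   = ", Float16(x_large))
println("to_bf16(x_large)   = ", to_bf16(x_large))
println("Float16(x_fine)    = ", Float16(x_fine))
println("to_bf16(x_fine)    = ", to_bf16(x_fine))
```

The large value overflows to `Inf` in Float16 but stays finite in the emulated BFloat16, while the small increment above 1 survives in Float16 and is lost in the emulated BFloat16.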