Hello,
I want to present the results of one more test I wanted to do. I have noticed that llama2-7b gives weird (read: incorrect) results when used with Float16. I have set up a small test demonstrating the behavior to nail down the source, but I have not succeeded in that so far. Here is how it goes.
using ProfileSummarizer
using Transformers
using Flux
using TextEncodeBase
using NeuralAttentionlib
using HuggingFaceApi
using FiniteDifferences
using Zygote
using CUDA
using StatsBase
CUDA.device!(0)
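# access_token is assumed to hold a HuggingFace access token with access to the gated llama2 weights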
textenc = HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token);
model = HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm", auth_token = access_token);
embeddings = model.model.embed.token.embeddings;
decoder = model.model.decoder;
function test_precisions(layer; togpu = gpu, use_hidden = true)
    layer = cpu(layer)
    tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
    xx = embeddings * tokens          # Float32 embeddings of the prompt
    os = map((f16, f32)) do f         # run the same layer in Float16 and in Float32
        l = togpu(f(layer))
        θ = togpu(f(xx))
        # full blocks take / return a named tuple with a hidden_state field
        use_hidden ? l((; hidden_state = θ)).hidden_state : l(θ)
    end
    maximum(abs.(os[1] .- os[2]))     # largest elementwise difference between the two precisions
end
test_precisions(decoder; togpu = gpu)
# 80.53846f0
test_precisions(decoder; togpu = cpu) # be aware that this takes ages to execute
# 80.53849f0
test_precisions(decoder.layers[1][1]; togpu = gpu)
# 0.023721457f0
test_precisions(decoder.layers[1][1]; togpu = cpu)
# 0.07958412f0
test_precisions(decoder.layers[1][1].attention; togpu = gpu)
# 0.003145814f0
test_precisions(decoder.layers[1][1].attention; togpu = cpu)
# 0.010890484f0
test_precisions(decoder.layers[1][1].feedforward; togpu = gpu)
# 0.0047082305f0
test_precisions(decoder.layers[1][1].feedforward; togpu = cpu)
# 0.055611372f0
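To see whether one block is an outlier or the Float16/Float32 gap simply compounds with depth, the same helper can be run over every block. This is only a sketch: it feeds each block the same embedded prompt rather than the true intermediate hidden states, and it assumes decoder.layers[1] supports length and getindex, as the indexing above suggests.

block_errors = map(1:length(decoder.layers[1])) do i
    test_precisions(decoder.layers[1][i]; togpu = gpu)
end
extrema(block_errors)   # smallest and largest per-block Float16/Float32 difference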
In other words, for some reason the llama2 computational model accumulates differences between Float16 and Float32 as it goes deeper. I find it surprising that the difference at the end is so large (about 80). According to the model config, https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/config.json, the model was trained in Float16. It would be great if someone with good Python skills could run a similar test in PyTorch, as I do not have easy access to an environment with a working PyTorch on an A100.
I found the problem by running this notebook https://github.com/chengchingwen/Transformers.jl/blob/master/example/Llama2_example.ipynb with the model converted to Float16 by f16. You can check that the model starts to hallucinate. It might be that llama2-7b requires BFloat16, but as far as I know there is no support for that.
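For reference, here is a minimal sketch of how one could at least try BFloat16 on the CPU, using the BFloat16s.jl package. The bf16 helper below is hypothetical (Flux ships no BFloat16 analogue of f16/f32), and I have not verified that every kernel in the llama block accepts BFloat16 arrays, so this may be very slow or error out.

using BFloat16s            # provides the BFloat16 number type

# Hypothetical analogue of Flux's f16/f32: convert every floating-point array
# in a model to BFloat16. fmap comes from Functors.jl and is re-exported by Flux.
bf16(m) = Flux.fmap(x -> x isa AbstractArray{<:AbstractFloat} ? BFloat16.(x) : x, m)

# Mirror test_precisions, but compare BFloat16 against Float32, CPU only.
function test_bf16(layer)
    layer = cpu(layer)
    tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
    xx = embeddings * tokens
    os = map((bf16, f32)) do f
        l = f(layer)
        x = f(xx)
        l((; hidden_state = x)).hidden_state
    end
    maximum(abs.(Float32.(os[1]) .- Float32.(os[2])))
end

test_bf16(decoder.layers[1][1])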