Llama2-7b difference in inference between Float16 and Float32

Hello,

I want to present the results of one more test I did. I have noticed that llama2-7b gives weird (read: incorrect) results when used with Float16. I have set up a small test demonstrating the behavior to nail down the source, but I have failed in this regard so far. Here is how it goes.

using ProfileSummarizer
using Transformers
using Flux
using TextEncodeBase
using NeuralAttentionlib
using HuggingFaceApi
using FiniteDifferences
using Zygote
using CUDA
using StatsBase


CUDA.device!(0)
# access_token is a Hugging Face access token with access to the meta-llama repos
textenc = HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token);
model = HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm"; auth_token = access_token);
embeddings = model.model.embed.token.embeddings;  # token embedding matrix
decoder = model.model.decoder;                    # the stack of transformer blocks

function test_precisions(layer; togpu = gpu, use_hidden = true)
    # Run the same layer on the same input in Float16 and Float32 and
    # return the largest absolute difference between the two outputs.
    layer = cpu(layer)
    tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
    xx = embeddings * tokens
    os = map((f16, f32)) do f
        l = togpu(f(layer))
        θ = togpu(f(xx))
        use_hidden ? l((; hidden_state = θ)).hidden_state : l(θ)
    end
    maximum(abs.(os[1] .- os[2]))
end

test_precisions(decoder;togpu = gpu)
# 80.53846f0
test_precisions(decoder;togpu = cpu) # be aware that this takes ages to execute
# 80.53849f0

test_precisions(decoder.layers[1][1];togpu = gpu)
# 0.023721457f0
test_precisions(decoder.layers[1][1];togpu = cpu)
# 0.07958412f0

test_precisions(decoder.layers[1][1].attention;togpu = gpu)
# 0.003145814f0
test_precisions(decoder.layers[1][1].attention;togpu = cpu)
# 0.010890484f0

test_precisions(decoder.layers[1][1].feedforward;togpu = gpu)
# 0.0047082305f0
test_precisions(decoder.layers[1][1].feedforward;togpu = cpu)
# 0.055611372f0

In other words, for some reason, the llama2 computational model accumulates errors between Float16 and Float32. I found it surprising that the difference at the end is so high (about 80). According to the model card https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/config.json, the model was trained in Float16. It would be great if someone with good Python skills could run a similar test in PyTorch, as I do not have easy access to an environment with a working PyTorch on an A100.
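
In the meantime, a variant of the test that compares each precision against a Float64 reference on the CPU should show which one actually drifts. This is only a sketch reusing the same globals as above (Flux's f64 does the Float64 conversion, and running it on the full decoder is very slow):

function test_against_f64(layer; togpu = cpu)
    # Compare Float16 and Float32 outputs against a Float64 CPU reference.
    layer = cpu(layer)
    tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
    xx = embeddings * tokens
    ref = f64(layer)((; hidden_state = f64(xx))).hidden_state
    map((f16, f32)) do f
        l = togpu(f(layer))
        θ = togpu(f(xx))
        o = l((; hidden_state = θ)).hidden_state
        maximum(abs.(Float64.(cpu(o)) .- ref))
    end
end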

I found the problem by testing this notebook https://github.com/chengchingwen/Transformers.jl/blob/master/example/Llama2_example.ipynb with the model converted to Float16 by f16. You can check that the model starts to hallucinate. It might be that llama2-7b requires BFloat16, but I think there is no support for that.


My first thought was: are you sure it’s not bfloat16? It seems not, but Float16 (either format) in Julia rounds after each operation, losing accuracy, so errors accumulate.
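
A toy sum shows that effect on its own, independent of the model (a minimal sketch):

xs16 = fill(Float16(1f-3), 10_000)   # ten thousand copies of ~0.001 in Float16
xs32 = fill(1f-3, 10_000)            # the same values in Float32
foldl(+, xs16)   # left-to-right Float16 accumulation stalls well below the true value of 10
foldl(+, xs32)   # Float32 accumulation stays close to 10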

Are you running the model on the GPU? It might be that GPUs do all operations with a larger accumulator. I’m not sure a CPU has that capability, unless you cast to Float32 or Float64, and you would likely need to do that explicitly.
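
If you want the wider accumulation on the CPU, you have to widen explicitly before the matmul. A minimal sketch with made-up shapes:

W16 = rand(Float16, 512, 512)           # pretend these are Float16 weights
x16 = rand(Float16, 512)
y_f16 = W16 * x16                       # computed at whatever precision this fallback path uses
y_f32 = Float32.(W16) * Float32.(x16)   # widen explicitly so the accumulation happens in Float32
maximum(abs.(Float32.(y_f16) .- y_f32))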

The Llama2 models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the Hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant, unless you are using torch_dtype="auto" when initializing a model with model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online), then it will be cast to the default dtype of torch (torch.float32), and finally, if there is a torch_dtype provided in the config, it will be used.

Training the model in float16 is not recommended and is known to produce nan; as such, the model should be trained in bfloat16.
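
Mapped back to the Julia side (a sketch reusing the loading call from the first post), the same separation holds: the checkpoint dtype does not fix the compute dtype, you cast explicitly:

model = HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm"; auth_token = access_token)
model_f16 = f16(model)   # analogous to loading with torch_dtype = float16
model_f32 = f32(model)   # analogous to torch's default of float32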

I couldn’t confirm, since that link doesn’t work. Also, it was most likely trained on a GPU, thus not really in 16 bits only. Maybe it’s just natural that you can’t use Float16, at least on CPUs. Besides, it’s very much slower there and only thought of as a storage format.

A note in case it is helpful to you:

Since C++23, C++ (but not C) has bfloat16, i.e. std::bfloat16_t (it also has std::float16_t, which C does have):
https://en.cppreference.com/w/cpp/types/floating-point

@Oscar_Smith Maybe Julia should add bfloat16, to catch up with C++'s future… though a package is just as good (a different argument can be made for standardized languages and their stdlibs). Maybe there is no need to have it in non-standardized Julia itself; rather, excise Float16…?
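
For what it's worth, a registered package already exists; a minimal sketch, assuming BFloat16s.jl is the one meant here:

using BFloat16s                       # registered package providing a BFloat16 number type
W = BFloat16.(rand(Float32, 4, 4))    # store in bfloat16
eltype(W)                             # BFloat16
Float32.(W)                           # widening back to Float32 is exact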
