Gradient of llama2 computed by Zygote seems to be incorrect

Hi,

I have tested the gradient of the llama2-7b model with respect to its input and it gives me wrong results (tested only on GPU). I have created an MWE as follows

using ProfileSummarizer
using Transformers
using Flux
using TextEncodeBase
using NeuralAttentionlib
using HuggingFaceApi
using FiniteDifferences
using Zygote
using CUDA
using StatsBase


CUDA.device!(0)
textenc = HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token);
model = f32(HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm", auth_token = access_token));
model = gpu(model);
embeddings = model.model.embedding.token.embeddings;
decoder = model.model.decoder;


# We compute randomly gradients with respect to few selected elements and compare them
# to true values. This is a sanity check, because otherwise the test would take ages.

let
	tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
	θ = cpu(embeddings * tokens)
	ii = sample(eachindex(θ), 1000, replace = false)
	ii = [23, 5641, 18928, 4354, 16801, 294, 18929, 17793, 1277, 17098]
	sub_θ = θ[ii]
	function sub_f(sub_θ)
		θ[ii] = sub_θ
		f(θ)
	end

	function f(θ)
		hidden_state = gpu(θ)
		sum(decoder((;hidden_state)).hidden_state)
	end

	fin_gs = grad(central_fdm(5, 1), sub_f, sub_θ)[1]
	zyg_gs = Zygote.gradient(f, θ)[1][ii]
	hcat(fin_gs, zyg_gs)
	sort(abs.(fin_gs .- zyg_gs), rev = true)
end

which returns

10-element Vector{Float32}:
 4.4233932
 2.523777
 1.9660205
 1.4594455
 0.67348194
 0.589447
 0.5284171
 0.5228882
 0.36881256
 0.18504143

which seems to me to be off by a large margin.

I use Julia 1.9.2 and my Pkg.status() says

  [7d9f7c33] Accessors v0.1.32
  [6e4b80f9] BenchmarkTools v1.3.2
⌃ [052768ef] CUDA v4.4.0
  [d360d2e6] ChainRulesCore v1.16.0
  [a93c6f00] DataFrames v1.6.1
  [26cc04aa] FiniteDifferences v0.12.30
⌅ [587475ba] Flux v0.13.17
  [d9f16b24] Functors v0.4.5
  [3cc741c3] HuggingFaceApi v0.1.0
  [f1d291b0] MLUtils v0.4.3
  [5da4648a] NVTX v0.3.2
⌃ [12afc1b8] NeuralAttentionlib v0.2.11
  [0b1bfda6] OneHotArrays v0.2.4
  [6099a3de] PythonCall v0.9.14
  [2913bbd2] StatsBase v0.34.0
  [354b36f9] StringViews v1.3.3
⌃ [899adc3e] TensorBoardLogger v0.1.21
  [f92c20c0] TextEncodeBase v0.6.0
  [21ca0261] Transformers v0.2.7
⌃ [e88e6eb3] Zygote v0.6.63

I will try to narrow down the problem by testing individual layers, but this is going to take a bit of time.

It seems to me to be the effect of normalization layers and Float32, which was my suspicion. I redid my MWE as follows

using ProfileSummarizer
using Transformers
using Flux
using TextEncodeBase
using NeuralAttentionlib
using HuggingFaceApi
using FiniteDifferences
using Zygote
using CUDA
using StatsBase


CUDA.device!(0)
textenc = HuggingFace.load_tokenizer("meta-llama/Llama-2-7b-chat-hf"; auth_token = access_token);
model = f32(HuggingFace.load_model("llama", "meta-llama/Llama-2-7b-chat-hf", "forCausallm", auth_token = access_token));
model = gpu(model);
embeddings = model.model.embedding.token.embeddings;
decoder = model.model.decoder;


function test_layer(layer;togpu = cpu, float_precision = f64, use_hidden = true)
	layer = togpu(float_precision(layer))
	tokens = TextEncodeBase.encode(textenc, "the most important text for this test").token
	θ = cpu(float_precision(embeddings * tokens))
	ii = [23, 5641, 18928, 4354, 16801, 294, 18929, 17793, 1277, 17098]
	sub_θ = θ[ii]
	function sub_f(sub_θ)
		θ[ii] = sub_θ
		f(θ)
	end

	f(θ) = use_hidden ? sum(layer((;hidden_state = togpu(θ))).hidden_state) : sum(layer(togpu(θ)))
	fin_gs = grad(central_fdm(5, 1), sub_f, sub_θ)[1];
	zyg_gs = Zygote.gradient(f, θ)[1][ii];
	abs.(fin_gs .- zyg_gs)
end

test_layer(decoder.layers[1][1].feedforward.layer;togpu = gpu, float_precision = f32) |> sort
test_layer(decoder.layers[1][1].feedforward.layer;togpu = cpu) |> sort
test_layer(decoder.layers[1][1].feedforward.norm;togpu = gpu, float_precision = f32, use_hidden = false) |> sort
test_layer(decoder.layers[1][1].feedforward.norm;togpu = cpu, float_precision = f32, use_hidden = false) |> sort
test_layer(decoder.layers[1][1].feedforward.norm;togpu = cpu, float_precision = f64, use_hidden = false) |> sort

For brevity, I show only the output of the last two tests, which run on the CPU with Float32 and Float64 precision:

julia> test_layer(decoder.layers[1][1].feedforward.norm;togpu = cpu, float_precision = f32, use_hidden = false) |> sort
10-element Vector{Float32}:
 2.2888184f-5
 3.2424927f-5
 3.4332275f-5
 3.6239624f-5
 0.00011920929
 0.0010175705
 0.0033392906
 0.0042181015
 0.004609585
 0.051587105

julia> test_layer(decoder.layers[1][1].feedforward.norm;togpu = cpu, float_precision = f64, use_hidden = false) |> sort
10-element Vector{Float64}:
 1.602273869139026e-12
 6.565414878423326e-12
 7.716494110354688e-12
 8.521183758603001e-12
 9.283240842705709e-12
 6.4617158352930915e-6
 6.750580402670181e-6
 9.646580691580198e-6
 0.00020662468795151767
 0.0010416488879068098

Increasing the precision therefore greatly increases the accuracy of the gradient check.
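
For some intuition on why the precision matters so much here (this is my own back-of-the-envelope reasoning, not something reported by FiniteDifferences): a 5-point central difference has truncation error shrinking like h^4, while the rounding noise δ of a single loss evaluation is amplified by 1/h, so the best error it can reach scales roughly like δ^(4/5). If the loss is of order 1e4 (a guess for a sum over a few thousand hidden units), a single Float32 evaluation is noisy at least at the level of eps(Float32) * 1e4, and in practice more, because rounding accumulates over the whole forward pass:

# rough floor on the error of a 5-point central difference when one loss evaluation has noise δ
fd_floor(δ) = δ^(4/5)

fd_floor(eps(Float32) * 1f4)   # ≈ 5e-3, only a floor; the real forward pass is noisier than eps*|f|
fd_floor(eps(Float64) * 1e4)   # ≈ 5e-10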

So the problem is in RMSLayerNorm. Not that it is wrong, but the finite-difference estimate can have a large error when the vectors have a small norm.
This was my suspicion from the beginning. I mark this as the solution.
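
One can rerun the same kind of comparison on a bare RMS normalization to see that it is the finite differences, not Zygote, that degrade in Float32. This is only a minimal sketch with made-up names (rmsnorm, loss, loss32); it is not the actual RMSLayerNorm from Transformers.jl, which also has a learned scale:

using Zygote, FiniteDifferences

# bare RMS normalization (no learned scale, no ε) -- enough to show the numerics
rmsnorm(x) = x ./ sqrt(sum(abs2, x) / length(x))
loss(x) = sum(rmsnorm(x))
loss32(x) = loss(Float32.(x))          # the same function, evaluated in Float32

x = 0.01 .* randn(4096)                # small-norm Float64 vector, roughly like one hidden state

zyg  = Zygote.gradient(loss, x)[1]     # Zygote gradient in Float64, used as the reference
fd64 = grad(central_fdm(5, 1), loss, x)[1]
fd32 = grad(central_fdm(5, 1), loss32, x)[1]

maximum(abs.(fd64 .- zyg))   # tiny -- the check passes in Float64
maximum(abs.(fd32 .- zyg))   # orders of magnitude larger -- Float32 rounding dominates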


It seems you figured out and solved the problem?

I want to dive into using these models, either with Julia (it seemed simple enough for you, had it worked on the first try) or with Python.

I want to ask why you chose that model (and that size), and why Julia? Would Python have been easier? It gets 45.3 on MMLU (the largest 70B gets 68.9, so I guess you chose the smallest because of resource limitations, i.e. what your GPU can handle).

Here is a better one, almost as good as the largest Llama 2:

You can read about it and try the demo here: https://platypus-llm.github.io/

You can get over 51 MMLU (for a non-huge model) with e.g.

Depending on the metric, I think these might be best:

MMLU (5-shot) 70.74

Uni-TianYan is a finetuned model from LLaMA2.
[…]

MMLU (5-shot) 69.91
TruthfulQA (0-shot) 65.81
Avg. 73.81

You can get 64.75 MMLU on this one from 3 weeks ago, with at least the Llama 2 licence. Yes, it’s a 35.3 GB file (also available down to 26.78 GB), which is maybe why you do not use one that large:

Open large language models (LLMs) have traditionally been tailored for either textual or code-related tasks, with limited ability to effectively balance both. However, […]
In this work, we introduce Lemur-70B-v1 and Lemur-70B-chat-v1, the state-of-the-art open pretrained and supervised fine-tuned large language models balancing text and code intelligence.
[…]

Model type: Stable Beluga 7B is an auto-regressive language model fine-tuned on Llama2 7B.
[…]
License: Fine-tuned checkpoints (Stable Beluga 7B) is licensed under the STABLE BELUGA NON-COMMERCIAL COMMUNITY LICENSE AGREEMENT

This is the LLaMAfied version of Qwen/Qwen-7B-Chat, recalibrated to fit the original LLaMA/LLaMA-2-like model structure.

You can use LlamaForCausalLM for model inference, which is the same as LLaMA/LLaMA-2 models (using GPT2Tokenizer converted from the original tiktoken, by vonjack).

Llama 2 13b is a pretty decent language model. You know what’s probably better? Two Llama 2 13b models. In a trenchcoat.
Produced by bakllama.py with this config file:
[…]
This tells us two very important things:

  1. TruthfulQA is a perfect benchmark in every way.
  2. Llama models are amazingly robust to being fed their own output.

You can get a slightly lower 43.99 MMLU (#Params: 1.13):

This one is very intriguing:

mixed strategy: 100%Open-Platypus + ~1%Dolphin(GPT-4) + ~1%OpenOrca(GPT-4)

This one gets 70.26 MMLU:

uni-tianyan/Uni-TianYan (gpt4)

Hi Palli,

thanks for the links.

Yes, I think I solved the problem.
To answer your question of why I have chosen Julia instead of Python: the answer is very simple.
I like Julia. By solving issues, I learn a lot, despite sometimes slower progress. Transformers.jl is a very good lib and serves me well. Moreover, I think we should not be afraid to use LLMs with Julia.


Well, we shouldn’t, but I was a bit afraid, and you inspire me to use Julia (it seems easy with the MWE). I also like Julia; I believe and understand that the Python ecosystem may be better in some ways, at least for learning, e.g. if docs are missing or you encounter bugs in Julia that you would not in Python.

Would you have avoided this specific issue with Python? Do you have good experience with LLMs in both languages? Even if Julia might be less scalable for training, it’s good to know whether you could at least use LLMs for inference.

To be honest, my experience with the Python ecosystem is small. I delve into it only when I read and use the reference implementation of some paper. In the case of this thread, there was no bug. The very same thing would happen in Python.

Transformers from Hugging Face has more features than Transformers.jl, especially things like mixed and low precision and distributing models across multiple GPUs. But Julia’s Transformers.jl is super fun to hack on, because, in a nutshell, the lib is very transparent, and that is why I enjoy working with it. The biggest trouble I have faced with it is that I sometimes do not know how to download some model from Hugging Face.
