Hello,
I am trying to add Microsoft's phi model to Transformers.jl, since it seems like a “small” language model to play with, and I also wanted to learn how the HuggingFace integration is done. My idea was to run the Julia and Python versions in parallel and compare intermediate results. Ideally I would run both in the same Julia session using PythonCall, so that I can compare the results directly. The problem I have run into is that I get different results when the same (hopefully the same) code is executed through PythonCall and in native Python.
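The kind of direct comparison I have in mind is roughly the sketch below: dump a tensor from the native Python run with torch.save, load it inside the Julia session, and let torch.allclose decide whether the two runs agree. The helper name, the file name and the tolerance are just placeholders I made up:

using PythonCall

Torch = pyimport("torch")

# Compare a tensor computed through PythonCall against a reference tensor
# that was saved from the native Python run with torch.save(x, path).
function matches_reference(x, path; atol=1e-5)
    reference = Torch.load(path)
    return pyconvert(Bool, Torch.allclose(x, reference, atol=atol))
end

# e.g. matches_reference(hidden_states, "hidden_states.pt") after running the MWE below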
Let me show my MWE. The Julia version looks like this:
using PythonCall, CondaPkg, DLPack
Torch = pyimport("torch")
Transformers = pyimport("transformers")
# load the model and tokenizer from the HuggingFace hub
model = Transformers.AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", trust_remote_code=true)
tokenizer = Transformers.AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=true)
# tokenize the prompt, embed it and apply the first layer's layernorm
s = "Tell me something about Julia?"
inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)
Julia output
julia> inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
Python: {'input_ids': tensor([[24446, 502, 1223, 546, 22300, 30]])}
julia> e = model.model.embed_tokens(inputs.input_ids)
Python:
tensor([[[-0.0312, 0.0060, -0.0284, ..., -0.0190, -0.0157, 0.0041],
[ 0.0246, 0.0117, -0.0089, ..., -0.0542, -0.0132, -0.0856],
[ 0.0204, 0.0095, -0.0438, ..., -0.0451, -0.0419, -0.0366],
[ 0.0491, -0.0586, 0.0671, ..., 0.0373, -0.0188, -0.0319],
[ 0.0177, -0.0105, -0.0267, ..., -0.0295, -0.0215, -0.0230],
[ 0.0750, -0.0425, -0.0025, ..., 0.0589, 0.0373, -0.0148]]],
grad_fn=<EmbeddingBackward0>)
julia> hidden_states = model.model.layers[0].input_layernorm(e)
Python:
tensor([[[-2.7861e-01, 4.1824e-02, -2.2774e-01, ..., -1.5344e-01,
-1.2405e-01, 1.9184e-02],
[ 1.2535e-01, 5.9343e-02, -2.7328e-02, ..., -3.0833e-01,
-6.8912e-02, -2.5117e-01],
[ 1.0916e-36, 3.4438e-41, 1.0916e-36, ..., 3.4438e-41,
1.0899e-36, 3.4438e-41],
[ 1.0897e-36, 3.4438e-41, 1.0889e-36, ..., 3.4438e-41,
2.1460e+29, 4.5817e-41],
[ 4.2328e+26, 4.5817e-41, 4.2328e+26, ..., 3.4438e-41,
1.0568e-36, 3.4438e-41],
[ 1.5089e-37, 3.4438e-41, 1.0613e-36, ..., 3.4438e-41,
1.0501e-36, 3.4438e-41]]], grad_fn=<NativeLayerNormBackward0>)
The Python version looks like this:
import torch
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained('microsoft/phi-1', torch_dtype='auto', trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/phi-1', trust_remote_code=True)
s = 'Tell me something about Julia?'
inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
m = model.model
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)
Python output
>>> inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
>>> inputs
{'input_ids': tensor([[24446, 502, 1223, 546, 22300, 30]])}
>>> e = model.model.embed_tokens(inputs.input_ids)
tensor([[[-0.0312, 0.0060, -0.0284, ..., -0.0190, -0.0157, 0.0041],
[ 0.0246, 0.0117, -0.0089, ..., -0.0542, -0.0132, -0.0856],
[ 0.0204, 0.0095, -0.0438, ..., -0.0451, -0.0419, -0.0366],
[ 0.0491, -0.0586, 0.0671, ..., 0.0373, -0.0188, -0.0319],
[ 0.0177, -0.0105, -0.0267, ..., -0.0295, -0.0215, -0.0230],
[ 0.0750, -0.0425, -0.0025, ..., 0.0589, 0.0373, -0.0148]]],
grad_fn=<EmbeddingBackward0>)
>>> hidden_states = model.model.layers[0].input_layernorm(e)
tensor([[[-0.2786, 0.0418, -0.2277, ..., -0.1534, -0.1240, 0.0192],
[ 0.1254, 0.0593, -0.0273, ..., -0.3083, -0.0689, -0.2512],
[ 0.1026, 0.0475, -0.2446, ..., -0.2558, -0.2357, -0.1063],
[ 0.2664, -0.3551, 0.4455, ..., 0.2194, -0.1002, -0.0922],
[ 0.1016, -0.0807, -0.1633, ..., -0.1909, -0.1351, -0.0764],
[ 0.4102, -0.2567, 0.0133, ..., 0.3391, 0.2230, -0.0412]]],
grad_fn=<NativeLayerNormBackward0>)
I executed the Python version by running /private/tmp/.CondaPkg/env/bin/python, as this seems to be the Python environment used by PythonCall. After the layernorm is called, the outputs become very different.
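To double-check that it really is the same interpreter, something like this inside the Julia session should print the path PythonCall uses (I am assuming sys.executable reflects the CondaPkg environment):

using PythonCall
pysys = pyimport("sys")
println(pysys.executable)               # the interpreter PythonCall actually drives
println(pyimport("torch").__version__)  # torch version seen from the Julia session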
Does anyone know what I am doing wrong?
Thanks for any suggestions.
Tomas