PyTorch (Transformers) leads to different results under PythonCall and native Python


I am trying to add Microsoft's phi model to Transformers.jl, since it seems like a “small” language model to play with. I also want to learn how the HuggingFace integration is done.

My idea was to run the Julia and Python versions side by side and compare intermediate results. Ideally, both would run in the same Julia session using PythonCall, so that I can compare the results directly. The problem I have run into is that I get different results when the same (hopefully the same) code is executed under PythonCall and in native Python.

Let me show my MWE.

The Julia version looks like this:

using PythonCall, CondaPkg, DLPack

Torch = pyimport("torch")
Transformers = pyimport("transformers")

model = Transformers.AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", trust_remote_code=true)
tokenizer = Transformers.AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=true)

s = "Tell me something about Julia?"
inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)

Julia output:

julia> inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
Python: {'input_ids': tensor([[24446,   502,  1223,   546, 22300,    30]])}

julia> e = model.model.embed_tokens(inputs.input_ids)
tensor([[[-0.0312,  0.0060, -0.0284,  ..., -0.0190, -0.0157,  0.0041],
         [ 0.0246,  0.0117, -0.0089,  ..., -0.0542, -0.0132, -0.0856],
         [ 0.0204,  0.0095, -0.0438,  ..., -0.0451, -0.0419, -0.0366],
         [ 0.0491, -0.0586,  0.0671,  ...,  0.0373, -0.0188, -0.0319],
         [ 0.0177, -0.0105, -0.0267,  ..., -0.0295, -0.0215, -0.0230],
         [ 0.0750, -0.0425, -0.0025,  ...,  0.0589,  0.0373, -0.0148]]],

julia> hidden_states = model.model.layers[0].input_layernorm(e)
tensor([[[-2.7861e-01,  4.1824e-02, -2.2774e-01,  ..., -1.5344e-01,
          -1.2405e-01,  1.9184e-02],
         [ 1.2535e-01,  5.9343e-02, -2.7328e-02,  ..., -3.0833e-01,
          -6.8912e-02, -2.5117e-01],
         [ 1.0916e-36,  3.4438e-41,  1.0916e-36,  ...,  3.4438e-41,
           1.0899e-36,  3.4438e-41],
         [ 1.0897e-36,  3.4438e-41,  1.0889e-36,  ...,  3.4438e-41,
           2.1460e+29,  4.5817e-41],
         [ 4.2328e+26,  4.5817e-41,  4.2328e+26,  ...,  3.4438e-41,
           1.0568e-36,  3.4438e-41],
         [ 1.5089e-37,  3.4438e-41,  1.0613e-36,  ...,  3.4438e-41,
           1.0501e-36,  3.4438e-41]]], grad_fn=<NativeLayerNormBackward0>)

The Python version looks like this:

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('microsoft/phi-1', torch_dtype='auto', trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/phi-1', trust_remote_code=True)

s = 'Tell me something about Julia?'
inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
m = model.model
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)

Python output:

>>> inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
>>> inputs
{'input_ids': tensor([[24446,   502,  1223,   546, 22300,    30]])}
>>> e = model.model.embed_tokens(inputs.input_ids)
tensor([[[-0.0312,  0.0060, -0.0284,  ..., -0.0190, -0.0157,  0.0041],
         [ 0.0246,  0.0117, -0.0089,  ..., -0.0542, -0.0132, -0.0856],
         [ 0.0204,  0.0095, -0.0438,  ..., -0.0451, -0.0419, -0.0366],
         [ 0.0491, -0.0586,  0.0671,  ...,  0.0373, -0.0188, -0.0319],
         [ 0.0177, -0.0105, -0.0267,  ..., -0.0295, -0.0215, -0.0230],
         [ 0.0750, -0.0425, -0.0025,  ...,  0.0589,  0.0373, -0.0148]]],
>>> hidden_states = model.model.layers[0].input_layernorm(e)
tensor([[[-0.2786,  0.0418, -0.2277,  ..., -0.1534, -0.1240,  0.0192],
         [ 0.1254,  0.0593, -0.0273,  ..., -0.3083, -0.0689, -0.2512],
         [ 0.1026,  0.0475, -0.2446,  ..., -0.2558, -0.2357, -0.1063],
         [ 0.2664, -0.3551,  0.4455,  ...,  0.2194, -0.1002, -0.0922],
         [ 0.1016, -0.0807, -0.1633,  ..., -0.1909, -0.1351, -0.0764],
         [ 0.4102, -0.2567,  0.0133,  ...,  0.3391,  0.2230, -0.0412]]],

I ran the Python version with /private/tmp/.CondaPkg/env/bin/python, as this seems to be the Python environment used by PythonCall. Up to the embedding the outputs are identical; after the layernorm is called, they become very different.

Does anyone know what I am doing wrong?

Thanks for suggestions.



That seems like a bug. You should be able to call all Python libraries. The only exception I can think of is multithreading; I don’t think PythonCall is thread-safe. Could you try without threading, i.e. start Julia without non-default options like -t auto?

I’m not sure whether the Python side uses threads, e.g. torch adding (CPU) threads dynamically. That may or may not be OK; you could look into whether it’s possible to disable. The GPU side is always threaded or parallel, and I don’t think that’s ever an issue to worry about.

Could it simply be a run-to-run difference, independent of PythonCall, with both results good even though some numbers look (or are) very different? Because of non-determinism? Can you ask for deterministic results, e.g. by setting a random seed?

I didn’t look carefully. I see slightly different values; is it more than round-off error? If not, it’s not too serious, but I would still like to know why, and how to get bit-identical results. Some GPU setting?

Not a huge absolute or relative error here:

julia> -2.7861e-01 - -0.2786

julia> -2.7861e-01 / -0.2786

julia> -2.7861e-01 ≈ -0.2786

EDIT: very different:
julia> 4.2328e+26 - 0.4102
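
To make the comparison above concrete, here is a minimal Python sketch that checks a few of the printed layernorm values with a relative tolerance. The helper names and tolerances are my own choices, and the values are copied by hand from the outputs in the first post. It also flags values below the smallest normal float32; such values often show up when bytes that were never written as floats are read as floats, but that is only an interpretation, not something established in this thread.

```python
import math

# Smallest positive normal float32, about 1.1755e-38.
FLOAT32_MIN_NORMAL = 2.0 ** -126


def roughly_equal(a, b, rel_tol=1e-3, abs_tol=1e-6):
    """True if a and b agree up to print rounding (tolerances are assumptions)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


def is_float32_subnormal(value):
    """True for nonzero values too small to be a normal float32."""
    return value != 0.0 and abs(value) < FLOAT32_MIN_NORMAL


# (PythonCall value, native Python value) at matching positions.
pairs = [
    (-2.7861e-01, -0.2786),  # first row: agrees
    (1.0916e-36, 0.1026),    # third row: does not agree
    (3.4438e-41, 0.0475),    # third row: does not agree, and subnormal
    (4.2328e+26, 0.1016),    # fifth row: does not agree
]

for pythoncall_value, native_value in pairs:
    print(
        pythoncall_value,
        native_value,
        "close" if roughly_equal(pythoncall_value, native_value) else "DIFFERENT",
        "subnormal" if is_float32_subnormal(pythoncall_value) else "",
    )
```

Only the first two rows of the PythonCall output pass this check; the remaining rows differ by many orders of magnitude, which is far more than round-off.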

Hi Palli,

thanks for responding. I am running the code on the CPU. Also, the problem with threads seems weird to me: Python is executed in a different process, so the threads should not interact.

No. With PythonCall (and with PyCall, and related projects for calling in the other direction; I believe also with RCall, but JavaCall has the JVM in a separate process. EDIT: it’s in the same process, done with JNI), Python, i.e. libpython, is in the same process, sharing memory. That’s usually a good thing (no copy overhead), though it can lead to issues with threading, and also memory corruption.

I just want you to know about this and have it ruled out, though I’m not sure it applies (the other plausible cause, non-determinism, may be more likely):

Is PythonCall/JuliaCall thread safe?


Some rules if you are writing multithreaded code:

  • Only call Python functions from the first thread.
  • You probably also need to call PythonCall.GC.disable() on the main thread before any threaded block of code. Remember to call PythonCall.GC.enable() again afterwards. (This is because Julia finalizers can be called from any thread.)
  • […]
  • You may still encounter problems.

Related issues: #201, #202

And bug here relating to threading:


OK, this is interesting. My bad. But when I start Julia with
julia --check-bounds=yes -t 1 --project=.
and run the code, I still get wrong results.


You can shell out to any process, e.g. python (or anything else, even a different Julia), to call Python in a separate process from the Julia side, if that’s what you want, or to rule out issues with sharing a process.

You can also shell out from the Python side, or use the subprocess module (“Subprocess management” in the Python 3.12.1 documentation).
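
As a rough illustration of the separate-process idea, the sketch below runs a snippet in a fresh Python interpreter with subprocess.run and reads the result back as JSON. The helper name run_in_fresh_python and the toy snippet are made up for this example; for the real test you would pass the interpreter path from the first post (/private/tmp/.CondaPkg/env/bin/python) and a snippet that prints the layernorm output.

```python
import json
import subprocess
import sys


def run_in_fresh_python(snippet, python=sys.executable):
    """Run `snippet` in a new interpreter process.

    The snippet is expected to print exactly one JSON value to stdout.
    `python` is the interpreter to launch (hypothetical default: the
    interpreter that is running this script).
    """
    result = subprocess.run(
        [python, "-c", snippet],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return json.loads(result.stdout)


# Toy stand-in for "compute something and report it back".
snippet = "import json; print(json.dumps([x * 0.5 for x in range(4)]))"
values = run_in_fresh_python(snippet)
print(values)
```

Because the child process shares no memory with the caller, any difference that disappears under this setup would point at the in-process embedding rather than at the model code.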

[And FYI, there’s the recent Malt.jl for running Julia in separate processes, a replacement for using Distributed.]

[There are also alternative projects for calling Python, including a recent one not yet announced that, if I recall correctly, doesn’t use the same process.]

Do you still get wrong but identical results? Or do two Python-only runs differ slightly? And do two PythonCall runs differ slightly from each other, and more from the Python-only runs?

I’m no expert, but I at least found this (maybe your different results are OK, i.e. similar enough, but you can look at this if you insist on bit-identical results):

Sets whether PyTorch operations must use “deterministic” algorithms. That is, algorithms which, given the same input, and when run on the same software and hardware, always produce the same output. […]


torch.set_deterministic_debug_mode() offers an alternative interface for this feature.

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

However, there are some steps you can take to limit the number of sources of nondeterministic behavior for a specific platform, device, and PyTorch release. […]


Deterministic operations are often slower than nondeterministic operations, so single-run performance may decrease for your model. However, determinism may save time in development by facilitating experimentation, debugging, and regression testing.

Controlling sources of randomness

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):

import torch
torch.manual_seed(0)

Some PyTorch operations may use random numbers internally. torch.svd_lowrank() does this, for instance.
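
Putting the quoted advice together, a minimal sketch could look like the following. The function name and its defaults are hypothetical; the calls themselves (torch.manual_seed, torch.use_deterministic_algorithms, torch.set_num_threads, torch.get_num_threads) are regular PyTorch functions. Limiting torch to one CPU thread is my addition to rule out the threading question raised earlier; it is not part of the quoted documentation.

```python
def configure_torch_for_reproducibility(seed=0, num_threads=1):
    """Reduce sources of run-to-run variation before calling the model.

    Hypothetical helper; torch is imported lazily so that this file can
    be loaded on a machine without PyTorch installed.
    """
    import torch

    torch.manual_seed(seed)                   # seed the RNG for all devices
    torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
    torch.set_num_threads(num_threads)        # limit intra-op CPU threads
    return torch.get_num_threads()


# Usage, to be run before loading the model:
#   configure_torch_for_reproducibility(seed=0, num_threads=1)
```

Note that embedding lookup and layernorm in inference use no random numbers, so a seed alone would not be expected to explain differences as large as the ones shown in the first post.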

Can you please give me an example of how I could test whether the problems I observe are caused by Julia and libpython running in the same memory space?

I can’t give you an example of why sharing a process would cause a problem (other than threads or memory corruption). I don’t think it’s a real worry. Memory corruption can be an issue within the Julia process, and if you write out of bounds it need not be limited to Julia; it could spread to Python (since it’s in-process). But I don’t think that’s the case here; I think you eliminated that possibility.

With memory corruption you would likely get a segfault, though there’s no guarantee. If that’s your worry, then --check-bounds=yes will eliminate the problem.

While I say eliminate, if you call e.g. C code from Julia or Python, then all bets are off if it has bugs.

If Julia and Python did NOT share a process, then both could have their own threads (or not), no problem. It’s NOT possible to share threads (in any language) across process boundaries, but it IS possible within a process, and then you, the programmer, must take care to do it correctly, in a thread-safe way. That’s not currently supported, so don’t try. I think that if you call Python from Julia without threads and Python itself uses threads, it may be OK though.

I would look at the other possibility now, from my previous comment.

The phi-1 model is very intriguing. @Tomas_Pevny Note, there’s now also phi-1.5 and then phi-2, which was “updated 4 days ago”:

It’s great that you’re experimenting with phi from Julia. I suppose that question could also be answered from Python alone, or through a web interface; e.g. to get coding advice from an LLM you only need to invoke the model somehow, and calling some API can be enough.

What’s really intriguing, and capturing my mind right now are these two new breakthroughs, just now published in Nature this year (likely not open source, but AlphaCodium is):

Q: How much time did you spend on “prompt engineering” compared to “flow engineering”?

A: Structured output almost completely eliminates the need for simple prompt engineering. We estimate that ~95% of the time we did more high-level design, reasoning, injecting data at the correct places, …, a.k.a. “flow engineering”.

Q: Is this project relevant only to specific programming languages?

A: No. The proposed flow is language agnostic. We generated solutions in Python, but the flow can be applied to any language.

Q: How did you manage the context window?

A: We used models with a context window of 8192 tokens, and we did not encounter cases where it did not suffice. However, […]

Q: Is this work “realistic” in terms of the number of LLM calls?

A: In comparison to AlphaCode, we do four orders of magnitude (!) fewer calls (per solution AlphaCodium does 15-20 calls). […]

Also in Nature recently:


I can see why you want the exact same numbers for confidence (though is that realistic, or needed?), but I think the models may still behave the same, so you could at least try running the model with e.g. that question. Do you get garbage out, or seemingly the same answer, or a very similar or good (better?) answer?

Note also that running models is non-deterministic, so you should not expect the same answer each time unless you set the temperature to 0, I believe, and maybe also a seed. I.e. technically all models/neural networks (even the brain?) are deterministic, since they are not updated, but, like chess programs, in practice they are not, so that you don’t get the same (boring) answer each time to the same question (plus context history).
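
To illustrate the point about temperature and seeds, here is a toy sketch of token sampling using only the Python standard library. The logits are made up and sample_token is a hypothetical helper, not the sampling code of any particular library. With temperature 0 the choice is always the highest-scoring token; with a positive temperature the choice is random, but repeatable when the random number generator is seeded.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores.

    temperature == 0 means greedy decoding (argmax); larger values
    flatten the distribution and make the choice more random.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens

# Greedy decoding ignores the random number generator entirely.
greedy = [sample_token(logits, 0, random.Random()) for _ in range(5)]

# Two generators with the same seed produce the same sampled sequence.
rng_a, rng_b = random.Random(0), random.Random(0)
seeded_a = [sample_token(logits, 1.0, rng_a) for _ in range(10)]
seeded_b = [sample_token(logits, 1.0, rng_b) for _ in range(10)]

print(greedy)
print(seeded_a == seeded_b)
```

This randomness only enters at the sampling step; the forward pass that produces the logits, which is what is being compared in this thread, involves no sampling.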

[The brain necessarily updates its “model”, i.e. it is continuously learning, but if you could switch that off, it would also be deterministic; if you could actually put it in such an environment… We (or some of us…) want LLMs to also learn continuously… And the problem of catastrophic forgetting is solved, at least in a limited way, so it may not be far off for LLMs.]

The problem is that the Phi architecture is not yet in Transformers.jl, so I need to add the loader, which I have never done before. I want to know that I am approximately correct, modulo small differences. Therefore my idea is to compare layer by layer (or block by block).
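
A layer-by-layer comparison of this kind could be sketched as follows: walk through the layers in execution order and report the first one whose outputs differ by more than a tolerance. The function names, the tolerance and the dictionary layout are assumptions for illustration; the sample numbers are the first few values of each tensor printed in the first post.

```python
import math


def max_abs_diff(reference, candidate):
    """Largest elementwise difference; infinity if a NaN is involved."""
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    worst = 0.0
    for ref_value, cand_value in zip(reference, candidate):
        diff = abs(ref_value - cand_value)
        if math.isnan(diff):
            return math.inf
        worst = max(worst, diff)
    return worst


def first_divergence(reference, candidate, atol=1e-4):
    """Return (layer_name, max_diff) for the first layer that disagrees.

    Both arguments map layer names to flat lists of floats, inserted in
    execution order. Returns None if every layer agrees within atol.
    """
    for name, ref_values in reference.items():
        diff = max_abs_diff(ref_values, candidate[name])
        if diff > atol:
            return name, diff
    return None


native = {
    "embed_tokens": [-0.0312, 0.0060, -0.0284],
    "input_layernorm": [-0.2786, 0.0418, 0.1026, 0.0475],
}
pythoncall = {
    "embed_tokens": [-0.0312, 0.0060, -0.0284],
    "input_layernorm": [-2.7861e-01, 4.1824e-02, 1.0916e-36, 3.4438e-41],
}

print(first_divergence(native, pythoncall))
```

The same function works unchanged when one side comes from Python and the other from a Julia port, as long as both are exported as flat lists in the same element order.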

I will probably run phi in Python, save the outputs, and then load them into Julia. If I am successful and have a bit of time, I will write a small how-to to simplify the endeavor for others.
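
One simple way to do the saving, sketched here with only the Python standard library, is to write each intermediate output as raw little-endian float32 values plus a small JSON file with the shape. The file layout and the helper names are assumptions made for this example; a tensor would first have to be converted to a nested list (for instance with tensor.detach().tolist()).

```python
import json
import struct
from pathlib import Path


def flatten(nested):
    """Return (flat_values, shape) for a rectangular nested list of floats."""
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        if not level:
            break
        level = level[0]

    flat = []

    def walk(node):
        if isinstance(node, list):
            for child in node:
                walk(child)
        else:
            flat.append(float(node))

    walk(nested)
    return flat, shape


def save_activation(path, name, nested):
    """Write <path>.bin (raw float32, row-major) and <path>.json (metadata)."""
    flat, shape = flatten(nested)
    path = Path(path)
    path.with_suffix(".bin").write_bytes(struct.pack(f"<{len(flat)}f", *flat))
    meta = {"name": name, "shape": shape, "dtype": "float32", "order": "C"}
    path.with_suffix(".json").write_text(json.dumps(meta))


def load_activation(path):
    """Read back the metadata and the flat list of float32 values."""
    path = Path(path)
    meta = json.loads(path.with_suffix(".json").read_text())
    raw = path.with_suffix(".bin").read_bytes()
    flat = list(struct.unpack(f"<{len(raw) // 4}f", raw))
    return meta, flat
```

On the Julia side, the .bin file can be read as Float32 values into an array whose dimensions are the reverse of the stored shape, since Julia arrays are column-major while the values here are written in row-major order.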


I think I got totally stuck on debugging the RoPE embedding. I do not have the faintest idea how to debug it, because the Julia and Python code are very different.

Phi uses RotaryPositionalEmbedding, but Transformers.jl has so many nested calls that I do not know how to get through them. Does anyone know how to get around this?
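
One way to get a foothold might be a tiny reference implementation that both sides can be checked against on a single vector. The sketch below is plain Python and not taken from either code base. It shows two common layouts for rotary embeddings: one rotates pairs that sit half a vector apart (the “rotate_half” style), the other rotates adjacent pairs. Which layout each library uses is not established in this thread, so treat that as something to verify; the function names and the base of 10000 are assumptions.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Rotation angles position * base**(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_half_split(x, position, base=10000.0):
    """Rotate pairs (x[i], x[i + dim/2]): the "rotate_half" layout."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i, angle in enumerate(rope_angles(position, len(x), base)):
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def rope_interleaved(x, position, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i + 1])."""
    out = [0.0] * len(x)
    for i, angle in enumerate(rope_angles(position, len(x), base)):
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i + 1] * c + x[2 * i] * s
    return out


def to_half_split(x):
    """Reorder an interleaved vector so that paired elements sit dim/2 apart."""
    return x[0::2] + x[1::2]


x = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8]
a = rope_half_split(to_half_split(x), position=3)
b = to_half_split(rope_interleaved(x, position=3))
print(all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b)))
```

The two layouts give the same result up to a fixed permutation of the vector, so if the Julia and Python outputs match only after such a reordering, the difference is in the layout rather than in the rotation itself.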


This is not true. Although, running the JVM in a separate process should be strongly considered.
