PyTorch (Transformers) leads to different results under PythonCall and native Python

Hello,

I am trying to add the Microsoft phi model to Transformers.jl, since it seems like a “small” language model to play with. I also wanted to become more familiar with how the HuggingFace integration is done, for the sake of knowledge.

My idea was to execute the Julia and Python versions in parallel and compare intermediate results. The ideal, of course, would be to execute both in the same Julia session using PythonCall, so that I can directly compare the results. The problem I have run into is that I get different results when (hopefully) the same code is executed under PythonCall and in native Python.

Let me show my MWE.

The Julia version looks like this:

using PythonCall, CondaPkg, DLPack

Torch = pyimport("torch")
Transformers = pyimport("transformers")

model = Transformers.AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", trust_remote_code=true)
tokenizer = Transformers.AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=true)

s = "Tell me something about Julia?"
inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)
Julia output
julia> inputs = tokenizer(s, return_tensors="pt", return_attention_mask=false)
Python: {'input_ids': tensor([[24446,   502,  1223,   546, 22300,    30]])}

julia> e = model.model.embed_tokens(inputs.input_ids)
Python:
tensor([[[-0.0312,  0.0060, -0.0284,  ..., -0.0190, -0.0157,  0.0041],
         [ 0.0246,  0.0117, -0.0089,  ..., -0.0542, -0.0132, -0.0856],
         [ 0.0204,  0.0095, -0.0438,  ..., -0.0451, -0.0419, -0.0366],
         [ 0.0491, -0.0586,  0.0671,  ...,  0.0373, -0.0188, -0.0319],
         [ 0.0177, -0.0105, -0.0267,  ..., -0.0295, -0.0215, -0.0230],
         [ 0.0750, -0.0425, -0.0025,  ...,  0.0589,  0.0373, -0.0148]]],
       grad_fn=<EmbeddingBackward0>)

julia> hidden_states = model.model.layers[0].input_layernorm(e)
Python:
tensor([[[-2.7861e-01,  4.1824e-02, -2.2774e-01,  ..., -1.5344e-01,
          -1.2405e-01,  1.9184e-02],
         [ 1.2535e-01,  5.9343e-02, -2.7328e-02,  ..., -3.0833e-01,
          -6.8912e-02, -2.5117e-01],
         [ 1.0916e-36,  3.4438e-41,  1.0916e-36,  ...,  3.4438e-41,
           1.0899e-36,  3.4438e-41],
         [ 1.0897e-36,  3.4438e-41,  1.0889e-36,  ...,  3.4438e-41,
           2.1460e+29,  4.5817e-41],
         [ 4.2328e+26,  4.5817e-41,  4.2328e+26,  ...,  3.4438e-41,
           1.0568e-36,  3.4438e-41],
         [ 1.5089e-37,  3.4438e-41,  1.0613e-36,  ...,  3.4438e-41,
           1.0501e-36,  3.4438e-41]]], grad_fn=<NativeLayerNormBackward0>)

The Python version looks like this:

import torch
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained('microsoft/phi-1', torch_dtype='auto', trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/phi-1', trust_remote_code=True)

s = 'Tell me something about Julia?'
inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
m = model.model
e = model.model.embed_tokens(inputs.input_ids)
hidden_states = model.model.layers[0].input_layernorm(e)
Python output
>>> inputs = tokenizer(s, return_tensors='pt', return_attention_mask=False)
>>> inputs
{'input_ids': tensor([[24446,   502,  1223,   546, 22300,    30]])}
>>> e = model.model.embed_tokens(inputs.input_ids)
tensor([[[-0.0312,  0.0060, -0.0284,  ..., -0.0190, -0.0157,  0.0041],
         [ 0.0246,  0.0117, -0.0089,  ..., -0.0542, -0.0132, -0.0856],
         [ 0.0204,  0.0095, -0.0438,  ..., -0.0451, -0.0419, -0.0366],
         [ 0.0491, -0.0586,  0.0671,  ...,  0.0373, -0.0188, -0.0319],
         [ 0.0177, -0.0105, -0.0267,  ..., -0.0295, -0.0215, -0.0230],
         [ 0.0750, -0.0425, -0.0025,  ...,  0.0589,  0.0373, -0.0148]]],
       grad_fn=<EmbeddingBackward0>)
>>> hidden_states = model.model.layers[0].input_layernorm(e)
tensor([[[-0.2786,  0.0418, -0.2277,  ..., -0.1534, -0.1240,  0.0192],
         [ 0.1254,  0.0593, -0.0273,  ..., -0.3083, -0.0689, -0.2512],
         [ 0.1026,  0.0475, -0.2446,  ..., -0.2558, -0.2357, -0.1063],
         [ 0.2664, -0.3551,  0.4455,  ...,  0.2194, -0.1002, -0.0922],
         [ 0.1016, -0.0807, -0.1633,  ..., -0.1909, -0.1351, -0.0764],
         [ 0.4102, -0.2567,  0.0133,  ...,  0.3391,  0.2230, -0.0412]]],
       grad_fn=<NativeLayerNormBackward0>)

I executed the Python version by running /private/tmp/.CondaPkg/env/bin/python, as this seems to be the Python environment used by PythonCall. After the layernorm is called, the outputs become very different.
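
To double-check that both sessions really use the same interpreter and the same torch build, a quick sanity check could be run on both sides; a minimal sketch for the PythonCall session (the equivalent two lines can be typed into the standalone Python REPL):

pysys = pyimport("sys")
println(pysys.executable)   # should print /private/tmp/.CondaPkg/env/bin/python
println(Torch.__version__)  # torch version seen by PythonCall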

Does anyone know what I am doing wrong?

Thanks for suggestions.

Tomas


That seems like a bug. You should be able to call all Python libraries. The only exception I can think of relates to threading; I'm not sure it's multi-thread-safe. Could you try without threading, i.e. start Julia without non-default options like -t auto?

I’m not sure if the Python side uses threads, e.g. torch adding (CPU) threads dynamically. That may or may not be OK; it could be worth looking into whether it's possible to disable. The GPU side is always threaded or parallel, and I think that's never an issue to worry about.
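
For example (a sketch, assuming the Torch handle from the MWE above), both sides could be pinned to a single thread before re-running the comparison:

Threads.nthreads()        # Julia-side thread count (1 when started without -t)
Torch.get_num_threads()   # size of torch's intra-op CPU thread pool
Torch.set_num_threads(1)  # force single-threaded CPU kernels in torch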

Could it simply be a run-to-run difference, independent of PythonCall, with both results fine even though some numbers look very different? Because of non-determinism? Can you ask for deterministic results, by setting some random seed?

I didn’t look carefully. I see slightly different values; is it more than round-off error? If not, it's not too serious, but I would still like to know why, and how to get bit-identical results. Some GPU setting?

Not a huge absolute or relative error here:

julia> -2.7861e-01 - -0.2786
-1.0000000000010001e-5

julia> -2.7861e-01 / -0.2786
1.0000358937544866

julia> -2.7861e-01 ≈ -0.2786
false

EDIT: very different:
julia> 4.2328e+26 - 0.4102
4.2328e26
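
As an aside, rather than eyeballing the printed excerpts, the whole tensors can be compared numerically on the Julia side; a sketch reusing hidden_states from the MWE (pyconvert copies the data into a Julia Array):

H = pyconvert(Array, hidden_states.detach().numpy())  # copy the torch tensor into a Julia Array
extrema(H)                                             # the ~1e26 and denormal entries stand out immediately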

Hi Palli,

thanks for responding. I execute the code on the CPU. Also, the problem with threads seems weird to me, because Python is executed in a different process, so the threads should not interact.

No. With PythonCall (and with PyCall, and related projects that call in the other direction; I believe also with RCall; I thought JavaCall had the JVM in a separate process, but EDIT: it's in the same process, done with JNI), Python, i.e. libpython, lives in the same process, sharing memory. And that's a good thing (usually: no copy overhead), though it could have issues with threading, and also memory corruption…

I just want you to know about this and have it ruled out, though I'm not sure it applies (the other plausible cause, non-determinism, may be more likely):

https://docs.juliahub.com/General/PythonCall/0.9.14/faq/

Is PythonCall/JuliaCall thread safe?

No.

Some rules if you are writing multithreaded code:

  • Only call Python functions from the first thread.
  • You probably also need to call PythonCall.GC.disable() on the main thread before any threaded block of code. Remember to call PythonCall.GC.enable() again afterwards. (This is because Julia finalizers can be called from any thread.)
  • […]
  • You may still encounter problems.

Related issues: #201, #202
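
If one did want to mix Julia threads with Python anyway, the pattern from the FAQ quoted above would look roughly like this (a sketch; all Python calls stay on the first thread):

PythonCall.GC.disable()          # per the FAQ: Julia finalizers can run on any thread
Threads.@threads for i in 1:10
    # pure-Julia work only in here; no Python calls off the first thread
end
PythonCall.GC.enable()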

And there is a bug here relating to threading:


OK, this is interesting. My bad. But when I run Julia and execute the code as
julia --check-bounds=yes -t 1 --project=., I am still getting wrong results.


You can shell out to whatever process, e.g. python (or anything else, or even a different Julia), to call Python in a separate process from the Julia side, if that's what you want, or to rule out issues with sharing the same process.

You can also shell out from the Python side, or use subprocess — Subprocess management — Python 3.12.1 documentation
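
A sketch of ruling out in-process effects this way: run the Python-only version of the MWE in a fresh process, using the same interpreter CondaPkg manages (the script name and its torch.save output path are hypothetical), and compare the saved tensors afterwards:

python = joinpath(CondaPkg.envdir(), "bin", "python")  # CondaPkg is already loaded in the MWE
run(`$python /tmp/phi_reference.py`)                   # the script would torch.save(...) its intermediate outputs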

[And FYI, there's the recent new Malt.jl for running Julias in separate processes, a replacement for using Distributed.]

[There are also alternative projects to call Python, one recent and not yet announced, which, if I recall correctly, doesn't use the same process.]

Do you still get wrong, but identical, results? Or do repeated Python-only runs also differ slightly from each other? And do repeated PythonCall runs differ slightly from each other, with a larger difference when comparing across to Python-only?

I'm no expert, but I at least found this (maybe your different results are OK, i.e. similar enough, but you can look at this if you insist on bit-identical results):

https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html

Sets whether PyTorch operations must use “deterministic” algorithms. That is, algorithms which, given the same input, and when run on the same software and hardware, always produce the same output. […]

Note

torch.set_deterministic_debug_mode() offers an alternative interface for this feature.

https://pytorch.org/docs/stable/notes/randomness.html

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

However, there are some steps you can take to limit the number of sources of nondeterministic behavior for a specific platform, device, and PyTorch release. […]

Warning

Deterministic operations are often slower than nondeterministic operations, so single-run performance may decrease for your model. However, determinism may save time in development by facilitating experimentation, debugging, and regression testing.

Controlling sources of randomness

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):

import torch
torch.manual_seed(0)

Some PyTorch operations may use random numbers internally. torch.svd_lowrank() does this, for instance.
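
From the PythonCall side, applying those knobs would look something like this (a sketch using the Torch handle from the MWE; whether they matter for a pure CPU forward pass is exactly what is in question):

Torch.manual_seed(0)                      # seed the RNG for all devices
Torch.use_deterministic_algorithms(true)  # error out on known-nondeterministic ops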

Can you please give me an example of how I can test whether the problem I observe is caused by Julia and libpython being executed in the same memory space?

I can't give you an example of why sharing a process would cause a problem (not involving threads or memory corruption); I don't think it's a real worry. I mean, memory corruption can be an issue within a Julia process, and it need not be limited to Julia: if you write out of bounds, it could spread to Python (since it's in-process). But I don't think that's the case here; I think you eliminated that possibility.

With memory corruption you would likely get a segfault, though there is no guarantee. If that's your worry, then --check-bounds=yes will eliminate the problem.

While I say eliminate: if you call e.g. C code, from Julia or Python, then all bets are off if it has bugs.

If Julia and Python did NOT share a process, then both could have their own threads (or not), no problem. It's NOT possible to share threads (in any language) across process boundaries, but it IS possible within a process, and then the programmer must take care to do it correctly, in a thread-safe way. That's not currently supported, so don't try it… Though I think if you call Python from Julia without threads, and Python itself uses threads, it may be OK.

I would look at the other possibility now, in my previous comment.

The phi-1 model is very intriguing. @Tomas_Pevny Note that there are now also phi-1.5 and phi-2, the latter “updated 4 days ago”:

It's great that you're experimenting with phi from Julia. I suppose that question could, though, be answered from Python only, or from a web interface; e.g. to get coding advice from an LLM you only need to invoke the model somehow, and calling some API can be enough.

What's really intriguing, and capturing my mind right now, are these two new breakthroughs, just published in Nature this year (likely not open source, but AlphaCodium is):

https://www.nature.com/articles/s41586-023-06747-5

Q: How much time did you spend on “prompt engineering” compared to “flow engineering”?

A: Structured output almost completely eliminates the need for simple prompt engineering. We estimate that ~95% of the time we did more high-level design, reasoning, injecting data at the correct places, …, a.k.a. “flow engineering”.

Q: Is this project relevant only to specific programming languages?

A: No. The proposed flow is language agnostic. We generated solutions in Python, but the flow can be applied to any language.

Q: How did you manage the context window?

A: We used models with a context window of 8192 tokens, and we did not encounter cases where it did not suffice. However, […]

Q: Is this work “realistic” in terms of the number of LLM calls?

A: In comparison to AlphaCode, we do four orders of magnitude (!) fewer calls (per solution AlphaCodium does 15-20 calls). […]

Also in Nature recently:


I can see why you want exactly the same numbers (though is that realistic, or needed?) for confidence, but I think those models may behave the same, so you could at least try running the model, e.g. with that question. Do you get garbage out, or a seemingly identical answer, or a very similar or good (even better?) answer?

Note also that running models is non-deterministic, so you shouldn't expect the same answer each time, unless you set the temperature to 0, I believe, and maybe a seed as well. I.e. technically all models/neural networks (even the brain?) are deterministic, since they are not updated during inference, but, like chess programs, they are made non-deterministic in practice so as not to give the same (boring) answers each time to the same questions (plus context history).
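
For a quick end-to-end check along those lines, greedy decoding removes the sampling randomness; a sketch reusing model, tokenizer and inputs from the MWE (max_new_tokens is an arbitrary choice):

out = model.generate(inputs.input_ids, max_new_tokens=64, do_sample=false)  # greedy, i.e. no sampling
println(tokenizer.decode(out[0], skip_special_tokens=true))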

[Necessarily the brain updates its “model”, i.e. it is continuously learning, but if you could switch that off, it would also be deterministic, if you could actually put it in such an environment… We want, or some of us want, LLMs to also be continuously learning… And the problem of catastrophic forgetting has been solved, at least in a limited way, so it may not be far off for LLMs.]

The problem is that the Phi architecture is not yet in Transformers.jl, so I need to add the loader, which I have never done before. So I want to know that I am approximately correct, modulo small differences. Therefore my idea is to compare layer by layer (or block by block).

I will probably execute phi in Python, save the outputs, and then load them into Julia. If I am successful and have a bit of time, I will write a small how-to to simplify the endeavor for others.
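
A minimal sketch of the loading side, assuming the Python run did numpy.save("embed_tokens_out.npy", e.detach().numpy()) for each layer of interest (the file name and the Julia-side function are hypothetical):

using NPZ
ref = npzread("embed_tokens_out.npy")   # reference output saved from the Python-only run
jl  = my_phi_embedding(input_ids)       # hypothetical Transformers.jl-side result, permuted to match ref's layout
maximum(abs, ref .- jl)                 # should stay within a small tolerance, e.g. ~1e-4 for Float32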


I think I am totally stuck on how to debug the RoPE embedding. I do not have the faintest idea how to debug it, because the Julia and Python code are structured very differently.

Phi uses RotaryPositionalEmbedding, but Transformers.jl has so many nested calls that I do not know how to pierce them. Does anyone know how to get around this?
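
One way to sidestep the nesting (a sketch, not Transformers.jl's own code): write a standalone NeoX-style “rotate_half” rotary embedding and compare it against the Python intermediates for a single head laid out as (head_dim, seq_len). Partial rotation (rotary_dim < head_dim), base = 10000, and positions starting at 0 are assumptions to check against phi's config:

function rope_reference(x::AbstractMatrix, rotary_dim::Integer; base = 10000.0)
    d, len = size(x)                                           # (head_dim, seq_len)
    half = rotary_dim ÷ 2
    inv_freq = 1 ./ (base .^ ((0:half-1) .* 2 ./ rotary_dim))  # (half,)
    θ = inv_freq * (0:len-1)'                                  # (half, len); θ[i, p] = position * inv_freq[i]
    x1 = x[1:half, :]
    x2 = x[half+1:rotary_dim, :]
    y = float.(x)
    y[1:half, :]            .= x1 .* cos.(θ) .- x2 .* sin.(θ)
    y[half+1:rotary_dim, :] .= x2 .* cos.(θ) .+ x1 .* sin.(θ)
    return y                                                   # dims rotary_dim+1:d pass through unchanged
end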


This is not true, although running the JVM in a separate process should be strongly considered.
