PythonCall allocations

I am trying to understand how to minimize allocations in PythonCall. Here I am using stim, a Python library, to generate samples (stored in NumPy arrays). I have two implementations:

```julia
function generate_samples(; circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    PythonCall.@pyexec (circ_string, num_samples, batch_seeds_ranges) => """
    import numpy as np
    import stim
    # Set up the circuit and sampler
    circ_stim = stim.Circuit(circ_string)
    nshots = sum(len(seeds) for seeds in batch_seeds_ranges) * num_samples
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    det_shots_np = np.zeros(shape=(nshots, (ndets + 7) // 8), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, (nobs + 7) // 8), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges:
        for idx, seed in enumerate(seeds):
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=True, bit_packed=True,
                           dets_out=det_shots_np[counter:counter + num_samples],
                           obs_out=obs_shots_np[counter:counter + num_samples])
            counter += num_samples
    """ => (det_shots_np, obs_shots_np, ndets, nobs)
    return det_shots_np, obs_shots_np, ndets, nobs
end
```

```julia
function generate_samples_wrapper(circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    circ_stim = stim.Circuit(circ_string)
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    nshots = sum(length.(batch_seeds_ranges)) * num_samples
    det_shots_np = np.zeros(shape=(nshots, cld(pyconvert(Int, ndets), 8)), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, cld(pyconvert(Int, nobs), 8)), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges
        for (idx, seed) in enumerate(seeds)
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=true, bit_packed=true,
                           dets_out=det_shots_np[counter:counter + num_samples - 1],
                           obs_out=obs_shots_np[counter:counter + num_samples - 1])
            PythonCall.Core.pydel!(sampler)
            counter += num_samples
        end
    end
    return det_shots_np, obs_shots_np, ndets, nobs
end
```
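For reference, the `(ndets + 7) // 8` row width (and the `cld(..., 8)` equivalent) comes from the bit packing: with `bit_packed=True`, each `uint8` byte holds 8 outcomes, so the width is the number of bits rounded up to whole bytes. A minimal NumPy sketch of the same packing (assuming little-endian bit order within each byte, which is what I understand stim uses):

```python
import numpy as np

ndets = 13                    # example number of detectors
width = (ndets + 7) // 8      # bytes needed per shot: ceil(ndets / 8)

bits = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1], dtype=np.uint8)
packed = np.packbits(bits, bitorder="little")                  # shape (width,)
unpacked = np.unpackbits(packed, count=ndets, bitorder="little")

print(width)                           # 2
print(packed.shape)                    # (2,)
print(np.array_equal(unpacked, bits))  # True: round-trips losslessly
```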

```julia
julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples(; circ_string);
  0.130245 seconds (156 allocations: 3.430 KiB)

julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples_wrapper(circ_string);
  0.140203 seconds (8.38 k allocations: 178.477 KiB)
```

I do not understand why the second implementation has such a higher allocation count. Any insight would be appreciated.
Using the first implementation would be nice, but unfortunately I run into errors when I try to run it on multiple workers.
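One detail I'm relying on in both versions (a general NumPy point, nothing PythonCall-specific): passing `det_shots_np[counter:counter+num_samples]` as `dets_out` only fills the big array in place because a basic slice is a writable view of its parent; fancy indexing (e.g. a list of row indices) would hand the sampler an independent copy and the writes would be lost. A quick check:

```python
import numpy as np

a = np.zeros((4, 2), dtype=np.uint8)

view = a[1:3]        # basic slice: a view into `a`
view[:] = 7          # writing through the view mutates `a`

copy = a[[0, 3]]     # fancy indexing: an independent copy
copy[:] = 9          # writing to the copy leaves `a` untouched

print(a[1, 0], a[0, 0])  # 7 0
print(view.base is a)    # True: `view` shares memory with `a`
print(copy.base is a)    # False
```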

Put your code in triple backtick blocks, like this
```
```
Not only would it be formatted to be readable, you also wouldn’t accidentally tag other users with @. Incidentally, isolated expressions like that go in single backticks like ` `; you can read them here because I’m escaping the special characters with \.

From what I can eyeball at the moment, you’re getting overhead from the numerous PythonCall wrapper calls crossing the language barrier; by comparison, the `@pyexec` code crosses from Julia to Python and back to Julia only once. There may also be extra overhead from some wrappers not being type-stable across the language barrier, which you can check with `@code_warntype`. Also note that Julia is unable to profile allocations that occur in Python itself, so depending on how large the Python allocations are, this extra overhead may not even matter.
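If you do want a rough picture of the Python-side allocations, Python’s standard `tracemalloc` module can report them (NumPy registers its buffers with it). A minimal sketch, not specific to stim; `make_arrays` here is just a stand-in for whatever the sampler allocates:

```python
import tracemalloc
import numpy as np

def make_arrays(nshots, width):
    # Stand-in for the sampling code: allocates two packed uint8 arrays.
    dets = np.zeros((nshots, width), dtype=np.uint8)
    obs = np.zeros((nshots, width), dtype=np.uint8)
    return dets, obs

tracemalloc.start()
dets, obs = make_arrays(200, 13)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(current, peak)  # bytes currently allocated / peak since start()
```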

Thanks for the formatting advice! Very helpful!
This is what I get when I profile allocations for the Julia wrapper version:

If I understand correctly, both functions might be doing a similar number of heap allocations, but the allocations made by Python are not counted; is that why the mostly-Python-based function shows a lower count?

I also tried `@code_warntype`, and it does not show an issue so far:

```julia
julia> @code_warntype generate_samples(;circ_string)
MethodInstance for Core.kwcall(::@NamedTuple{circ_string::String}, ::typeof(generate_samples))
  from kwcall(::NamedTuple, ::typeof(generate_samples)) @ Main /pscratch/sd/r/rismail/mwe_stor/mwe_soft/scripts/play.jl:34
Arguments
  _::Core.Const(Core.kwcall)
  @_2::@NamedTuple{circ_string::String}
  @_3::Core.Const(Main.generate_samples)
Locals
  @_4::Union{Int64, String, Vector{UnitRange{Int64}}}
  circ_string::String
  num_samples::Int64
  batch_seeds_ranges::Vector{UnitRange{Int64}}
Body::NTuple{4, Py}
1 ──       Core.NewvarNode(:(@_4))
│    %2  = Core.isdefined(@_2, :circ_string)::Core.Const(true)
└───       goto #3 if not %2
2 ──       (@_4 = Core.getfield(@_2, :circ_string))
└───       goto #4
3 ──       Core.Const(:(Core.UndefKeywordError(:circ_string)))
└───       Core.Const(:(@_4 = Core.throw(%6)))
4 ┄─ %8  = @_4::String
│          (circ_string = %8)
│    %10 = Core.isdefined(@_2, :num_samples)::Core.Const(false)
└───       goto #6 if not %10
5 ──       Core.Const(:(@_4 = Core.getfield(@_2, :num_samples)))
└───       Core.Const(:(goto %15))
6 ┄─       (@_4 = 1)
│    %15 = @_4::Core.Const(1)
│          (num_samples = %15)
│    %17 = Core.isdefined(@_2, :batch_seeds_ranges)::Core.Const(false)
└───       goto #8 if not %17
7 ──       Core.Const(:(@_4 = Core.getfield(@_2, :batch_seeds_ranges)))
└───       Core.Const(:(goto %24))
8 ┄─ %21 = (1:100)::Core.Const(1:100)
│    %22 = (150:200)::Core.Const(150:200)
│          (@_4 = Base.vect(%21, %22))
│    %24 = @_4::Vector{UnitRange{Int64}}
│          (batch_seeds_ranges = %24)
│    %26 = Base.keys(@_2)::Core.Const((:circ_string,))
│    %27 = (:circ_string, :num_samples, :batch_seeds_ranges)::Core.Const((:circ_string, :num_samples, :batch_seeds_ranges))
│    %28 = Base.diff_names(%26, %27)::Core.Const(())
│    %29 = Base.isempty(%28)::Core.Const(true)
└───       goto #10 if not %29
9 ──       goto #11
10 ─       Core.Const(:(Base.kwerr(@_2, @_3)))
11 ┄ %33 = Main.:(var"#generate_samples#23")::Core.Const(Main.var"#generate_samples#23")
│    %34 = circ_string::String
│    %35 = num_samples::Core.Const(1)
│    %36 = batch_seeds_ranges::Vector{UnitRange{Int64}}
│    %37 = (%33)(%34, %35, %36, @_3)::NTuple{4, Py}
└───       return %37

julia> @code_warntype generate_samples_wrapper(circ_string)
MethodInstance for generate_samples_wrapper(::String)
  from generate_samples_wrapper(circ_string) @ Main REPL[5]:1
Arguments
  #self#::Core.Const(Main.generate_samples_wrapper)
  circ_string::String
Body::NTuple{4, Py}
1 ─ %1 = (1:100)::Core.Const(1:100)
│   %2 = (150:200)::Core.Const(150:200)
│   %3 = Base.vect(%1, %2)::Vector{UnitRange{Int64}}
│   %4 = (#self#)(circ_string, 1, %3)::NTuple{4, Py}
└──      return %4
```

Right.

Your `@code_warntype` printouts actually don’t show the bulk of what the methods do; they only show the handling of `generate_samples`’s keyword arguments and `generate_samples_wrapper`’s default positional arguments before calling the methods that do the real work.

`@code_warntype` also generally does not recursively infer the bodies of the other methods being called. That doesn’t matter for `generate_samples`, because it’s not going to see anything going on inside the Python code string anyway (which is a bit of a relief, because manually tracking down the underlying method of a keyworded method is a pain; better to use Cthulhu.jl), but you’ll want to check `@code_warntype generate_samples_wrapper(circ_string, 1, [1:100, 150:200])`. Also note that the `Py` wrapper may be technically type-stable and look fine, but that’s deceptive; it just affirms that PythonCall is not distinguishing the Python-side types for the Julia side. I'm not entirely sure what all of `det_shots_np`, `obs_shots_np`, `ndets`, `nobs` can be, but the first two could be `Matrix{UInt8}` or whatever the `PyArray` wrapper is.

If I were you I’d not worry much about the allocations: the two snippets of code run at essentially the same speed, so why care?

Reducing allocations is usually only relevant when micro-optimising tight loops where the allocations become a bottleneck, whereas in this case they add a pretty small overhead (maybe ≈10%).

My main motivation is that the second implementation, using Julia wrappers, produces memory leaks that lead to out-of-memory (OOM) events. I’m not entirely sure why this occurs, but I suspect it’s due to allocations that are managed on both the Julia and Python sides (and the C++ side as well, since stim is implemented in C++) without being properly freed. This is why I was experimenting with reducing such allocations using the `@pyexec` code, and it managed to avoid the memory leak.
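One pure-Python way I've found to check whether a buffer actually gets freed is a `weakref` probe (a diagnostic sketch, not a fix; the array here is a made-up stand-in for a sample buffer). It also illustrates one way arrays can outlive expectations: any view, like the slices passed to `dets_out`, keeps the parent buffer alive through its `.base` reference.

```python
import gc
import weakref
import numpy as np

a = np.zeros((100, 13), dtype=np.uint8)  # stand-in for a sample buffer
ref = weakref.ref(a)                     # does not keep `a` alive

view = a[10:20]  # a slice keeps the parent buffer alive via .base
del a
print(ref() is None)  # False: `view` still holds a reference

del view
gc.collect()
print(ref() is None)  # True: nothing references the buffer anymore
```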

On the Julia, Python, or C++ side?