PythonCall allocations

I am trying to understand how to minimize allocations in PythonCall. Here I am using stim, a Python library, to generate samples (stored in NumPy arrays). I have two implementations:

```julia
function generate_samples(; circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    PythonCall.@pyexec (circ_string, num_samples, batch_seeds_ranges) => """
    import numpy as np
    import stim
    # Set up the circuit and sampler
    circ_stim = stim.Circuit(circ_string)
    nshots = sum(len(seeds) for seeds in batch_seeds_ranges) * num_samples
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    det_shots_np = np.zeros(shape=(nshots, (ndets + 7) // 8), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, (nobs + 7) // 8), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges:
        for idx, seed in enumerate(seeds):
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=True, bit_packed=True,
                           dets_out=det_shots_np[counter:counter + num_samples],
                           obs_out=obs_shots_np[counter:counter + num_samples])
            counter += num_samples
    """ => (det_shots_np, obs_shots_np, ndets, nobs)
    return det_shots_np, obs_shots_np, ndets, nobs
end
```

```julia
function generate_samples_wrapper(circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    circ_stim = stim.Circuit(circ_string)
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    nshots = sum(length.(batch_seeds_ranges)) * num_samples
    det_shots_np = np.zeros(shape=(nshots, cld(pyconvert(Int, ndets), 8)), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, cld(pyconvert(Int, nobs), 8)), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges
        for (idx, seed) in enumerate(seeds)
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=true, bit_packed=true,
                           dets_out=det_shots_np[counter:counter + num_samples - 1],
                           obs_out=obs_shots_np[counter:counter + num_samples - 1])
            PythonCall.Core.pydel!(sampler)
            counter += num_samples
        end
    end
    return det_shots_np, obs_shots_np, ndets, nobs
end
```
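For reference, the `(ndets + 7) // 8` row width (and the `cld(..., 8)` equivalent) comes from the bit packing: with `bit_packed=True`, each `uint8` byte holds 8 outcomes, so the width is the number of bits rounded up to whole bytes. A minimal NumPy sketch of the same packing (assuming little-endian bit order within each byte, which is what I understand stim uses):

```python
import numpy as np

ndets = 13                    # example number of detectors
width = (ndets + 7) // 8      # bytes needed per shot: ceil(ndets / 8)

bits = np.array([1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1], dtype=np.uint8)
packed = np.packbits(bits, bitorder="little")                  # shape (width,)
unpacked = np.unpackbits(packed, count=ndets, bitorder="little")

print(width)                           # 2
print(packed.shape)                    # (2,)
print(np.array_equal(unpacked, bits))  # True: round-trips losslessly
```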

```julia
julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples(; circ_string);
  0.130245 seconds (156 allocations: 3.430 KiB)

julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples_wrapper(circ_string);
  0.140203 seconds (8.38 k allocations: 178.477 KiB)
```

I do not understand why the second implementation has such a higher allocation count. Any insight would be appreciated.
Using the first implementation would be nice, but unfortunately I run into errors when I try to run it on multiple workers.
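One detail I'm relying on in both versions (a general NumPy point, nothing PythonCall-specific): passing `det_shots_np[counter:counter+num_samples]` as `dets_out` only fills the big array in place because a basic slice is a writable view of its parent; fancy indexing (e.g. a list of row indices) would hand the sampler an independent copy and the writes would be lost. A quick check:

```python
import numpy as np

a = np.zeros((4, 2), dtype=np.uint8)

view = a[1:3]        # basic slice: a view into `a`
view[:] = 7          # writing through the view mutates `a`

copy = a[[0, 3]]     # fancy indexing: an independent copy
copy[:] = 9          # writing to the copy leaves `a` untouched

print(a[1, 0], a[0, 0])  # 7 0
print(view.base is a)    # True: `view` shares memory with `a`
print(copy.base is a)    # False
```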

Put your code in triple backtick blocks, like this
```
```
Not only would it be formatted to be readable, you also wouldn’t accidentally tag other users with @. Incidentally, isolated expressions like that go in single backticks like ` `; you can read them here because I’m escaping the special characters with \.

From what I can eyeball at the moment, you’re getting overhead from the numerous PythonCall wrapper calls crossing the language barrier; by comparison, the `@pyexec` code crosses from Julia to Python and back to Julia only once. There may also be extra overhead from some wrappers not being type-stable across the language barrier, which you can check with `@code_warntype`. Also note that Julia is unable to profile allocations that occur in Python itself, so depending on how large the Python allocations are, this extra overhead may not even matter.
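If you do want a rough picture of the Python-side allocations, Python’s standard `tracemalloc` module can report them (NumPy registers its buffers with it). A minimal sketch, not specific to stim; `make_arrays` here is just a stand-in for whatever the sampler allocates:

```python
import tracemalloc
import numpy as np

def make_arrays(nshots, width):
    # Stand-in for the sampling code: allocates two packed uint8 arrays.
    dets = np.zeros((nshots, width), dtype=np.uint8)
    obs = np.zeros((nshots, width), dtype=np.uint8)
    return dets, obs

tracemalloc.start()
dets, obs = make_arrays(200, 13)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(current, peak)  # bytes currently allocated / peak since start()
```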

Thanks for the formatting advice! Very helpful!
This is what I get when I profile allocations for the Julia wrapper version:

If I understand correctly, both functions might be doing a similar number of heap allocations, but the allocations made by Python are not counted; is that why the mostly-Python-based function shows a lower count?

I also tried `@code_warntype`, and it does not show an issue so far:

```julia
julia> @code_warntype generate_samples(;circ_string)
MethodInstance for Core.kwcall(::@NamedTuple{circ_string::String}, ::typeof(generate_samples))
  from kwcall(::NamedTuple, ::typeof(generate_samples)) @ Main /pscratch/sd/r/rismail/mwe_stor/mwe_soft/scripts/play.jl:34
Arguments
  _::Core.Const(Core.kwcall)
  @_2::@NamedTuple{circ_string::String}
  @_3::Core.Const(Main.generate_samples)
Locals
  @_4::Union{Int64, String, Vector{UnitRange{Int64}}}
  circ_string::String
  num_samples::Int64
  batch_seeds_ranges::Vector{UnitRange{Int64}}
Body::NTuple{4, Py}
1 ──       Core.NewvarNode(:(@_4))
│    %2  = Core.isdefined(@_2, :circ_string)::Core.Const(true)
└───       goto #3 if not %2
2 ──       (@_4 = Core.getfield(@_2, :circ_string))
└───       goto #4
3 ──       Core.Const(:(Core.UndefKeywordError(:circ_string)))
└───       Core.Const(:(@_4 = Core.throw(%6)))
4 ┄─ %8  = @_4::String
│          (circ_string = %8)
│    %10 = Core.isdefined(@_2, :num_samples)::Core.Const(false)
└───       goto #6 if not %10
5 ──       Core.Const(:(@_4 = Core.getfield(@_2, :num_samples)))
└───       Core.Const(:(goto %15))
6 ┄─       (@_4 = 1)
│    %15 = @_4::Core.Const(1)
│          (num_samples = %15)
│    %17 = Core.isdefined(@_2, :batch_seeds_ranges)::Core.Const(false)
└───       goto #8 if not %17
7 ──       Core.Const(:(@_4 = Core.getfield(@_2, :batch_seeds_ranges)))
└───       Core.Const(:(goto %24))
8 ┄─ %21 = (1:100)::Core.Const(1:100)
│    %22 = (150:200)::Core.Const(150:200)
│          (@_4 = Base.vect(%21, %22))
│    %24 = @_4::Vector{UnitRange{Int64}}
│          (batch_seeds_ranges = %24)
│    %26 = Base.keys(@_2)::Core.Const((:circ_string,))
│    %27 = (:circ_string, :num_samples, :batch_seeds_ranges)::Core.Const((:circ_string, :num_samples, :batch_seeds_ranges))
│    %28 = Base.diff_names(%26, %27)::Core.Const(())
│    %29 = Base.isempty(%28)::Core.Const(true)
└───       goto #10 if not %29
9 ──       goto #11
10 ─       Core.Const(:(Base.kwerr(@_2, @_3)))
11 ┄ %33 = Main.:(var"#generate_samples#23")::Core.Const(Main.var"#generate_samples#23")
│    %34 = circ_string::String
│    %35 = num_samples::Core.Const(1)
│    %36 = batch_seeds_ranges::Vector{UnitRange{Int64}}
│    %37 = (%33)(%34, %35, %36, @_3)::NTuple{4, Py}
└───       return %37

julia> @code_warntype generate_samples_wrapper(circ_string)
MethodInstance for generate_samples_wrapper(::String)
  from generate_samples_wrapper(circ_string) @ Main REPL[5]:1
Arguments
  #self#::Core.Const(Main.generate_samples_wrapper)
  circ_string::String
Body::NTuple{4, Py}
1 ─ %1 = (1:100)::Core.Const(1:100)
│   %2 = (150:200)::Core.Const(150:200)
│   %3 = Base.vect(%1, %2)::Vector{UnitRange{Int64}}
│   %4 = (#self#)(circ_string, 1, %3)::NTuple{4, Py}
└──      return %4
```

Right.

Your `@code_warntype` printouts actually don’t show the bulk of what the methods do; they only show the handling of `generate_samples`’s keyword arguments and `generate_samples_wrapper`’s default positional arguments before calling the methods that do the real work.

`@code_warntype` also generally does not recursively infer the bodies of the other methods being called. That doesn’t matter for `generate_samples`, because it’s not going to see anything going on inside the Python code string anyway (which is a bit of a relief, because manually tracking down the underlying method of a keyworded method is a pain; better to use Cthulhu.jl), but you’ll want to check `@code_warntype generate_samples_wrapper(circ_string, 1, [1:100, 150:200])`. Also note that the `Py` wrapper may be technically type-stable and look fine, but that’s deceptive; it just affirms that PythonCall is not distinguishing the Python-side types for the Julia side. I'm not entirely sure what all of `det_shots_np`, `obs_shots_np`, `ndets`, `nobs` can be, but the first two could be `Matrix{UInt8}` or whatever the `PyArray` wrapper is.

If I were you I’d not worry much about the allocations: the two snippets of code run at essentially the same speed, so why care?

Reducing allocations is usually only relevant when micro-optimising tight loops where the allocations become a bottleneck, whereas in this case they add a pretty small overhead (maybe ≈10%).

My main motivation is that the second implementation, using Julia wrappers, produces memory leaks that lead to out-of-memory (OOM) events. I’m not entirely sure why this occurs, but I suspect it’s due to allocations that are managed on both the Julia and Python sides (and the C++ side as well, since stim is implemented in C++) without being properly freed. This is why I was experimenting with reducing such allocations using the `@pyexec` code, and it managed to avoid the memory leak.
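One pure-Python way I've found to check whether a buffer actually gets freed is a `weakref` probe (a diagnostic sketch, not a fix; the array here is a made-up stand-in for a sample buffer). It also illustrates one way arrays can outlive expectations: any view, like the slices passed to `dets_out`, keeps the parent buffer alive through its `.base` reference.

```python
import gc
import weakref
import numpy as np

a = np.zeros((100, 13), dtype=np.uint8)  # stand-in for a sample buffer
ref = weakref.ref(a)                     # does not keep `a` alive

view = a[10:20]  # a slice keeps the parent buffer alive via .base
del a
print(ref() is None)  # False: `view` still holds a reference

del view
gc.collect()
print(ref() is None)  # True: nothing references the buffer anymore
```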

On the Julia, Python, or C++ side?