PythonCall allocations

I am trying to understand how to minimize allocations in PythonCall. Here I am using stim, a Python library, to generate samples (stored in NumPy arrays). I have two implementations:

```julia
function generate_samples(; circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    PythonCall.@pyexec (circ_string, num_samples, batch_seeds_ranges) => """
    import numpy as np
    import stim
    # Set up the circuit and sampler
    circ_stim = stim.Circuit(circ_string)
    nshots = sum(len(seeds) for seeds in batch_seeds_ranges) * num_samples
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    det_shots_np = np.zeros(shape=(nshots, (ndets + 7) // 8), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, (nobs + 7) // 8), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges:
        for idx, seed in enumerate(seeds):
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=True, bit_packed=True,
                           dets_out=det_shots_np[counter:counter + num_samples],
                           obs_out=obs_shots_np[counter:counter + num_samples])
            counter += num_samples
    """ => (det_shots_np, obs_shots_np, ndets, nobs)
    return det_shots_np, obs_shots_np, ndets, nobs
end
```
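As a side note on the sizing: with `bit_packed=True`, stim packs 8 detector bits per `uint8` byte, so both buffers need `ceil(n / 8)` columns, and `(n + 7) // 8` is the usual integer-arithmetic way to write that ceiling. A quick, purely illustrative sanity check:

```python
# Why the bit-packed buffers have (n + 7) // 8 columns: with 8 bits per
# uint8 byte, n bits need ceil(n / 8) bytes, and (n + 7) // 8 computes
# exactly that ceiling using only integer arithmetic.
import math

assert all((n + 7) // 8 == math.ceil(n / 8) for n in range(1000))
print((13 + 7) // 8)  # e.g. 13 detectors pack into 2 bytes per shot -> 2
```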

```julia
function generate_samples_wrapper(circ_string, num_samples=1, batch_seeds_ranges=[1:100, 150:200])
    circ_stim = stim.Circuit(circ_string)
    ndets = circ_stim.num_detectors
    nobs = circ_stim.num_observables
    nshots = sum(length.(batch_seeds_ranges)) * num_samples
    det_shots_np = np.zeros(shape=(nshots, Int(ceil(pyconvert(Int, ndets) / 8))), dtype=np.uint8)
    obs_shots_np = np.zeros(shape=(nshots, Int(ceil(pyconvert(Int, nobs) / 8))), dtype=np.uint8)
    counter = 0
    for seeds in batch_seeds_ranges
        for (idx, seed) in enumerate(seeds)
            sampler = circ_stim.compile_detector_sampler(seed=seed)
            sampler.sample(num_samples, separate_observables=true, bit_packed=true,
                           dets_out=det_shots_np[counter:counter+num_samples-1],
                           obs_out=obs_shots_np[counter:counter+num_samples-1])
            PythonCall.Core.pydel!(sampler)
            counter += num_samples
        end
    end
    return det_shots_np, obs_shots_np, ndets, nobs
end
```
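Both versions rely on NumPy slices being views: `dets_out=det_shots_np[a:b]` hands stim a window into the preallocated buffer, so the samples land in place rather than in a fresh copy. The same view semantics can be sketched with the stdlib (`memoryview` here is just a stand-in for a NumPy array slice, since the real arrays live on the Python side):

```python
# View semantics: writing through a slice of a view mutates the underlying
# buffer, which is what lets dets_out/obs_out fill the preallocated arrays
# in place instead of writing into a temporary copy.
buf = bytearray(8)           # preallocated "shots" buffer, all zeros
view = memoryview(buf)[2:4]  # analogous to det_shots_np[counter:counter+num_samples]
view[0] = 0xFF               # analogous to the sampler writing into dets_out
assert bytes(buf) == b"\x00\x00\xff\x00\x00\x00\x00\x00"
print(buf.hex())  # -> 0000ff0000000000
```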

```julia
julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples(; circ_string);
  0.130245 seconds (156 allocations: 3.430 KiB)

julia> @time det_shots_np, obs_shots_np, _, nobs = generate_samples_wrapper(circ_string);
  0.140203 seconds (8.38 k allocations: 178.477 KiB)
```

I do not understand why the second implementation has such a higher allocation count. Any insight would be appreciated.
I would prefer to use the first implementation, but unfortunately I run into errors when I try to run it on multiple workers.

Put your code in triple backtick blocks, like this
```
```
Not only would it be formatted readably, you also wouldn't accidentally tag other users with `@`. Incidentally, isolated expressions like that go in single backtick quotes like ` `, and you can read them here because I'm escaping the special characters with `\`.

From what I can eyeball at the moment, you're paying overhead for the many PythonCall wrapper calls crossing the language barrier; by comparison, the `@pyexec` code crosses from Julia to Python and back only once. There may also be extra overhead from some wrappers not being type-stable across the language barrier, which you can check with `@code_warntype`. Also note that Julia cannot profile allocations that occur inside Python itself, so depending on how large the Python-side allocations are, this extra overhead may not even matter.


Thanks for the formatting advice! Very helpful!
This is what I get when I profile allocations for the Julia-wrapper version:

If I understand correctly, both functions might be making a similar number of heap allocations, but the allocations made on the Python side are not counted, which is why the mostly-Python implementation shows a lower count?
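One way to check the Python side directly is a sketch like the following with the stdlib `tracemalloc` (the 1 MB `bytearray` is a stand-in for the `np.zeros` buffers): it traces Python-heap allocations, which Julia's `@time` never sees.

```python
# Measure Python-heap allocations directly; Julia's @time only counts
# Julia-GC allocations, so traffic like this never shows up in its report.
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
data = bytearray(1_000_000)  # stand-in for the np.zeros(...) shot buffers
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

allocated = after - before
assert allocated >= 1_000_000  # the 1 MB buffer lives on the Python heap
print(f"python-side allocation: {allocated} bytes")
```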

I also tried `@code_warntype`, and for now it does not show an issue:

```julia
julia> @code_warntype generate_samples(;circ_string)
MethodInstance for Core.kwcall(::@NamedTuple{circ_string::String}, ::typeof(generate_samples))
  from kwcall(::NamedTuple, ::typeof(generate_samples)) @ Main /pscratch/sd/r/rismail/mwe_stor/mwe_soft/scripts/play.jl:34
Arguments
  _::Core.Const(Core.kwcall)
  @_2::@NamedTuple{circ_string::String}
  @_3::Core.Const(Main.generate_samples)
Locals
  @_4::Union{Int64, String, Vector{UnitRange{Int64}}}
  circ_string::String
  num_samples::Int64
  batch_seeds_ranges::Vector{UnitRange{Int64}}
Body::NTuple{4, Py}
1 ──       Core.NewvarNode(:(@_4))
│    %2  = Core.isdefined(@_2, :circ_string)::Core.Const(true)
└───       goto #3 if not %2
2 ──       (@_4 = Core.getfield(@_2, :circ_string))
└───       goto #4
3 ──       Core.Const(:(Core.UndefKeywordError(:circ_string)))
└───       Core.Const(:(@_4 = Core.throw(%6)))
4 ┄─ %8  = @_4::String
│          (circ_string = %8)
│    %10 = Core.isdefined(@_2, :num_samples)::Core.Const(false)
└───       goto #6 if not %10
5 ──       Core.Const(:(@_4 = Core.getfield(@_2, :num_samples)))
└───       Core.Const(:(goto %15))
6 ┄─       (@_4 = 1)
│    %15 = @_4::Core.Const(1)
│          (num_samples = %15)
│    %17 = Core.isdefined(@_2, :batch_seeds_ranges)::Core.Const(false)
└───       goto #8 if not %17
7 ──       Core.Const(:(@_4 = Core.getfield(@_2, :batch_seeds_ranges)))
└───       Core.Const(:(goto %24))
8 ┄─ %21 = (1:100)::Core.Const(1:100)
│    %22 = (150:200)::Core.Const(150:200)
│          (@_4 = Base.vect(%21, %22))
│    %24 = @_4::Vector{UnitRange{Int64}}
│          (batch_seeds_ranges = %24)
│    %26 = Base.keys(@_2)::Core.Const((:circ_string,))
│    %27 = (:circ_string, :num_samples, :batch_seeds_ranges)::Core.Const((:circ_string, :num_samples, :batch_seeds_ranges))
│    %28 = Base.diff_names(%26, %27)::Core.Const(())
│    %29 = Base.isempty(%28)::Core.Const(true)
└───       goto #10 if not %29
9 ──       goto #11
10 ─       Core.Const(:(Base.kwerr(@_2, @_3)))
11 ┄ %33 = Main.:(var"#generate_samples#23")::Core.Const(Main.var"#generate_samples#23")
│    %34 = circ_string::String
│    %35 = num_samples::Core.Const(1)
│    %36 = batch_seeds_ranges::Vector{UnitRange{Int64}}
│    %37 = (%33)(%34, %35, %36, @_3)::NTuple{4, Py}
└───       return %37

julia> @code_warntype generate_samples_wrapper(circ_string)
MethodInstance for generate_samples_wrapper(::String)
  from generate_samples_wrapper(circ_string) @ Main REPL[5]:1
Arguments
  #self#::Core.Const(Main.generate_samples_wrapper)
  circ_string::String
Body::NTuple{4, Py}
1 ─ %1 = (1:100)::Core.Const(1:100)
│   %2 = (150:200)::Core.Const(150:200)
│   %3 = Base.vect(%1, %2)::Vector{UnitRange{Int64}}
│   %4 = (#self#)(circ_string, 1, %3)::NTuple{4, Py}
└──      return %4
```

Right.

Your `@code_warntype` printouts actually don't show the bulk of what the methods do; they only show the handling of `generate_samples`'s keyword arguments and `generate_samples_wrapper`'s default positional arguments before calling the methods that do the real work.

`@code_warntype` also generally does not recursively infer the bodies of other methods being called. That doesn't matter for `generate_samples`, because it won't see anything going on inside the Python code string anyway (which is a bit of a relief, since manually tracking down the underlying method of a keyworded method is a pain; better to use Cthulhu.jl), but you'll want to check `@code_warntype generate_samples_wrapper(circ_string, 1, [1:100, 150:200])`. Also note that the `Py` wrapper may be technically type-stable and look fine, but that's deceptive: it just affirms that PythonCall is not distinguishing the Python-side types for the Julia side. I'm not entirely sure what all of `det_shots_np`, `obs_shots_np`, `ndets`, and `nobs` can be, but the first two could be `Matrix{UInt8}`, or whatever the `PyArray` wrapper is.

If I were you, I wouldn't worry much about the allocations: the two snippets of code run just as fast, so why care?

Reducing allocations is usually only relevant when micro-optimising tight loops where the allocations become a bottleneck, whereas in this case they add a pretty small overhead (≈10%, maybe).


My main motivation is that the second implementation, using Julia wrappers, produces memory leaks that lead to out-of-memory (OOM) events. I'm not entirely sure why this occurs, but I suspect it's due to allocations that are managed on both the Julia and Python (and C++, since stim is implemented in C++) sides without being properly cleared. This is why I was experimenting with reducing such allocations via the `@pyexec` code, which did avoid the memory leak.
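If the leak is on the Python side, one plausible mechanism (a hedged guess, not something confirmed in this thread) is objects lingering until a garbage-collector pass happens to run, which is also why explicit cleanup like `pydel!` can help. The Python half of that behavior can be sketched with the stdlib `gc` and `weakref` modules (`Sampler` below is a hypothetical stand-in, not stim's class):

```python
# Objects caught in a reference cycle are not freed by refcounting alone;
# they survive until an explicit (or periodic) collector pass runs.
import gc
import weakref

class Sampler:  # hypothetical stand-in for a Python-side object
    pass

a = Sampler()
a.self_ref = a           # reference cycle keeps the object alive...
probe = weakref.ref(a)
del a                    # ...even after the last external reference is dropped
assert probe() is not None  # still alive: refcounting cannot free a cycle

gc.collect()             # Python-side analogue of forcing cleanup
assert probe() is None   # now reclaimed
print("cycle reclaimed")
```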

On the Julia, Python, or C++ side?