Hello everyone!
I’m trying to make a Python wrapper, using PythonCall.jl, for a Julia package that uses CUDA.jl. The package draws random integers from a UnitRange with the CPU’s rand into a Matrix and then passes them to the GPU (I didn’t find an implementation that generates such a CuArray directly on the GPU). Unfortunately, if I call this function repeatedly, I get a [...] segmentation fault (core dumped) python [...].
I managed to get an MWE with a script like this:
import numpy as np
from juliacall import Main as jl
jl.seval("using CUDA")
jl.seval(
    """
    function bootstrap_indices(n_samples::Integer, n_bootstraps::Integer)
        ar = Array{Int32}(rand(1:n_samples, n_samples, n_bootstraps))
        ar_d = CuArray{Int32}(undef, n_samples, n_bootstraps)
        copyto!(ar_d, ar)
        CUDA.device_synchronize()
        return ar_d
    end
    """
)
def bootstrap_indices(n_samples=10000, n_bootstraps=100):
    ar_d = jl.bootstrap_indices(n_samples, n_bootstraps)
    return ar_d
n_reps = 1000
for _ in range(n_reps):
    ar_d = bootstrap_indices()
I don’t have to wait too long before that segfaults. Interestingly, if I use the functions directly from juliacall’s Main like this:
import numpy as np
from juliacall import Main as jl
jl.seval("using CUDA")
def bootstrap_indices(n_samples=10000, n_bootstraps=100):
    ar = np.random.randint(
        1, high=n_samples + 1, size=(n_samples, n_bootstraps), dtype=np.int32
    )
    ar_d = jl.CuArray(ar)
    jl.CUDA.device_synchronize()
    return ar_d
n_reps = 1000
for _ in range(n_reps):
    ar_d = bootstrap_indices()
there’s no segfault (at least not that I’ve seen in the time that I let it run).
I also tried downgrading the CUDA.jl and Julia versions as done here, but that seems to have made the problem worse (it segfaults faster).
Maybe by implementing a rand function that initializes that CuArray of random integers in the range directly on the GPU I could work around the problem, but at the moment I’m not sure how I would implement that. Also, passing CPU arrays to the GPU seems like good functionality that I may want to use in Julia and wrap with PythonCall.jl, so it’s a shame it segfaults (although I may be doing something wrong in the example above).
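In the meantime, one workaround I’m considering (just a sketch, untested beyond the MWE, and not a real fix) is to keep only the CPU generation in Julia and do the upload from the Python side with jl.CuArray, mirroring the variant below that hasn’t segfaulted for me. The name bootstrap_indices_cpu is made up:

```julia
# Sketch: Julia only builds the CPU matrix; the GPU upload would be
# done from Python with jl.CuArray(...), like in the working variant.
function bootstrap_indices_cpu(n_samples::Integer, n_bootstraps::Integer)
    return Array{Int32}(rand(1:n_samples, n_samples, n_bootstraps))
end
```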
I don’t think this is the only place where my Julia package, wrapped with PythonCall.jl, segfaults when using CUDA functionality, but it’s the first one I managed to reproduce minimally.
I would appreciate it very much if someone has some insights about how to avoid these segfaults or guidance to implement a sort of CUDA.rand(1:n_samples, n_samples, n_bootstraps).
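For reference, the closest I’ve come to a GPU-side rand(1:n, dims...) is the sketch below. It assumes that CUDA.rand(Float32, dims...) returns uniform samples in [0, 1), and it is not bit-for-bit equivalent to the CPU rand(1:n, ...) (the mapping goes through Float32, so it would lose precision for n above 2^24 or so). I haven’t verified it against my actual workload:

```julia
using CUDA

# Sketch of a GPU-side replacement for rand(1:n_samples, n_samples, n_bootstraps).
# Assumption: CUDA.rand(Float32, dims...) yields uniform samples in [0, 1).
function gpu_bootstrap_indices(n_samples::Integer, n_bootstraps::Integer)
    u = CUDA.rand(Float32, n_samples, n_bootstraps)
    # floor(u * n) + 1 maps [0, 1) onto 1:n; the clamp guards against
    # u * n rounding up to exactly n in Float32
    return clamp.(Int32.(floor.(u .* Float32(n_samples))) .+ Int32(1),
                  Int32(1), Int32(n_samples))
end
```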
For reproducibility, the last time I ran this I used:
Julia: version 1.12.1 (but it also happened with 1.11.7)
PythonCall.jl: version 0.9.28
CUDA.jl: version 5.9.2