Wrapping CUDA.jl with juliacall

Hello everyone!

I’m trying to make a Python wrapper with PythonCall.jl for a Julia package that uses CUDA.jl. The package uses the CPU’s Random.rand to fill a Matrix with random integers drawn from a UnitRange and then copies it to the GPU (I didn’t find an implementation that generates such a CuArray directly on the GPU). Unfortunately, if I call this function repeatedly, I get a [...] segmentation fault (core dumped) python [...].

I managed to get an MWE with a script like this:

import numpy as np
from juliacall import Main as jl

jl.seval("using CUDA")

jl.seval("""
function bootstrap_indices(n_samples::Integer, n_bootstraps::Integer)
    ar = Array{Int32}(rand(1:n_samples, n_samples, n_bootstraps))
    ar_d = CuArray{Int32}(undef, n_samples, n_bootstraps)
    copyto!(ar_d, ar)
    CUDA.device_synchronize()
    return ar_d
end
""")

def bootstrap_indices(n_samples=10000, n_bootstraps=100):
    ar_d = jl.bootstrap_indices(n_samples, n_bootstraps)

    return ar_d


n_reps = 1000
for _ in range(n_reps):
    ar_d = bootstrap_indices()

I don’t have to wait too long before it segfaults. Interestingly, if I use the functions directly from juliacall’s Main like this:

import numpy as np
from juliacall import Main as jl

jl.seval("using CUDA")

def bootstrap_indices(n_samples=10000, n_bootstraps=100):
    ar = np.random.randint(
        1, high=n_samples + 1, size=(n_samples, n_bootstraps), dtype=np.int32
    )
    ar_d = jl.CuArray(ar)
    jl.CUDA.device_synchronize()

    return ar_d

n_reps = 1000
for _ in range(n_reps):
    ar_d = bootstrap_indices()

there’s no segfault (at least not that I’ve seen in the time that I let it run).

I also tried downgrading the CUDA.jl and Julia versions as done here, but that seems to have made the problem worse (it segfaults faster).

Maybe by implementing a rand function that initializes that CuArray of random integers in the range directly on the GPU I could work around the problem, but at the moment I’m not sure how I would implement that. Also, copying CPU arrays to the GPU seems like useful functionality that I may want to use in Julia and wrap with PythonCall.jl, so it’s a shame it segfaults (although I may be doing something wrong in the example above).

I don’t think this is the only place where my Julia package, wrapped with PythonCall.jl, segfaults when using CUDA implementations, but it’s the first one I managed to reproduce minimally.

I would very much appreciate any insights on how to avoid these segfaults, or guidance on implementing a sort of CUDA.rand(1:n_samples, n_samples, n_bootstraps).

For reproducibility, the last time I ran this I used:
Julia: version 1.12.1 (but it also happened with 1.11.7)
PythonCall.jl: version 0.9.28
CUDA.jl: version 5.9.2

A straightforward approach to get your ar_d directly on the device would be

ar_d = mod1.(CUDA.rand(Int32, n_samples, n_bootstraps), n_samples)

or

ar_d = CUDA.rand(Int32, n_samples, n_bootstraps); ar_d .= mod1.(ar_d, n_samples)

if you want to save on an allocation.
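To show how this could plug into the Python MWE above, here is a minimal sketch via juliacall (the function name bootstrap_indices_gpu is just illustrative; the body is only the mod1 suggestion wrapped in a function):

from juliacall import Main as jl

jl.seval("using CUDA")

# Generate the indices directly on the device: fill the buffer with random
# Int32 values, then map every value into 1:n_samples with mod1 (in place,
# so only one device allocation is needed).
jl.seval("""
function bootstrap_indices_gpu(n_samples::Integer, n_bootstraps::Integer)
    ar_d = CUDA.rand(Int32, n_samples, n_bootstraps)
    ar_d .= mod1.(ar_d, n_samples)
    return ar_d
end
""")

ar_d = jl.bootstrap_indices_gpu(10000, 100)

This skips both the CPU-side rand and the host-to-device copy of the original bootstrap_indices.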

Not a particularly deep insight, but it might be useful to note that I get segfaults on my Linux machine, while everything works fine on a different Windows PC.

Versioninfo Linux
julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, cascadelake)
Threads: 20 default, 0 interactive, 10 GC (on 20 virtual cores)
Environment:
  LD_LIBRARY_PATH = usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA toolchain: 
- runtime 12.9, artifact installation
- driver 560.35.3 for 12.6
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 12.9.1)
- NVML: 12.0.0+560.35.3

Julia packages: 
- CUDA: 5.9.2
- CUDA_Driver_jll: 13.0.2+0
- CUDA_Compiler_jll: 0.3.0+0
- CUDA_Runtime_jll: 0.19.2+0

Toolchain:
- Julia: 1.11.3
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 3080 Ti (sm_86, 10.881 GiB / 12.000 GiB available)

(Not sure where the LD_LIBRARY_PATH comes from, but CUDA.jl seems to ignore it anyway.)

The version of Python is 3.8.3, with juliacall 0.9.26. It’s worth checking if your MWE still segfaults on your machine when using an older version of Python.

Versioninfo Windows
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b73 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto

julia> CUDA.versioninfo()
CUDA runtime 12.9, artifact installation
CUDA driver 12.9
NVIDIA driver 577.0.0

CUDA libraries:
- CUBLAS: 12.9.1
- CURAND: 10.3.10
- CUFFT: 11.4.1
- CUSOLVER: 11.7.5
- CUSPARSE: 12.5.10
- CUPTI: 2025.2.1 (API 28.0.0)
- NVML: 12.0.0+577.0

Julia packages:
- CUDA: 5.8.2
- CUDA_Driver_jll: 0.13.1+0
- CUDA_Runtime_jll: 0.17.1+0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

Preferences:
- CUDA_Runtime_jll.local: false

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 6.289 GiB / 8.000 GiB available)

The version of Python is 3.10.14, with juliacall 0.9.28.

Thanks for the idea! Using mod1 does the job and it’s much faster. I hadn’t thought of mapping larger integers back into the range, and I didn’t know this function existed.
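For anyone else who hadn’t seen mod1 before: it is the 1-based counterpart of mod, so it maps any integer into 1:n. A tiny, purely illustrative check through juliacall:

from juliacall import Main as jl

# mod1 returns values in 1:n, whereas mod returns values in 0:n-1.
print(jl.seval("mod1(10, 10)"))  # 10
print(jl.seval("mod1(11, 10)"))  # 1
print(jl.seval("mod1(-3, 10)"))  # 7 (negative inputs also land in 1:n)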

After you mentioned that, I tried the MWE with different combinations of Python (3.8, 3.10 and 3.12) and Julia (1.11, 1.12) versions, and I found that Python 3.10 with Julia 1.11, the latest CUDA.jl, and the latest PythonCall.jl (very similar to what you have on your Windows PC) is much more stable than any other combination. It still segfaults from time to time if I run the MWE with n_reps = 100000 (it got through that longer loop without a segfault 7 times out of 10). However, if I run it with GPU contention (I ran the MWE twice simultaneously; I’m not sure whether other kinds of tasks cause the same behavior), it segfaults almost immediately.

I did these tests with two different GPUs: one of them also drives the monitors and the other is completely idle before the tests. The one connected to the monitors always segfaults when running the MWE, regardless of the Python and Julia versions. Maybe this is related to the contention, or maybe it’s something else.
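For reference, one way to make sure a run only sees a specific GPU (e.g. the idle one) is to restrict the visible devices before juliacall initializes CUDA; CUDA_VISIBLE_DEVICES is the standard driver-level variable, CUDA.device! is CUDA.jl’s device selector, and the indices below are of course machine-specific:

import os

# Expose only the second (idle) GPU to this process; this must happen before
# CUDA is initialized. The index is machine-specific.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from juliacall import Main as jl

jl.seval("using CUDA")
# Alternatively, pick a device from the Julia side after loading CUDA
# (the ordinal counts among the visible devices):
# jl.seval("CUDA.device!(0)")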

Unfortunately, I don’t know how to dig deeper into what happens with PythonCall.jl during the MWE (and maybe I wouldn’t understand it even if I did), but I’ll keep trying and testing different things.

On my Windows PC I can run at least 4 instances at the same time, both with Python 3.8.3 (and Julia 1.11.5) and with Python 3.12.0 (and the automatically installed Julia 1.11.7), while connected to a display. Maybe it’s just down to better driver support on Windows (assuming you’re on Linux or macOS)?

Ah, thank you for testing that! I have been doing all my tests on Linux, so I guess it’s as you say and Windows just has less problematic drivers than Linux, unfortunately. I will still check whether I can modify the Julia code of my package to make it more stable when wrapped with Python on Linux, since that’s probably the platform where I expect the most use, and if I find anything interesting I will make sure to post it here. Thank you very much for your help!
