Mixing CUDA.jl with external GPU compute (OpenMM / DLPack.jl)

axsk · October 9, 2025, 2:45pm

I am using OpenMM, a molecular dynamics package, through PyCall.
Now I want to combine my Julia integrator with the OpenMM force computation (on GPU).

To this end I tried using DLPack.jl together with openmm-dlext (many thanks to @PabloZubieta!), which, if I understand correctly, passes the GPU memory addresses from OpenMM to Julia, so that I can access OpenMM’s internal positions and forces through Julia’s CuArrays.

So far, very cool.

However, I am now running into performance problems I do not understand.
To compute the forces (compforce), I trigger OpenMM’s force calculation via a PyCall to OpenMM’s context.getState.

I then want to read out the forces, which requires some rescaling on the GPU (readforce).

Individually, both operations are fast (@benchmark shows about 30 μs and 11 μs, respectively).
Run back to back, however, they are very slow (>300 μs), much slower than just copying the memory from GPU to CPU from within OpenMM/Python (~100 μs).

For the combined call, the profiler shows all of the time being spent in the pycall.

For context, here is (the DLPack part of) my implementation.

Code

using PyCall
using CUDA  # provides CuArray
import DLPack

struct DLForce7  # unused type parameters dropped so the default constructor can be called
    pysim::PyObject
    positions::CuArray{Float32, 2}
    forces::CuArray{Int, 2}  # yes, that's how the forces are passed
    forceout::CuArray{Float32, 2}
    scaling::Float32
end

function add_dlpack2(pysim)
    dlext = pyimport("openmm.dlext")
    cupy = pyimport("cupy")
    dlforce = dlext.Force()
    pysim.system.addForce(dlforce)
    dlview = dlforce.view(pysim.context)
    positions = DLPack.from_dlpack(cupy.from_dlpack(pycall(dlext.positions, PyObject, dlview), copy=false))
    forces = DLPack.from_dlpack(cupy.from_dlpack(pycall(dlext.forces, PyObject, dlview), copy=false))

    f2 = force(sim, coords(sim))
    scaling = Float32(f2[1] / collect(forces)[1])
    
    DLForce7(pysim, positions, forces, forces .* scaling, scaling)
end

coords(dl::DLForce7) = dl.positions
setcoords(dl::DLForce7, x) = (dl.positions  .= x)

compforce(dl::DLForce7) = pycall(dl.pysim.context.getState, Nothing, getForces=true)
readforce(dl::DLForce7) = (dl.forceout .= dl.forces .* dl.scaling)  # overwrite with the rescaled forces

force(dl::DLForce7) = (compforce(dl); readforce(dl))
force(dl::DLForce7, x) = (setcoords(dl, x); force(dl))
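
The rescaling done in readforce can be illustrated on the CPU. This is only a minimal sketch, assuming that the integer buffer holds 64-bit fixed-point values with a scale of 2^32, which is how OpenMM's CUDA platform accumulates forces as far as I know. The implementation above does not rely on this and measures the factor against a reference force instead. The names FIXED_SCALE, reference, fixed and forces are made up for the example.

```julia
# Assumption: forces are stored as Int64 fixed point, i.e. round(f * 2^32).
const FIXED_SCALE = 2.0^32

# Hypothetical force components, encoded the way the integer buffer would hold them
reference = [1.5, -0.25, 100.0]
fixed = round.(Int64, reference .* FIXED_SCALE)

# Same broadcast shape as readforce: Int64 .* Float32 promotes to Float32
scaling = Float32(1 / FIXED_SCALE)
forces = fixed .* scaling

@show forces
```

In the real code the factor may also have to absorb a sign or unit convention, which is one reason to measure it rather than hard-code it.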

Benchmarks

julia> @benchmark OpenMM.readforce(dl)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.606 μs … 142.563 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.056 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.592 μs ±   2.548 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆▇██▇▆▅▄▄▃▁▁              ▁▂▂▃▃▂▃▂▁▁                        ▂
  ██████████████▇█▅▇▇▅▆▅▅▄▅▇███████████████▇▇▄▆▅▆▆▆▆▅▆▆▆▅▆▆▅▅▅ █
  10.6 μs       Histogram: log(frequency) by time      16.3 μs <

 Memory estimate: 2.55 KiB, allocs estimate: 79.

julia> @benchmark OpenMM.compforce(dl)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  29.945 μs …  2.535 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.637 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.442 μs ± 25.973 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▄▇█▇█▆▄                                                  
  ▂▃▅█████████▆▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
  29.9 μs         Histogram: frequency by time        42.5 μs <

 Memory estimate: 896 bytes, allocs estimate: 16.

julia> @benchmark (OpenMM.compforce(dl); OpenMM.readforce(dl))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  293.398 μs …  3.462 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     323.793 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.201 μs ± 33.031 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                  ▅▅▁      ▃█▄                  
  ▂▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▂▁▂▁▂▂▂▁▂▂▂▂▂▂▆███▆▄▃▃▃▄███▇▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  293 μs          Histogram: frequency by time          338 μs <

 Memory estimate: 3.42 KiB, allocs estimate: 95.
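
One thing that might be worth checking with these numbers: CUDA.jl launches kernels asynchronously, so a benchmark of a broadcast without synchronization may only measure the time to queue the kernel, not to run it. If that is the case here, the cost could surface in the next call that has to wait for the GPU, which might be the pycall. The sketch below compares both ways of timing a rescaling broadcast. It uses stand-in arrays with a hypothetical size rather than the OpenMM buffers, and rescale! is a made-up helper mirroring readforce.

```julia
using CUDA, BenchmarkTools

# Stand-in data with a hypothetical size, not the buffers shared via DLPack
forces   = CuArray(rand(Int, 3, 10_000))
forceout = CUDA.zeros(Float32, 3, 10_000)
scaling  = Float32(2.0^(-32))

# Made-up helper with the same broadcast pattern as readforce
rescale!(out, f, s) = (out .= f .* s)

# Without synchronization: returns as soon as the kernel has been queued
@btime rescale!($forceout, $forces, $scaling)

# With synchronization: waits until the kernel has actually finished
@btime CUDA.@sync rescale!($forceout, $forces, $scaling)
```

If the two timings differ a lot, wrapping readforce in CUDA.@sync in the benchmarks above should give a fairer per-call number to compare against the combined call.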

Am I overlooking something?

Best, Alex
