I am using OpenMM, a molecular dynamics package, through PyCall.
Now I want to combine my Julia integrator with the OpenMM force computation (on GPU).
To this end I tried using DLPack.jl together with openmm-dlext (many thanks to @PabloZubieta!), which, if I understand correctly, passes the GPU memory addresses from OpenMM to Julia, so that I can access OpenMM's internal positions/forces through Julia's CuArrays.
Very cool
However, I am now stumbling into performance problems I do not understand.
In order to compute the forces (compforce) I launch OpenMM's force calculation via a PyCall to OpenMM's context.getState. I then want to read out the forces, which requires some rescaling on the GPU (readforce).
Both operations are fast on their own (@benchmark shows about 40 µs and 10 µs, respectively). Run sequentially, however, they are very slow (>300 µs), much slower than just copying the memory from the GPU to the CPU from within OpenMM/Python (~100 µs).
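For reference, the GPU-to-CPU baseline is essentially the standard getState route; a rough sketch (using the DLForce7 wrapper shown in the code below; readforce_cpu is just an illustrative name):

```julia
# Sketch of the CPU baseline: let OpenMM compute the forces and copy them to
# host memory through its regular Python API (called here via PyCall).
function readforce_cpu(dl::DLForce7)
    state = dl.pysim.context.getState(getForces=true)
    return state.getForces(asNumpy=true)  # NumPy array on the CPU
end
```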
For the combined call, the profiler shows all of the time being spent in the pycall.
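One thing I am not sure about: CUDA.jl launches the broadcast in readforce asynchronously, so @benchmark on it alone might only be measuring the launch overhead rather than the kernel itself. As a sanity check one could time everything behind an explicit synchronization (just a sketch; CUDA.@sync waits on CUDA.jl's stream, so it may not cover work OpenMM submits on its own stream):

```julia
using CUDA, BenchmarkTools

# Time both pieces behind an explicit synchronization so asynchronous kernel
# launches cannot hide GPU work in the individual measurements.
# CUDA.device_synchronize() would be an even stronger barrier if needed.
@benchmark CUDA.@sync OpenMM.readforce($dl)
@benchmark CUDA.@sync (OpenMM.compforce($dl); OpenMM.readforce($dl))
```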
For context, here is (the DLPack part of) my implementation:
Code
using PyCall
using CUDA
import DLPack
struct DLForce7
    pysim::PyObject
    positions::CuArray{Float32, 2}
    forces::CuArray{Int, 2}   # yes, that's how the forces are passed
    forceout::CuArray{Float32, 2}
    scaling::Float32
end
function add_dlpack2(pysim)
    dlext = pyimport("openmm.dlext")
    cupy = pyimport("cupy")
    dlforce = dlext.Force()
    pysim.system.addForce(dlforce)
    dlview = dlforce.view(pysim.context)
    positions = DLPack.from_dlpack(cupy.from_dlpack(pycall(dlext.positions, PyObject, dlview), copy=false))
    forces = DLPack.from_dlpack(cupy.from_dlpack(pycall(dlext.forces, PyObject, dlview), copy=false))
    f2 = force(sim, coords(sim))
    scaling = Float32(f2[1] / collect(forces)[1])
    DLForce7(pysim, positions, forces, forces .* scaling, scaling)
end
coords(dl::DLForce7) = dl.positions
setcoords(dl::DLForce7, x) = (dl.positions .= x)
compforce(dl::DLForce7) = pycall(dl.pysim.context.getState, Nothing, getForces=true)
readforce(dl::DLForce7) = (dl.forceout .= dl.forces .* dl.scaling)
force(dl::DLForce7) = (compforce(dl); readforce(dl))
force(dl::DLForce7, x) = (setcoords(dl, x); force(dl))
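For completeness, usage looks roughly like this (a sketch; pysim is the PyCall-wrapped OpenMM simulation set up elsewhere):

```julia
dl = add_dlpack2(pysim)   # register the dlext force and wrap OpenMM's GPU buffers
x  = coords(dl)           # CuArray view of OpenMM's positions (no copy)
f  = force(dl, x)         # write positions, run compforce, rescale via readforce
```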
Benchmarks
julia> @benchmark OpenMM.readforce(dl)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  10.606 μs … 142.563 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.056 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.592 μs ±   2.548 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 (histogram omitted)
  10.6 μs        Histogram: log(frequency) by time        16.3 μs <

 Memory estimate: 2.55 KiB, allocs estimate: 79.
julia> @benchmark OpenMM.compforce(dl)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  29.945 μs … 2.535 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.637 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.442 μs ± 25.973 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 (histogram omitted)
  29.9 μs        Histogram: frequency by time        42.5 μs <

 Memory estimate: 896 bytes, allocs estimate: 16.
julia> @benchmark (OpenMM.compforce(dl); OpenMM.readforce(dl))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  293.398 μs … 3.462 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     323.793 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   323.201 μs ± 33.031 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

 (histogram omitted)
  293 μs        Histogram: frequency by time        338 μs <

 Memory estimate: 3.42 KiB, allocs estimate: 95.
Am I overlooking something?
Best, Alex