The PhaseSpacePoint object is essentially a collection of SFourMomentum objects, stored in tuples, plus some singletons for meta information. SFourMomentum itself is an object with 4 simple data members, but it behaves like a 4-element SVector. The important point here is that it is entirely isbits.
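For illustration, a minimal sketch of the layout (field and type names are simplified here, not the actual definitions):

struct SFourMomentum
    E::Float64
    px::Float64
    py::Float64
    pz::Float64
end

struct PhaseSpacePoint{I<:Tuple, O<:Tuple}
    in_momenta::I    # one SFourMomentum per incoming particle
    out_momenta::O   # one SFourMomentum per outgoing particle
end

isbitstype(SFourMomentum)  # true: plain data, so it can be stored directly in a CuVector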
The example I gave should match the original kernel fairly closely, in that it pushes objects to the GPU (allocated in a Vector/CuVector, just like my original example) which themselves contain SVector-like objects, in the example case BiSpinor/AdjointBiSpinor/DiracMatrix. I didn’t push SVectors of input to the GPU in either case.
The problem isn’t evaluating the kernel I originally posted once; it has to be evaluated millions or billions of times to be useful (in the end it will be used for many-dimensional Monte Carlo integration). So currently I have one big monolithic kernel that evaluates f(psp::PhaseSpacePoint) = y::Float, and I call it on some large [Cu]Vector of PhaseSpacePoints. As we’ve seen, f can get arbitrarily big depending on the scattering process I’m considering, so I’m looking for ways to break the kernel into smaller parts and parallelize inside f, not just across it.
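Schematically, the current setup is just a broadcast over the vector of points (a sketch; f and the Float32 data are stand-ins for the real matrix-element code and PhaseSpacePoints):

using CUDA

f(psp) = psp^2 + 1f0                  # stand-in for the monolithic matrix-element kernel
psps = CUDA.rand(Float32, 1_000_000)  # stand-in for the CuVector of PhaseSpacePoints
results = f.(psps)                    # one GPU thread per point; no parallelism inside f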
I’m not sure if this is the way to go, since it would introduce communication between threads (and a lot more code complexity), which is usually something you want to avoid on GPUs.
I suspect that all of the local loads/stores are due to allocations that haven’t been optimized away. In your example, the product of a BiSpinor and an AdjointBiSpinor is a DiracMatrix: the function _mul returns a new DiracMatrix(BS * transpose(aBS)).
The (roughly) CPU-equivalent code for your example is:
function testmul(in1, in2, out, n)
    @inbounds A = in1[n]
    @inbounds B = in2[n]
    @inbounds out[n] = A * B
end
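For completeness, the GPU version would be written and launched roughly like this (a sketch; Float32 arrays stand in for the CuArrays of BiSpinor/AdjointBiSpinor/DiracMatrix):

using CUDA

function testmul_kernel(in1, in2, out)
    # one thread per element of the output array
    n = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if n <= length(out)
        @inbounds out[n] = in1[n] * in2[n]
    end
    return nothing
end

N = 1024
in1, in2 = CUDA.rand(Float32, N), CUDA.rand(Float32, N)
out = CUDA.zeros(Float32, N)
@cuda threads=256 blocks=cld(N, 256) testmul_kernel(in1, in2, out)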
If one uses Cthulhu.jl with inline_cost=true, one sees that the inline cost for _mul is 128. This exceeds the standard cost threshold of 100, so the function call isn’t automatically inlined; see Inlining 101.
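To reproduce, something along these lines (a sketch with plain Float64 arrays; the cost of 128 shows up when descending with the actual spinor-typed arrays, and the inline_cost keyword can also be toggled with i in Cthulhu’s interactive menu):

using Cthulhu

in1, in2, out = rand(3), rand(3), zeros(3)  # any element type works for the demo
@descend inline_cost=true testmul(in1, in2, out, 1)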
I believe that when you use always_inline=true, or manually annotate a method with @inline, the inlining cost threshold is increased 20x, i.e. inline_cost_threshold becomes 2000 instead of 100 (for each method). So your simple case results in inlining, but that may not hold for larger cases, especially if there are hidden dynamic dispatches. Certain statements are hit with an inline_nonleaf_penalty=1000, for example. Cf. julia/base/compiler/optimize.jl at a0ff94433010eeb5d1f3297fb4fffdf4a5ef5f7e · JuliaLang/julia · GitHub
So I’m wondering if a better strategy for you might be to manually annotate more of your methods with @inline. For example, you can avoid requiring always_inline=true in your simple example if you annotate the _mul functions within multiplication.jl, as in the sketch below. You can then slowly increase the complexity of your models to discover where inlining starts to fail again, and take appropriate action.
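Concretely, something along these lines (a sketch; SVectors stand in for the actual BiSpinor/AdjointBiSpinor types in multiplication.jl):

using StaticArrays

# @inline raises the cost threshold for this method, so the matrix temporary
# from the outer product can be optimized away at the call site
@inline _mul(BS::SVector{4,ComplexF64}, aBS::SVector{4,ComplexF64}) = BS * transpose(aBS)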
For me, this error often happens when temporary variables are not released immediately, especially when carrying out FFTs. I work around it by manually releasing the GPU memory with CUDA.unsafe_free!:
using CUDA

a = cu(rand(512, 512, 512))
# a = nothing only marks the array for GC; the memory is not freed immediately
CUDA.unsafe_free!(a)  # GPU memory is released immediately; never use a again before redefining it
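In the FFT case, the same pattern applied to the input as soon as it has been consumed (a sketch, assuming CUDA.CUFFT’s fft for CuArrays):

using CUDA, CUDA.CUFFT

a = CUDA.rand(ComplexF32, 512, 512, 512)
b = fft(a)            # allocates a new GPU array for the result
CUDA.unsafe_free!(a)  # free the input right away instead of waiting for GC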
Furthermore, I don’t see any local loads/stores for that kernel when I use @device_code_ptx without always_inline=true, as long as I @inline the _mul methods. The same holds for the SASS code.
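For anyone wanting to reproduce: with the testmul_kernel sketch from earlier in the thread, the device code can be dumped like this (@device_code_sass is the SASS counterpart of @device_code_ptx):

in1, in2, out = CUDA.rand(Float32, 1024), CUDA.rand(Float32, 1024), CUDA.zeros(Float32, 1024)
@device_code_ptx @cuda threads=256 blocks=4 testmul_kernel(in1, in2, out)
@device_code_sass @cuda threads=256 blocks=4 testmul_kernel(in1, in2, out)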