The fact that volstack is a UInt32 array and vol is an Int isn’t causing any trouble?
It has to prove that none is, which is harder. Maybe inlining some function helps?
No, the UInt32/Int mismatch is not causing any trouble. It is just a mistake I made while working out a minimal reproducer.
Yes, adding @inline to all functions dealing with State avoids the heap allocation. Now I am starting to understand. When the compiler can see all the code, it can prove that no reference to State leaks, and then it allocates on the stack; otherwise it allocates on the heap to play safe. My problem was that whether some function got inlined or not decided whether the allocation happened, and the allocation triggers a fatal error when running on the GPU. The switch between inlining and not inlining can be caused just by adding a bit more complexity or some innocuous additional statement to the function. This is certainly very fragile, so adding @inline to force inlining helps, but I would guess there is a limit.
It would be better to be more direct and instruct the compiler on the allocation instead of playing with inlining functions. Is there a way to do it?
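The effect described above can be explored on the CPU with a small sketch. Everything here is hypothetical: State, enter!, leave! and walk are stand-ins, not the real navigator code. The @allocated macro reports how many heap bytes one call allocated; whether the result is zero depends on the Julia version and on what the compiler manages to inline, so no particular number is claimed.

```julia
# Hypothetical stand-in for the navigator state (not the real State type)
mutable struct State
    currentVol::Int
    depth::Int
end

# Small helpers that mutate the state; @inline asks the compiler to inline them
@inline function enter!(s::State, vol::Int)
    s.currentVol = vol
    s.depth += 1
    return s
end

@inline function leave!(s::State)
    s.depth -= 1
    return s
end

# The State object never leaves this function, so once the helpers are
# inlined the compiler may be able to avoid the heap allocation
function walk(n::Int)
    s = State(1, 0)
    total = 0
    for vol in 1:n
        enter!(s, vol)
        total += s.currentVol * s.depth
        leave!(s)
    end
    return total
end

walk(1)                        # run once so compilation is not measured
bytes = @allocated walk(1000)  # heap bytes allocated by a single call
println("walk(1000) = ", walk(1000), ", allocated bytes = ", bytes)
```

Replacing @inline with @noinline on the helpers and measuring again is a quick way to see whether the allocation comes back on a given Julia version.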
Maybe a slightly more realistic application would help the GPU people to help you. It is not clear to me what you are porting to the GPU. A vector of those structures? If it is something like that, I think you will need to make the structure immutable.
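As a rough illustration of the immutable route, here is a minimal sketch. NavState and pushvol are hypothetical names, and the fixed-size tuple standing in for the volume stack (depth 4) is an assumption, not the original layout. An immutable struct whose fields are all plain bits types is itself a bits type, so it can live in registers or on the stack, and "updates" are done by returning a new value.

```julia
# Hypothetical immutable navigator state; the stack is a fixed-size tuple
# (maximum depth 4 is an arbitrary choice for this sketch)
struct NavState
    currentVol::Int
    depth::Int
    volstack::NTuple{4,UInt32}
end

# Mutable variant, shown only for comparison
mutable struct MutableNavState
    currentVol::Int
    depth::Int
end

# Instead of mutating in place, build and return an updated copy.
# Assumes s.depth < 4; no bounds handling in this sketch.
function pushvol(s::NavState, vol::Integer)
    d = s.depth + 1
    stack = ntuple(k -> k == d ? UInt32(vol) : s.volstack[k], Val(4))
    return NavState(Int(vol), d, stack)
end

s0 = NavState(1, 1, (UInt32(1), UInt32(0), UInt32(0), UInt32(0)))
s1 = pushvol(s0, 7)

println(isbitstype(NavState))         # true
println(isbitstype(MutableNavState))  # false
println(s1.depth, " ", s1.volstack)
```

The cost of this design is that every helper has to return the new state and the caller has to rebind it (state = pushvol(state, vol)), but it no longer relies on the compiler proving that a mutable object does not escape.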
Thanks very much. My exercise consists of producing a sort of X-ray image (a two-dimensional matrix of densities). I did ask the GPU people, and the answer I got was that my kernel should not allocate memory. This is why I was chasing this very elusive allocation. This is the kernel I am running.
    function k_generateXRay(result, model, lower::Point3{T}, pixel::T, view::Int) where T<:AbstractFloat
        nx, ny = size(result)
        # Global pixel indices of this thread (1-based)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
        if i <= nx && j <= ny
            # Start at the centre of pixel (i, j) on the face selected by `view`
            # and shoot a ray along the corresponding axis
            if view == 1
                point = Point3{T}(lower[1]+kTolerance(), lower[2]+(i-0.5)*pixel, lower[3]+(j-0.5)*pixel)
                dir = Vector3{T}(1,0,0)
            elseif view == 2
                point = Point3{T}(lower[1]+(j-0.5)*pixel, lower[2]+kTolerance(), lower[3]+(i-0.5)*pixel)
                dir = Vector3{T}(0,1,0)
            elseif view == 3
                point = Point3{T}(lower[1]+(i-0.5)*pixel, lower[2]+(j-0.5)*pixel, lower[3]+kTolerance())
                dir = Vector3{T}(0,0,1)
            end
            # Temporary per-pixel navigation state; this is the object that must not be heap-allocated
            state = CuNavigatorState{T}(1)
            locateGlobalPoint!(model, state, point)
            mass::T = 0.0
            step::T = -1.0
            # Accumulate step * density along the ray until no further step is possible
            while step != 0.0
                vol = model.volumes[state.currentVol]
                density = model.materials[vol.materialIdx].density
                step = computeStep!(model, state, point, dir, 1000.)
                point = point + dir * step
                mass += step * density
            end
            @inbounds result[i,j] = mass
        end
        return
    end
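The index arithmetic at the top of the kernel only works if the launch grid covers the whole image. The following CPU-only sketch emulates that arithmetic to check the coverage; the image size (100 x 60) and the 16 x 16 block shape are arbitrary values picked for illustration, not the ones used in the original run.

```julia
# Hypothetical image size and block shape, chosen only for illustration
nx, ny = 100, 60
threads = (16, 16)
# Round up so that partially filled blocks still cover the image edges
blocks = (cld(nx, threads[1]), cld(ny, threads[2]))

# Emulate on the CPU the index computation done by each GPU thread
covered = falses(nx, ny)
for by in 1:blocks[2], bx in 1:blocks[1], ty in 1:threads[2], tx in 1:threads[1]
    i = (bx - 1) * threads[1] + tx
    j = (by - 1) * threads[2] + ty
    if i <= nx && j <= ny   # same guard as in the kernel
        covered[i, j] = true
    end
end

println("blocks = ", blocks, ", every pixel covered: ", all(covered))
```

With CUDA.jl the same two tuples would typically be passed to the launch macro, along the lines of @cuda threads=threads blocks=blocks k_generateXRay(result, model, lower, pixel, view), assuming the arguments already live on the device.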
The CuNavigatorState is used as a temporary structure to navigate inside the geometry for each pixel of the X-ray image. I am very new to Julia and GPU programming, but I have seen a 13x speedup with this exercise, once I declared all the functions related to CuNavigatorState as @inline to avoid the crashes caused by memory allocations on the GPU.