Unable to eliminate an unwanted allocation

The fact that volstack is a UInt32 array and vol is an Int isn’t causing any trouble?

It has to prove that none is, which is harder. Maybe inlining some function helps?

No. Indeed, this is a mistake I made while working out a minimal reproducer.

Yes, adding @inline to all functions dealing with State avoids the heap allocation. Now I am starting to understand. When the compiler can see all the code, it can prove that no reference to State leaks and allocates it on the stack; otherwise it plays safe and allocates on the heap. My problem was that whether some function got inlined or not decided whether the allocation happened, which triggered a fatal error when running on the GPU. The switch between inlining and not inlining can be caused just by adding a bit more complexity or some innocuous additional statement to the function. This is certainly very fragile, so adding @inline to force inlining helps, but I would guess there is a limit.
It would be better to be more direct and instruct the compiler on the allocation instead of playing with inlining functions. Is there a way to do it?
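To make the effect described above concrete, here is a minimal CPU-side sketch. The `State` type and the helper names are hypothetical stand-ins, not the real navigator code; the only point is to compare the same mutating helper when it is forced inline and when it is forced out of line. The reported byte counts may differ between Julia versions, so none are quoted here.

```julia
# Hypothetical stand-in for the navigator state; not the real State type.
mutable struct State
    current::Int
    depth::Int
end

# The same helper twice: once forced inline, once forced out of line.
@inline   push_inlined!(s::State, v::Int)  = (s.depth += 1; s.current = v; s)
@noinline push_outlined!(s::State, v::Int) = (s.depth += 1; s.current = v; s)

function walk_inlined(n::Int)
    s = State(1, 0)              # never leaves this function
    for v in 1:n
        push_inlined!(s, v)      # body is visible to the optimizer
    end
    return s.current + s.depth
end

function walk_outlined(n::Int)
    s = State(1, 0)
    for v in 1:n
        push_outlined!(s, v)     # opaque call: s is treated as possibly escaping
    end
    return s.current + s.depth
end

# Run once first so that @allocated does not count compilation.
walk_inlined(10); walk_outlined(10)

println("inlined:  ", @allocated(walk_inlined(10)), " bytes")
println("outlined: ", @allocated(walk_outlined(10)), " bytes")
```

If the two numbers differ on your machine, that difference is the allocation that appears and disappears as the inlining decision flips.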

Maybe a slightly more realistic application would be useful for the GPU people to help. For me it is not clear what you are porting to the GPU. A vector of those structures? If it is something like that, I think you will need to make it immutable.
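As a sketch of what the immutable route could look like (the type and function names below are made up for illustration and are not the real code): the state becomes a plain `struct`, and every function that used to mutate it returns an updated copy instead. An immutable struct whose fields are all plain bits types is itself a bits type and is handled by value, so this pattern should not depend on inlining decisions. The helper is marked `@noinline` on purpose to exercise that case.

```julia
# Hypothetical immutable navigator state; field names are illustrative only.
struct NavState
    currentVol::Int
    depth::Int
end

# Instead of mutating, return an updated copy.
# @noinline is deliberate: the pattern should not rely on inlining.
@noinline enter(s::NavState, vol::Int) = NavState(vol, s.depth + 1)

function walk(n::Int)
    s = NavState(1, 0)
    for v in 1:n
        s = enter(s, v)          # rebind the local to the new value
    end
    return s.currentVol + s.depth
end

@assert isbitstype(NavState)     # plain bits, no reference to track

walk(10)                         # compile first
println(@allocated(walk(10)), " bytes")
```

A state that also carries a stack of volumes would need a fixed-size container such as a tuple in place of an array for the struct to remain a bits type; that part is not shown here.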

Thanks very much. My exercise consists of producing a sort of X-ray image (a 2-dimensional matrix of densities). I did ask the GPU people, and the answer I got is that my kernel should not allocate memory. This is why I was chasing this very elusive allocation. This is the kernel I am running.

XRay kernel
function k_generateXRay(result, model, lower::Point3{T}, pixel::T, view::Int) where T<:AbstractFloat
    nx, ny = size(result)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= nx && j <= ny
        if view == 1
            point = Point3{T}(lower[1]+kTolerance(), lower[2]+(i-0.5)*pixel, lower[3]+(j-0.5)*pixel)
            dir = Vector3{T}(1,0,0)
        elseif view == 2
            point = Point3{T}(lower[1]+(j-0.5)*pixel, lower[2]+kTolerance(), lower[3]+(i-0.5)*pixel)
            dir = Vector3{T}(0,1,0)
        elseif view == 3
            point = Point3{T}(lower[1]+(i-0.5)*pixel, lower[2]+(j-0.5)*pixel, lower[3]+kTolerance())
            dir = Vector3{T}(0,0,1)
        end
        state = CuNavigatorState{T}(1)
        locateGlobalPoint!(model, state, point)
        mass::T =  0.0
        step::T = -1.0
        while step != 0.0
            vol = model.volumes[state.currentVol]
            density = model.materials[vol.materialIdx].density
            step = computeStep!(model, state, point, dir, 1000.)
            point = point + dir * step
            mass += step * density
        end
        @inbounds result[i,j] = mass
    end
    return
end
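For readers following the indexing in the kernel: each thread handles one pixel (i, j), and the `i <= nx && j <= ny` guard discards threads that fall outside the image. A host-side sketch of choosing a launch grid that covers every pixel is shown below. The 16×16 block size, the image size and the `launch_grid` helper are assumptions for illustration; the commented `@cuda` line shows how such a launch would typically be written with CUDA.jl and is not taken from the original post.

```julia
# Number of blocks per dimension needed to cover `dims` pixels
# with `threads` threads per block (rounding up).
launch_grid(dims::NTuple{2,Int}, threads::NTuple{2,Int}) = cld.(dims, threads)

threads = (16, 16)               # assumed block size
dims    = (100, 70)              # assumed image size (nx, ny)
blocks  = launch_grid(dims, threads)

# Every pixel is covered, possibly with some idle threads at the edges.
@assert all(blocks .* threads .>= dims)

println("threads = ", threads, ", blocks = ", blocks)

# With CUDA.jl the launch would then look roughly like:
# @cuda threads=threads blocks=blocks k_generateXRay(result, model, lower, pixel, view)
```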

The CuNavigatorState is used as a temporary structure to navigate inside the geometry for each pixel of the X-ray image. I am very new to Julia and GPU programming, but I have seen a speedup of 13x with this exercise, once I declared all the functions related to CuNavigatorState as @inline to avoid the crashes due to memory allocations on the GPU.