Unable to eliminate an unwanted allocation

lmiq · May 13, 2022, 3:07pm

The fact that volstack is a UInt32 array and vol is an Int isn’t causing any trouble?

It has to prove that none is, which is harder. Maybe inlining some function helps?

peremato · May 16, 2022, 2:46pm

No. Indeed, that was a mistake I made when working out the minimal reproducer.

Yes, adding @inline to all functions dealing with State avoids the heap allocation. Now I am starting to understand. When the compiler can see all the code, it can prove that no reference to State escapes and allocates it on the stack; otherwise it allocates on the heap to play it safe. My problem was that whether some function got inlined or not decided whether the allocation happened, and the allocation triggered a fatal error when running on the GPU. The switch between inlining and not inlining can be caused just by adding a bit more complexity or some innocuous extra statement to the function. This is certainly very fragile. Adding @inline to force inlining helps, but I would guess there is a limit.
It would be better to be more direct and instruct the compiler about the allocation instead of playing with function inlining. Is there a way to do that?
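
To see the effect on the CPU, here is a minimal sketch. The two-field State and the push_*!/run_* functions are hypothetical stand-ins, not the real code. It only measures the two variants with @allocated; the numbers depend on the Julia version and on what the optimizer decides, so none are promised here.

```julia
# Hypothetical stand-in for the mutable State type discussed above.
mutable struct State
    currentVol::Int
    depth::Int
end

# Forced inlining: the body is visible to the compiler at the call site.
@inline function push_inlined!(s::State, vol::Int)
    s.currentVol = vol
    s.depth += 1
    return nothing
end

# Inlining forbidden: the State is passed by reference to an opaque call.
@noinline function push_opaque!(s::State, vol::Int)
    s.currentVol = vol
    s.depth += 1
    return nothing
end

function run_inlined(n::Int)
    s = State(1, 0)
    for v in 1:n
        push_inlined!(s, v)
    end
    return s.currentVol + s.depth
end

function run_opaque(n::Int)
    s = State(1, 0)
    for v in 1:n
        push_opaque!(s, v)
    end
    return s.currentVol + s.depth
end

# Warm up first so that compilation is not counted as allocation.
run_inlined(10); run_opaque(10)

println("inlined: ", @allocated(run_inlined(10)), " bytes")
println("opaque:  ", @allocated(run_opaque(10)), " bytes")
```

Both variants return the same value; only the reported byte counts may differ, which is the fragility described above.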

lmiq · May 16, 2022, 3:03pm

Maybe a slightly more realistic example would make it easier for the GPU people to help. For me it is not clear what you are porting to the GPU. A vector of those structures? If it is something like that, I think you will need to make the struct immutable.
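
A rough sketch of what I mean by immutable, assuming a hypothetical two-field state (the names ImmutableState, enter and walk are made up for illustration). Instead of mutating fields, each update returns a new value; such a struct is an isbits type, so there is no object whose reference could escape.

```julia
# Hypothetical immutable counterpart of a small navigation state.
struct ImmutableState
    currentVol::Int
    depth::Int
end

# "Update" by building a new value instead of mutating in place.
enter(s::ImmutableState, vol::Int) = ImmutableState(vol, s.depth + 1)

function walk(n::Int)
    s = ImmutableState(1, 0)
    for v in 1:n
        s = enter(s, v)   # rebind the local variable to the new value
    end
    return s
end

# An immutable struct with only plain-data fields is an isbits type.
println(isbitstype(ImmutableState))
println(walk(5))
```

The price is that every function that used to mutate the state now has to return the new state and the caller has to rebind it.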

peremato · May 16, 2022, 3:35pm

Thanks very much. My exercise consists of producing a sort of X-ray image (a two-dimensional matrix of densities). I did ask the GPU people, and the answer I got is that my kernel should not allocate memory. This is why I was chasing this very elusive allocation. This is the kernel I am running.

XRay kernel

function k_generateXRay(result, model, lower::Point3{T}, pixel::T, view::Int) where T<:AbstractFloat
    nx, ny = size(result)
    # Global (1-based) pixel indices handled by this thread
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= nx && j <= ny
        # Start point at the pixel centre, just past the lower bound along the
        # view axis, and the ray direction along that axis
        if view == 1
            point = Point3{T}(lower[1]+kTolerance(), lower[2]+(i-0.5)*pixel, lower[3]+(j-0.5)*pixel)
            dir = Vector3{T}(1,0,0)
        elseif view == 2
            point = Point3{T}(lower[1]+(j-0.5)*pixel, lower[2]+kTolerance(), lower[3]+(i-0.5)*pixel)
            dir = Vector3{T}(0,1,0)
        elseif view == 3
            point = Point3{T}(lower[1]+(i-0.5)*pixel, lower[2]+(j-0.5)*pixel, lower[3]+kTolerance())
            dir = Vector3{T}(0,0,1)
        end
        # Per-thread temporary navigation state
        state = CuNavigatorState{T}(1)
        locateGlobalPoint!(model, state, point)
        mass::T = 0.0
        step::T = -1.0
        # Accumulate step * density along the ray until no further step is possible
        while step != 0.0
            vol = model.volumes[state.currentVol]
            density = model.materials[vol.materialIdx].density
            step = computeStep!(model, state, point, dir, 1000.)
            point = point + dir * step
            mass += step * density
        end
        @inbounds result[i,j] = mass
    end
    return
end

The CuNavigatorState is used as a temporary structure to navigate inside the geometry for each pixel of the X-ray image. I am very new to Julia and GPU programming, but I have seen a speedup of 13x with this exercise once I declared all the functions related to CuNavigatorState as @inline to avoid the crashes due to memory allocations on the GPU.
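
For readers without the full geometry code, the accumulation loop of the kernel can be tried on the CPU with a toy one-dimensional geometry. This is only a sketch under stated assumptions: Slabs, compute_step and line_mass are hypothetical names that stand in for the model, computeStep! and the while loop, and are not part of the real code.

```julia
# Hypothetical 1D geometry: consecutive slabs along x.
struct Slabs{T<:AbstractFloat}
    edges::Vector{T}     # n+1 increasing slab boundaries
    density::Vector{T}   # n densities, one per slab
end

# Distance from x to the next boundary in +x and the density of the
# slab containing x; returns a zero step once x is outside all slabs.
function compute_step(g::Slabs{T}, x::T) where T
    for k in 1:length(g.density)
        if g.edges[k] <= x < g.edges[k+1]
            return (g.edges[k+1] - x, g.density[k])
        end
    end
    return (zero(T), zero(T))
end

# Same idea as the kernel loop: sum step * density until the step is zero.
function line_mass(g::Slabs{T}, x0::T) where T
    x = x0
    mass = zero(T)
    while true
        step, rho = compute_step(g, x)
        step == 0 && break
        mass += step * rho
        x += step
    end
    return mass
end

g = Slabs([0.0, 1.0, 3.0], [2.0, 0.5])
println(line_mass(g, 0.0))   # 1.0 * 2.0 + 2.0 * 0.5
```

Here the state is just the local variable x, so there is nothing to allocate; the real kernel needs CuNavigatorState because locating a point in a volume hierarchy carries more information than a single coordinate.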
