I don't understand why it is slower with CuStaticSharedArray

Hello,
I am new to Julia and I am trying to use an iterative scheme to solve a force balance equation. Concretely, the kernel without shared memory is quite simple:

function kernel_comp_v_noshmem!(vn, v, F)
    # Indexing (i,j)
    i::Int32 = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    j::Int32 = (blockIdx().y - Int32(1)) * blockDim().y + threadIdx().y

    i_::Int32 = i== 1 ? N : i-1; ip::Int32 = i== N ? 1 : i+1
    jm::Int32 = j== 1 ? N : j-1; jp::Int32 = j== N ? 1 : j+1

    @inbounds begin
        vn[i,j,1] = 1/(1+6/Δ2)*(F[i,j,1] + (   2*(v[i_,j,1]+v[ip,j,1]) + v[i,jm,1]+v[i,jp,1] + 0.25*(v[ip,jp,2]+v[i_,jm,2]-v[i_,jp,2]-v[ip,jm,2])  )/Δ2)
        vn[i,j,2] = 1/(1+6/Δ2)*(F[i,j,2] + (   2*(v[i,jm,2]+v[i,jp,2]) + v[i_,j,2]+v[ip,j,2] + 0.25*(v[ip,jp,1]+v[i_,jm,1]-v[i_,jp,1]-v[ip,jm,1])  )/Δ2)
    end

    return nothing
end

To avoid doing a lot of global memory accesses, I wanted to use shared memory:

function kernel_comp_v!(vn, v, F)
    tx = threadIdx().x
    ty = threadIdx().y
    # Global indices (i, j)
    i = (blockIdx().x - 1) * blockDim().x + tx
    j = (blockIdx().y - 1) * blockDim().y + ty
    # Local indices (li, lj)
    li = tx+1; lip = li+1; lim=tx 
    lj = ty+1; ljp = lj+1; ljm=ty

    @inbounds begin
        v_shared = CuStaticSharedArray(Tf, (T2,T2,2))
        # Borders v[1,2:T+1,:], v[T+2,2:T+1,:], v[2:T+1,1,:], v[2:T+1,T+2,:]
        if tx==1
            i_ = i==1 ? N : i-1
            v_shared[1,lj,1] = v[i_,j,1]
            v_shared[1,lj,2] = v[i_,j,2]
        elseif tx==T
            ip = i== N ? 1 : i+1
            v_shared[T2,lj,1] = v[ip,j,1]
            v_shared[T2,lj,2] = v[ip,j,2]
        end
        if ty==1
            jm = j==1 ? N : j-1
            v_shared[li,1,1] = v[i,jm,1]
            v_shared[li,1,2] = v[i,jm,2]
        elseif ty==T
            jp = j==N ? 1 : j+1
            v_shared[li,T2,1] = v[i,jp,1]
            v_shared[li,T2,2] = v[i,jp,2]
        end
        # Corners v[1,1,:], v[1,T+2,:], v[T+2,1,:], v[T+2,T+2,:]
        # I use tx/y == 2 and tx/y == T-1 to do fewer operations on the threads with tx/y = 1 and tx/y = T
        if tx == 2 && ty == 2 # -> [1,1,:]
            i_ = i<3 ? (i==2 ? N : N-1) : i-2
            jm = j<3 ? (j==2 ? N : N-1) : j-2
            v_shared[1,1,1] = v[i_,jm,1]
            v_shared[1,1,2] = v[i_,jm,2]
        elseif tx == 2 && ty == T-1 # -> [1,T+2,:]
            i_ = i<3 ? (i==2 ? N : N-1) : i-2
            jp = j>N-2 ? (j==N ? 2 : 1) : j+2
            v_shared[1,T2,1] = v[i_,jp,1]
            v_shared[1,T2,2] = v[i_,jp,2]
        elseif tx == T-1 && ty == 2 # -> [T+2,1,:]
            ip = i>N-2 ? (i==N ? 2 : 1) : i+2
            jm = j<3 ? (j==2 ? N : N-1) : j-2
            v_shared[T2,1,1] = v[ip,jm,1]
            v_shared[T2,1,2] = v[ip,jm,2]
        elseif tx == T-1 && ty == T-1 # -> [T+2,T+2,:]
            ip = i>N-2 ? (i==N ? 2 : 1) : i+2
            jp = j>N-2 ? (j==N ? 2 : 1) : j+2
            v_shared[T2,T2,1] = v[ip,jp,1]
            v_shared[T2,T2,2] = v[ip,jp,2]
        end
        v_shared[li,lj,1] = v[i,j,1]
        v_shared[li,lj,2] = v[i,j,2]

        # synchronize threads
        sync_threads()

        vn[i,j,1] = 1/(1+6/Δ2)*(F[i,j,1] + (   2*(v_shared[lim,lj,1]+v_shared[lip,lj,1]) + v_shared[li,ljm,1]+v_shared[li,ljp,1] + 0.25*(v_shared[lip,ljp,2]+v_shared[lim,ljm,2]-v_shared[lim,ljp,2]-v_shared[lip,ljm,2])  )/Δ2)
        vn[i,j,2] = 1/(1+6/Δ2)*(F[i,j,2] + (   2*(v_shared[li,ljm,2]+v_shared[li,ljp,2]) + v_shared[lim,lj,2]+v_shared[lip,lj,2] + 0.25*(v_shared[lip,ljp,1]+v_shared[lim,ljm,1]-v_shared[lim,ljp,1]-v_shared[lip,ljm,1])  )/Δ2)
                    
    end

    return nothing
end

To measure performance I use

@benchmark CUDA.@sync begin 
   comp_v!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
end

Without shared memory: 630µs on A2000 ada, 112µs on A30
With shared memory: 637µs on A2000 ada, 143µs on A30
All CuArrays use FP64 precision. Setting the type of all the indices to Int32 doesn't impact the benchmark or the number of registers…
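For completeness, this is roughly how I check the register count (a small sketch; I'm assuming CUDA.jl's reflection helpers CUDA.registers, CUDA.maxthreads and CUDA.memory, applied to the kernel object returned by @cuda launch=false):

comp_v! = @cuda launch=false kernel_comp_v_noshmem!(v_temp, v, F)

# Per-thread register usage of the compiled kernel
@show CUDA.registers(comp_v!)
# Maximum threads per block for this kernel
@show CUDA.maxthreads(comp_v!)
# Local / shared / constant memory usage
@show CUDA.memory(comp_v!)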

Thank you for your help,
Best


Hi,

Could you also post the (const) values of N, Δ2 (1e-4?), Tf (Float64?), T2, T, etc.? In other words, please provide all the necessary code so that others can also run it.

Did you try timing the shared memory loading itself (by returning after the sync_threads())? Given how complicated the code looks, my guess is simply that loading to shared memory currently takes up more time than you gain from using it. Perhaps using 1D indexing (instead of 2D indexing: (i, j)) might be faster?
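Concretely, something like this (just a sketch; the body up to sync_threads() is identical to your kernel_comp_v!, the only change is the early return):

function kernel_comp_v_loadonly!(vn, v, F)
    # same indexing and shared-memory loading as in kernel_comp_v!
    # (...)
    sync_threads()
    return nothing  # return here, before the stencil update, so only the loading is timed
end

(Note that the compiler may optimize away loads whose results are never used, so treat the timing as a rough indication only.)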

Hello,
Thank you, here is some information:

@show const Ti = Int32
@show const Tf = Float64

@show const Δ::Tf = Tf(df[:dx])
const Δ2::Tf = Δ*Δ
const L = Tf(df[:L])
@show Napprox = L / Δ 

@show const T::Ti = 32 # for A30 or 16 for RTX A2000 ada
@show const B::Ti = ceil(Int, Napprox/T)
@show block_dim = (T, T) # 1024 threads for A30, 512 threads for A2000
@show grid_dim = (B, B) # 32x32 for A30, 63x63 for A2000

@show const N::Ti = T * B
const T2::Ti = T+2

As I compute derivatives, without shared memory each thread needs to access global memory 16 times (8 stencil values per velocity component). With shared memory each thread loads at most 6 values from global memory: its own cell (2 components), plus at most one border cell and one corner cell of the halo (2 components each). The multiple ifs are there to ensure that no thread does more than 6 global loads when filling the shared tile.

I tried to follow this example:

[figure: shared-memory tiling diagram (source: Lecture 10)]

Here T×T represents the red array, and T2×T2 the red+yellow array.

I hope I was able to explain a bit what I am trying to do.
I will try to reduce the maximum number of global memory accesses per thread to 4 and see if this changes anything. I don't expect a big difference.

Thank you again,


Thanks. Such diagrams are always useful.

But regarding the code, could you provide it in a form we can run just by copy-pasting? I.e. something of the form

using CUDA, BenchmarkTools
(...)

const Ti = Int32
const Tf = Float64

const Δ::Tf = Tf(1e-2)  # (we don't know what your df contains)
(...)

function kernel_comp_v_noshmem!(...)
    (...)
end

function kernel_comp_v!(...)
    (...)
end

v = (...)  # e.g. CUDA.rand(...)
(...)

display(@benchmark CUDA.@sync begin 
   kernel_comp_v_noshmem!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
end)
display(@benchmark CUDA.@sync begin 
   comp_v!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
end)

See also (point 4 in)

Sure, sorry, I should have done it from the beginning. Here it is:

using CUDA, BenchmarkTools
CUDA.versioninfo()

@show const Ti = Int32
@show const Tf = Float32

@show const Δ::Tf = Tf(1e-2)
const Δ2::Tf = Δ*Δ
@show Napprox = 1000

@show const T::Ti = 16
@show const B::Ti = ceil(Int, Napprox/T)
@show block_dim = (T, T)
@show grid_dim = (B, B)

@show const N::Ti = T * B
const T2::Ti = T+2

const prefactor::Tf = 1/(1+6/Δ2)
const Δi::Tf = 1/Δ
const Δ2i::Tf = 1/Δ2

set_zero_subnormals(true)

function kernel_comp_v!(vn, v, F)
    i = (blockIdx().x - Ti(1)) * blockDim().x + threadIdx().x
    j = (blockIdx().y - Ti(1)) * blockDim().y + threadIdx().y

    i_::Ti = i==Ti(1) ? N : i-Ti(1); ip::Ti = i== N ? Ti(1) : i+Ti(1)
    jm::Ti = j==Ti(1) ? N : j-Ti(1); jp::Ti = j== N ? Ti(1) : j+Ti(1)

    @inbounds begin
        vn[i,j,1] = prefactor*(F[i,j,1] + (   Tf(2)*(v[i_,j,1]+v[ip,j,1]) + v[i,jm,1]+v[i,jp,1] + Tf(0.25)*(v[ip,jp,2]+v[i_,jm,2]-v[i_,jp,2]-v[ip,jm,2])  )*Δ2i)
        vn[i,j,2] = prefactor*(F[i,j,2] + (   Tf(2)*(v[i,jm,2]+v[i,jp,2]) + v[i_,j,2]+v[ip,j,2] + Tf(0.25)*(v[ip,jp,1]+v[i_,jm,1]-v[i_,jp,1]-v[ip,jm,1])  )*Δ2i)
    end

    return nothing
end

function kernel_comp_v_shmem!(vn, v, F)
    # Indexing (i,j)
    tx::Ti = threadIdx().x
    ty::Ti = threadIdx().y
    i::Ti = (blockIdx().x - Ti(1)) * blockDim().x + tx
    j::Ti = (blockIdx().y - Ti(1)) * blockDim().y + ty
    li::Ti = tx+Ti(1); lip::Ti = li+Ti(1); lim::Ti=tx
    lj::Ti = ty+Ti(1); ljp::Ti = lj+Ti(1); ljm::Ti=ty

    @inbounds begin
        # Load in block shared memory
        v_shared = CuStaticSharedArray(Tf, (T2,T2,2))
        # Borders v[1,2:T+1,:], v[T+2,2:T+1,:], v[2:T+1,1,:], v[2:T+1,T+2,:]
        if tx==1
            i_ = i==1 ? N : i-Ti(1)
            v_shared[1,lj,1] = v[i_,j,1]
            v_shared[1,lj,2] = v[i_,j,2]
        elseif tx==T
            ip = i==N ? Ti(1) : i+Ti(1)
            v_shared[T2,lj,1] = v[ip,j,1]
            v_shared[T2,lj,2] = v[ip,j,2]
        end
        if ty==1
            jm = j==1 ? N : j-Ti(1)
            v_shared[li,1,1] = v[i,jm,1]
            v_shared[li,1,2] = v[i,jm,2]
        elseif ty==T
            jp = j==N ? Ti(1) : j+Ti(1)
            v_shared[li,T2,1] = v[i,jp,1]
            v_shared[li,T2,2] = v[i,jp,2]
        end
        # Corners v[1,1,:], v[1,T+2,:], v[T+2,1,:], v[T+2,T+2,:]
        # I use tx/y == 2 and tx/y == T-1 to do fewer operations on the threads with tx/y = 1 and tx/y = T
        if tx == 2 && ty == 2 # -> [1,1,:]
            i_ = i<3 ? (i==2 ? N : N-Ti(1)) : i-Ti(2)
            jm = j<3 ? (j==2 ? N : N-Ti(1)) : j-Ti(2)
            v_shared[1,1,1] = v[i_,jm,1]
            v_shared[1,1,2] = v[i_,jm,2]
        elseif tx == 2 && ty == T-1 # -> [1,T+2,:]
            i_ = i<3 ? (i==2 ? N : N-Ti(1)) : i-Ti(2)
            jp = j>N-2 ? (j==N ? Ti(2) : Ti(1)) : j+Ti(2)
            v_shared[1,T2,1] = v[i_,jp,1]
            v_shared[1,T2,2] = v[i_,jp,2]
        elseif tx == T-1 && ty == 2 # -> [T+2,1,:]
            ip = i>N-2 ? (i==N ? Ti(2) : Ti(1)) : i+Ti(2)
            jm = j<3 ? (j==2 ? N : N-Ti(1)) : j-Ti(2)
            v_shared[T2,1,1] = v[ip,jm,1]
            v_shared[T2,1,2] = v[ip,jm,2]
        elseif tx == T-1 && ty == T-1 # -> [T+2,T+2,:]
            ip = i>N-2 ? (i==N ? Ti(2) : Ti(1)) : i+Ti(2)
            jp = j>N-2 ? (j==N ? Ti(2) : Ti(1)) : j+Ti(2)
            v_shared[T2,T2,1] = v[ip,jp,1]
            v_shared[T2,T2,2] = v[ip,jp,2]
        end
        v_shared[li,lj,1] = v[i,j,1]
        v_shared[li,lj,2] = v[i,j,2]

        # synchronize threads
        sync_threads()

        vn[i,j,1] = prefactor*(F[i,j,1] + (   Tf(2)*(v_shared[lim,lj,1]+v_shared[lip,lj,1]) + v_shared[li,ljm,1]+v_shared[li,ljp,1] + Tf(0.25)*(v_shared[lip,ljp,2]+v_shared[lim,ljm,2]-v_shared[lim,ljp,2]-v_shared[lip,ljm,2])  )*Δ2i)
        vn[i,j,2] = prefactor*(F[i,j,2] + (   Tf(2)*(v_shared[li,ljm,2]+v_shared[li,ljp,2]) + v_shared[lim,lj,2]+v_shared[lip,lj,2] + Tf(0.25)*(v_shared[lip,ljp,1]+v_shared[lim,ljm,1]-v_shared[lim,ljp,1]-v_shared[lip,ljm,1])  )*Δ2i)
                    
    end

    return nothing
end

v = CUDA.zeros(Tf, N, N, Ti(2));
v_temp = CUDA.zeros(Tf, N, N, Ti(2));
F = CUDA.zeros(Tf, N, N, Ti(2));
F[:,:,1] .= 1.0f0;

comp_v! = @cuda launch=false kernel_comp_v!(v_temp, v, F)
comp_v_shmem! = @cuda launch=false kernel_comp_v_shmem!(v_temp, v, F)

function iterative_step!(v, v_temp, F)
    for _=1:1000
        comp_v!(v_temp, v, F; threads = block_dim, blocks = grid_dim)
        comp_v!(v, v_temp, F; threads = block_dim, blocks = grid_dim)
    end 
end
function iterative_step_shmem!(v, v_temp, F)
    for _=1:1000
        comp_v_shmem!(v_temp, v, F; threads = block_dim, blocks = grid_dim)
        comp_v_shmem!(v, v_temp, F; threads = block_dim, blocks = grid_dim)
    end 
end

display(@benchmark CUDA.@sync begin
    iterative_step!($v, $v_temp, $F)
end)

display(@benchmark CUDA.@sync begin
    iterative_step_shmem!($v, $v_temp, $F)
end)

In this case I get 100ms without shared memory and 130ms with…
Thank you

You may find it useful to profile the two kernels with NVIDIA Nsight Compute. It integrates fantastically with CUDA.jl and provides a lot of information. See here.

I've run both kernels through the profiler and it looks like bank conflicts might be part of the issue. The profiler suggests you could obtain a 40% speed-up by removing bank conflicts.

Your data looks small enough to fit in the L2 cache, so you weren't paying the full penalty of global memory access before. Shared memory is not dramatically faster than the L2 cache, so the bank conflicts could easily be erasing the benefit.

Note: it's also easier to coalesce global reads in the shared memory version (you're doing it somewhat, but there's still some performance to eke out by doing it fully coalesced).
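For the bank conflicts, a common trick is to pad the leading (fastest-varying) dimension of the shared array so that successive columns of the tile start in different banks. Just a sketch, untested on your kernel, so you'd have to re-profile and check whether Nsight Compute's shared-memory section still reports conflicts:

# Hypothetical one-line change inside kernel_comp_v_shmem!:
# add one row of padding; all the indexing stays exactly the same,
# only the stride between columns changes from T2 to T2+1.
v_shared = CuStaticSharedArray(Tf, (T2 + 1, T2, 2))

The cost is a little extra shared memory per block, which is usually negligible.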


Thank you for your reply,

I’ve never been able to make it work, but I’ll try again.

I always believed that arrays are stored in global memory by default. Are the arrays loaded into the L2 cache when the kernel is called? If I increase the grid size or go to 3D, do you think I will see a difference?

I don't understand the rest of your message; I'll google "bank conflict", "coalesced global reads", etc. I saw similar terms in the Cooperative Groups section of CUDA.jl :)

Sorry this is all new to me.

In case it helps, I also only recently managed to get it to work. In my case the problems were with permissions and with the supplied example.

  • On Windows you can run ncu in a terminal (e.g. cmd) started as admin. On Ubuntu the obvious approach of sudo ncu ... didn't work, because it switches user and I only had Julia set up for my normal user account. I then simply followed the instructions here, creating /etc/modprobe.d/nvprof.conf and updating the initramfs via sudo update-initramfs -u. This still requires root access at that point, of course, but afterwards you can run ncu without sudo.
  • The example a = CUDA.rand(1024, 1024, 1024); sin.(a); uses 8 GiB of VRAM, causing a launch error if you do not have that much. In hindsight this is completely obvious, but at the time it took me a moment to figure out I just had to reduce the size of a. :)

Ah, my bad! Your shared memory kernel actually looked pretty good so I assumed you had more experience.

You might want to check out this YouTube tutorial series. It explains some of the deeper considerations when designing a CUDA kernel. It’s using C++ but the syntax isn’t that dissimilar from CUDA.jl so you’ll be able to pick up the key ideas.

I always believed that arrays are stored in global memory by default. Are the arrays loaded into the L2 cache when the kernel is called?

The video series I linked explains this much more clearly than I will be able to, but the rough idea is that all reads from global memory into registers/shared memory go through the L2 cache. If you access the same 128-byte cache line from different threads before it has been evicted, the second access is served from the L2 cache rather than directly from global memory.


Thank you very much! The tutorial is very clear and provides a perfect answer to my question.