Accessing array elements too slow?

Hi,
I think I could use some help here… As a part of a Monte Carlo code I’m developing in julia+CUDA, I have a function that takes several input parameters, performs a calculation with them, and stores the result in a CUDA.array that is the first argument being passed to the function.

function Fdrift_CUDA_kernel_xy(Fk,r_ij,Np,dim,Lbox,Nw,
         x_fxy,y_fxy,fxy,dfxy_dx,dfxy_dy)
 
    T                 = eltype(Fk)
    
    Zero              = convert(T,0.f0)
    One               = convert(T,1.f0)
    Two               = convert(T,2.f0)
    dot_Five          = convert(T,0.5f0)
    
    Nx , Ny           = size(fxy)
    dx                = x_fxy[2]-x_fxy[1]
    dy                = y_fxy[2]-y_fxy[1]
    
    Npij              = div(Np*(Np-1),2)
    TwoNp1            = 2*Np+1

    idx_x             = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    str_x             = blockDim().x * gridDim().x    

    @inbounds for i in idx_x:str_x:size(r_ij,1)
 
        x_ij          = abs(r_ij[i,1])
        y_ij          = abs(r_ij[i,2])
        sign_x_ij     = sign(r_ij[i,1])
        sign_y_ij     = sign(r_ij[i,2])
       
        idx_x         = unsafe_trunc(Int,x_ij/dx)
        idx_y         = unsafe_trunc(Int,y_ij/dy)
        
        if  idx_x > Nx || idx_y > Ny
            f00       = f10       = f01       = f11       = One
            df00_dx   = df10_dx   = df01_dx   = df11_dx   = Zero
            df00_dy   = df10_dy   = df01_dy   = df11_dy   = Zero
        else
            f00       = fxy[idx_x  ,idx_y]
            f10       = fxy[idx_x+1,idx_y]
            f01       = fxy[idx_x  ,idx_y+1]
            f11       = fxy[idx_x+1,idx_y+1]
            df00_dx   = dfxy_dx[idx_x  ,idx_y]
            df10_dx   = dfxy_dx[idx_x+1,idx_y]
            df01_dx   = dfxy_dx[idx_x  ,idx_y+1]
            df11_dx   = dfxy_dx[idx_x+1,idx_y+1]
            df00_dy   = dfxy_dy[idx_x  ,idx_y]
            df10_dy   = dfxy_dy[idx_x+1,idx_y]
            df01_dy   = dfxy_dy[idx_x  ,idx_y+1]
            df11_dy   = dfxy_dy[idx_x+1,idx_y+1]            
        end
            
        αx           = x_ij/dx - idx_x              
        αy           = y_ij/dy - idx_y
   
        f            = f00*(One-αx)*(One-αy) + 
                       f10*αx*(One-αy) + 
                       f10*(One-αx)*αy + 
                       f11*αx*αy 
        df_dx        = df00_dx*(One-αx)*(One-αy) + 
                       df10_dx*αx*(One-αy) + 
                       df10_dx*(One-αx)*αy + 
                       df11_dx*αx*αy 
        df_dy        = df00_dy*(One-αx)*(One-αy) + 
                       df10_dy*αx*(One-αy) + 
                       df10_dy*(One-αx)*αy + 
                       df11_dy*αx*αy 
      
        iw           = div(i-1,Npij)   
        idx          = i - iw*Npij     
        ii           = unsafe_trunc(Int,dot_Five*(TwoNp1  - sqrt(TwoNp1^2-8*(idx+Np-1))))
        jj           = idx - Np*(ii-1) + div(ii*(ii-1),2) + ii
        
        Fmod_x       = Two*df_dx/f
        Fmod_y       = Two*df_dy/f
        
        ii_iw                = ii+Np*iw
        jj_iw                = jj+Np*iw
        @atomic Fk[ii_iw,1] += Fmod_x
        @atomic Fk[jj_iw,1] -= Fmod_x
        @atomic Fk[ii_iw,2] += Fmod_y
        @atomic Fk[jj_iw,2] -= Fmod_y        

    end

    nothing
    
end;

I can test this as follows

Fk   = CUDA.zeros(10_000,2)
r_ij = CUDA.randn(995000,2)
Np   = 200
dim  = 2
Lbox = 1.f0
Nw   = 50
x_fxy   = CuArray(collect(range(0.f0,step=0.01f0,length=512)))
y_fxy   = CuArray(collect(range(0.f0,step=0.01f0,length=512)))
fxy     = CUDA.cos.(x_fxy .+ y_fxy')
dfxy_dx = CUDA.sin.(x_fxy .+ y_fxy')
dfxy_dy = CUDA.sin.(x_fxy .+ y_fxy')

numblocks     = 256
@btime CUDA.@sync @cuda threads = 256 blocks = numblocks Fdrift_CUDA_kernel_xy(Fk,r_ij,Np,dim,Lbox,Nw,
         x_fxy,y_fxy,fxy,dfxy_dx,dfxy_dy);

yielding on my ubuntu machine + GTX1050Ti

1.592 ms (1768 allocations: 29.30 KiB)

However, I can dramatically reduce this time if I completely remove the 4 @atomic lines at the end. In that case I get

  18.404 μs (56 allocations: 2.47 KiB)

but of course, the function then becomes meaningless as it doen’t produce any result. If I keep the @atomic lines but remove the @atomic decorator, I get

  1.349 ms (1094 allocations: 18.69 KiB)

which once again is quite a lot compared with the previous result. Now I thought the bottleneck was the @atomic decorator, but I see it is not the most fundamental factor here. It seems to me as if accessing and writing to the Fk array was the culprit, but I can not understand why, as I’m accessing other arrays in the function and they are treated much faster, as you can see.

So maybe I’m doing something wrong? Or is it like this and that’s it?
Any hint would be greatly appreciated :slight_smile:

Best,

Ferran.

In this case, maybe the compiler optimizes away a bunch of stuff so you aren’t timing what you think you are timing?

1 Like

and how can I know that? You mean to say that these optimizations vanish when the function accesses the Fk arrays?

As an example:

julia> f(a) = @inbounds (a[1]; a[2]; a[3]; a[4]; a[5])
f (generic function with 1 method)

julia> @code_llvm f(rand(5))
;  @ REPL[8]:1 within `f'
define double @julia_f_247({}* nonnull align 16 dereferenceable(40) %0) {
top:
; ┌ @ array.jl:801 within `getindex'
   %1 = bitcast {}* %0 to double**
   %2 = load double*, double** %1, align 8
   %3 = getelementptr inbounds double, double* %2, i64 4
   %4 = load double, double* %3, align 8
; └
  ret double %4
}

The source code includes 5 loads from a but only the last one is used for the result. So the four others are optimized. away. If the function would use the results from the other loads, then it couldn’t optimize those away.

Note that this is just a theory for the result you are seeing.

1 Like

And use @device_code_llvm or @device_code_ptx to inspect generated GPU code.

1 Like

You can see there’s no computation happening if you leave out the writes.

No (atomic) writes
; PTX CompilerJob of kernel Fdrift_CUDA_kernel_xy(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, Int64, Int64, Float32, Int64, CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Flo
at32, 1}, CuDeviceMatrix{Float32, 1}) for sm_75
define ptx_kernel void @_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE({ [2 x i64], i8 addrspace(1)* } %0, { [
2 x i64], i8 addrspace(1)* } %1, i64 signext %2, i64 signext %3, float %4, i64 signext %5, { [1 x i64], i8 addrspace(1)* } %6, { [1 x i64], i8 addrspace(1)* } %7, { [2 x i64], i8 addrspace(1)* } %8, { [2 x i64], i8 addrspace(1)* } %9, { [
2 x i64], i8 addrspace(1)* } %10) local_unnamed_addr {
entry:
%.fca.0.0.extract28 = extractvalue { [2 x i64], i8 addrspace(1)* } %1, 0, 0
%.fca.0.0.extract20 = extractvalue { [1 x i64], i8 addrspace(1)* } %6, 0, 0
%.fca.0.0.extract13 = extractvalue { [1 x i64], i8 addrspace(1)* } %7, 0, 0
%11 = icmp slt i64 %.fca.0.0.extract20, 2
br i1 %11, label %L14.i, label %L35.i

L14.i:                                            ; preds = %entry
call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
call void asm sideeffect "exit;", ""() #3
%.not57 = icmp eq i64 %.fca.0.0.extract20, 1
br i1 %.not57, label %L35.i, label %L36.i

L35.i:                                            ; preds = %L36.i, %L14.i, %entry
%12 = icmp slt i64 %.fca.0.0.extract13, 2
br i1 %12, label %L57.i, label %L78.i

L36.i:                                            ; preds = %L14.i
call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
call void asm sideeffect "exit;", ""() #3
br label %L35.i

L57.i:                                            ; preds = %L35.i
call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
call void asm sideeffect "exit;", ""() #3
%.not56 = icmp eq i64 %.fca.0.0.extract13, 1
br i1 %.not56, label %L78.i, label %L79.i

L78.i:                                            ; preds = %L79.i, %L57.i, %L35.i
%13 = add i64 %2, -1
%14 = mul i64 %13, %2
%15 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
%16 = zext i32 %15 to i64
%17 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
%18 = zext i32 %17 to i64
%19 = mul nuw nsw i64 %18, %16
%20 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
%narrow = add nuw nsw i32 %20, 1
%21 = zext i32 %narrow to i64
%22 = add nuw nsw i64 %19, %21
%23 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
%24 = zext i32 %23 to i64
%25 = mul nuw nsw i64 %24, %18
%26 = call fastcc i64 @julia_steprange_last_9213(i64 signext %22, i64 signext %25, i64 signext %.fca.0.0.extract28)
%.not.not = icmp sgt i64 %22, %26
br i1 %.not.not, label %_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %L169.i.preheader

L169.i.preheader:                                 ; preds = %L78.i
%.off54 = add i64 %14, 1
%27 = icmp ugt i64 %.off54, 2
br i1 %27, label %L169.i.preheader.split.us, label %pass7.i

L169.i.preheader.split.us:                        ; preds = %L169.i.preheader
%.off = add i64 %14, 3
%28 = icmp ugt i64 %.off, 1
br i1 %28, label %pass7.i.us.us, label %L169.i.us

pass7.i.us.us:                                    ; preds = %pass7.i.us.us, %L169.i.preheader.split.us
%value_phi3.i.us.us = phi i64 [ %31, %pass7.i.us.us ], [ %22, %L169.i.preheader.split.us ]
%29 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
%30 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
%.not.not55.us.us = icmp eq i64 %value_phi3.i.us.us, %26
%31 = add i64 %value_phi3.i.us.us, %25
br i1 %.not.not55.us.us, label %_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %pass7.i.us
.us                                                                                                                                                                                                                                            

L169.i.us:                                        ; preds = %pass7.i.us, %L169.i.preheader.split.us
%value_phi3.i.us = phi i64 [ %34, %pass7.i.us ], [ %22, %L169.i.preheader.split.us ]
%32 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
%33 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
%.not = icmp eq i64 %value_phi3.i.us, -9223372036854775807
br i1 %.not, label %fail6.i.us, label %pass7.i.us

fail6.i.us:                                       ; preds = %L169.i.us
call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception9 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
br label %pass7.i.us

pass7.i.us:                                       ; preds = %fail6.i.us, %L169.i.us
%.not.not55.us = icmp eq i64 %value_phi3.i.us, %26
%34 = add i64 %value_phi3.i.us, %25
br i1 %.not.not55.us, label %_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %L169.i.us

L79.i:                                            ; preds = %L57.i
call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
call void asm sideeffect "exit;", ""() #3
br label %L78.i

pass7.i:                                          ; preds = %pass7.i, %L169.i.preheader
%value_phi3.i = phi i64 [ %37, %pass7.i ], [ %22, %L169.i.preheader ]
%35 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
%36 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #3
call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception9 to i64))
call fastcc void @gpu_signal_exception()
call void asm sideeffect "exit;", ""() #3
%.not.not55 = icmp eq i64 %value_phi3.i, %26
%37 = add i64 %value_phi3.i, %25
br i1 %.not.not55, label %_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %pass7.i

_Z32julia_Fdrift_CUDA_kernel_xy_913313CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit: ; preds = %pass7.i, %pass7.i.us, %pass7.i.us.us, %L
78.i                                                                                                                                                                                                                                           
ret void
}
Original code
; PTX CompilerJob of kernel Fdrift_CUDA_kernel_xy(CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, Int64, Int64, Float32, Int64, CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}) for sm_75
define ptx_kernel void @_Z32julia_Fdrift_CUDA_kernel_xy_894113CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE({ [2 x i64], i8 addrspace(1)* } %0, { [2 x i64], i8 addrspace(1)* } %1, i64 signext %2, i64 signext %3, float %4, i64 signext %5, { [1 x i64], i8 addrspace(1)* } %6, { [1 x i64], i8 addrspace(1)* } %7, { [2 x i64], i8 addrspace(1)* } %8, { [2 x i64], i8 addrspace(1)* } %9, { [2 x i64], i8 addrspace(1)* } %10) local_unnamed_addr {
entry:
  %.fca.0.0.extract58 = extractvalue { [2 x i64], i8 addrspace(1)* } %0, 0, 0
  %.fca.1.extract60 = extractvalue { [2 x i64], i8 addrspace(1)* } %0, 1
  %.fca.0.0.extract50 = extractvalue { [2 x i64], i8 addrspace(1)* } %1, 0, 0
  %.fca.1.extract52 = extractvalue { [2 x i64], i8 addrspace(1)* } %1, 1
  %.fca.0.0.extract42 = extractvalue { [1 x i64], i8 addrspace(1)* } %6, 0, 0
  %.fca.1.extract43 = extractvalue { [1 x i64], i8 addrspace(1)* } %6, 1
  %.fca.0.0.extract35 = extractvalue { [1 x i64], i8 addrspace(1)* } %7, 0, 0
  %.fca.1.extract36 = extractvalue { [1 x i64], i8 addrspace(1)* } %7, 1
  %.fca.0.0.extract22 = extractvalue { [2 x i64], i8 addrspace(1)* } %8, 0, 0
  %.fca.0.1.extract23 = extractvalue { [2 x i64], i8 addrspace(1)* } %8, 0, 1
  %.fca.1.extract24 = extractvalue { [2 x i64], i8 addrspace(1)* } %8, 1
  %.fca.0.0.extract9 = extractvalue { [2 x i64], i8 addrspace(1)* } %9, 0, 0
  %.fca.1.extract11 = extractvalue { [2 x i64], i8 addrspace(1)* } %9, 1
  %.fca.0.0.extract = extractvalue { [2 x i64], i8 addrspace(1)* } %10, 0, 0
  %.fca.1.extract = extractvalue { [2 x i64], i8 addrspace(1)* } %10, 1
  %11 = icmp slt i64 %.fca.0.0.extract42, 2
  br i1 %11, label %L14.i, label %L14.i.thread

L14.i.thread:                                     ; preds = %entry
  %12 = getelementptr inbounds i8, i8 addrspace(1)* %.fca.1.extract43, i64 4
  %13 = bitcast i8 addrspace(1)* %12 to float addrspace(1)*
  %14 = load float, float addrspace(1)* %13, align 4
  br label %L35.i

L14.i:                                            ; preds = %entry
  call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  call void asm sideeffect "exit;", ""() #4
  %15 = getelementptr inbounds i8, i8 addrspace(1)* %.fca.1.extract43, i64 4
  %16 = bitcast i8 addrspace(1)* %15 to float addrspace(1)*
  %17 = load float, float addrspace(1)* %16, align 4
  %.not84 = icmp eq i64 %.fca.0.0.extract42, 1
  br i1 %.not84, label %L35.i, label %L36.i

L35.i:                                            ; preds = %L36.i, %L14.i, %L14.i.thread
  %18 = phi float [ %14, %L14.i.thread ], [ %17, %L36.i ], [ %17, %L14.i ]
  %19 = bitcast i8 addrspace(1)* %.fca.1.extract43 to float addrspace(1)*
  %20 = load float, float addrspace(1)* %19, align 4
  %21 = fsub float %18, %20
  %22 = icmp slt i64 %.fca.0.0.extract35, 2
  br i1 %22, label %L57.i, label %L57.i.thread

L57.i.thread:                                     ; preds = %L35.i
  %23 = getelementptr inbounds i8, i8 addrspace(1)* %.fca.1.extract36, i64 4
  %24 = bitcast i8 addrspace(1)* %23 to float addrspace(1)*
  %25 = load float, float addrspace(1)* %24, align 4
  br label %L78.i

L36.i:                                            ; preds = %L14.i
  call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  call void asm sideeffect "exit;", ""() #4
  br label %L35.i

L57.i:                                            ; preds = %L35.i
  call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  call void asm sideeffect "exit;", ""() #4
  %26 = getelementptr inbounds i8, i8 addrspace(1)* %.fca.1.extract36, i64 4
  %27 = bitcast i8 addrspace(1)* %26 to float addrspace(1)*
  %28 = load float, float addrspace(1)* %27, align 4
  %.not = icmp eq i64 %.fca.0.0.extract35, 1
  br i1 %.not, label %L78.i, label %L79.i

L78.i:                                            ; preds = %L79.i, %L57.i, %L57.i.thread
  %29 = phi float [ %25, %L57.i.thread ], [ %28, %L79.i ], [ %28, %L57.i ]
  %30 = bitcast i8 addrspace(1)* %.fca.1.extract36 to float addrspace(1)*
  %31 = load float, float addrspace(1)* %30, align 4
  %32 = fsub float %29, %31
  %33 = add i64 %2, -1
  %34 = mul i64 %33, %2
  %35 = sdiv i64 %34, 2
  %36 = shl i64 %2, 1
  %37 = or i64 %36, 1
  %38 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %39 = zext i32 %38 to i64
  %40 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %41 = zext i32 %40 to i64
  %42 = mul nuw nsw i64 %41, %39
  %43 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %narrow = add nuw nsw i32 %43, 1
  %44 = zext i32 %narrow to i64
  %45 = add nuw nsw i64 %42, %44
  %46 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  %47 = zext i32 %46 to i64
  %48 = mul nuw nsw i64 %47, %41
  %49 = call fastcc i64 @julia_steprange_last_9027(i64 signext %45, i64 signext %48, i64 signext %.fca.0.0.extract50)
  %.not.not = icmp sgt i64 %45, %49
  br i1 %.not.not, label %_Z32julia_Fdrift_CUDA_kernel_xy_894113CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %L169.i.preheader

L169.i.preheader:                                 ; preds = %L78.i
  %50 = bitcast i8 addrspace(1)* %.fca.1.extract52 to float addrspace(1)*
  %51 = icmp sgt i64 %.fca.0.0.extract50, 0
  %52 = select i1 %51, i64 %.fca.0.0.extract50, i64 0
  %53 = icmp sgt i64 %.fca.0.0.extract22, 0
  %54 = select i1 %53, i64 %.fca.0.0.extract22, i64 0
  %55 = bitcast i8 addrspace(1)* %.fca.1.extract24 to float addrspace(1)*
  %56 = icmp sgt i64 %.fca.0.0.extract9, 0
  %57 = select i1 %56, i64 %.fca.0.0.extract9, i64 0
  %58 = bitcast i8 addrspace(1)* %.fca.1.extract11 to float addrspace(1)*
  %59 = icmp sgt i64 %.fca.0.0.extract, 0
  %60 = select i1 %59, i64 %.fca.0.0.extract, i64 0
  %61 = bitcast i8 addrspace(1)* %.fca.1.extract to float addrspace(1)*
  %.off = add i64 %34, 3
  %62 = icmp ugt i64 %.off, 1
  %.off80 = add i64 %34, 1
  %63 = icmp ugt i64 %.off80, 2
  %64 = mul i64 %37, %37
  %.neg82 = add i64 %64, 8
  %65 = sitofp i64 %37 to double
  %66 = icmp sgt i64 %.fca.0.0.extract58, 0
  %67 = select i1 %66, i64 %.fca.0.0.extract58, i64 0
  br label %L169.i

L79.i:                                            ; preds = %L57.i
  call fastcc void @gpu_report_exception(i64 ptrtoint ([12 x i8]* @exception21 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  call void asm sideeffect "exit;", ""() #4
  br label %L78.i

L169.i:                                           ; preds = %L1347.i, %L169.i.preheader
  %value_phi3.i = phi i64 [ %195, %L1347.i ], [ %45, %L169.i.preheader ]
  %68 = add i64 %value_phi3.i, -1
  %69 = getelementptr inbounds float, float addrspace(1)* %50, i64 %68
  %70 = load float, float addrspace(1)* %69, align 4
  %71 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #4
  %.not.i = icmp eq i32 %71, 0
  br i1 %.not.i, label %74, label %72

72:                                               ; preds = %L169.i
  %73 = call float @llvm.nvvm.fabs.ftz.f(float %70) #4
  br label %__nv_fabsf.exit

74:                                               ; preds = %L169.i
  %75 = call float @llvm.fabs.f32(float %70) #4
  br label %__nv_fabsf.exit

__nv_fabsf.exit:                                  ; preds = %74, %72
  %.0.i = phi float [ %73, %72 ], [ %75, %74 ]
  %76 = add i64 %68, %52
  %77 = getelementptr inbounds float, float addrspace(1)* %50, i64 %76
  %78 = load float, float addrspace(1)* %77, align 4
  %79 = call i32 @__nvvm_reflect(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i64 0, i64 0)) #4
  %.not.i77 = icmp eq i32 %79, 0
  br i1 %.not.i77, label %82, label %80

80:                                               ; preds = %__nv_fabsf.exit
  %81 = call float @llvm.nvvm.fabs.ftz.f(float %78) #4
  br label %__nv_fabsf.exit79

82:                                               ; preds = %__nv_fabsf.exit
  %83 = call float @llvm.fabs.f32(float %78) #4
  br label %__nv_fabsf.exit79

__nv_fabsf.exit79:                                ; preds = %82, %80
  %.0.i78 = phi float [ %81, %80 ], [ %83, %82 ]
  %84 = fdiv float %.0.i, %21
  %85 = fptosi float %84 to i64
  %86 = fdiv float %.0.i78, %32
  %87 = fptosi float %86 to i64
  %88 = icmp sge i64 %.fca.0.0.extract22, %85
  %89 = icmp sge i64 %.fca.0.1.extract23, %87
  %90 = and i1 %88, %89
  br i1 %90, label %L420.i, label %L1141.i

L420.i:                                           ; preds = %__nv_fabsf.exit79
  %91 = add i64 %87, -1
  %92 = mul i64 %91, %54
  %93 = add i64 %85, -1
  %94 = add i64 %93, %92
  %95 = getelementptr inbounds float, float addrspace(1)* %55, i64 %94
  %96 = load float, float addrspace(1)* %95, align 4
  %97 = add i64 %92, %85
  %98 = getelementptr inbounds float, float addrspace(1)* %55, i64 %97
  %99 = load float, float addrspace(1)* %98, align 4
  %100 = mul i64 %54, %87
  %101 = add i64 %100, %85
  %102 = getelementptr inbounds float, float addrspace(1)* %55, i64 %101
  %103 = load float, float addrspace(1)* %102, align 4
  %104 = mul i64 %91, %57
  %105 = add i64 %93, %104
  %106 = getelementptr inbounds float, float addrspace(1)* %58, i64 %105
  %107 = load float, float addrspace(1)* %106, align 4
  %108 = add i64 %104, %85
  %109 = getelementptr inbounds float, float addrspace(1)* %58, i64 %108
  %110 = load float, float addrspace(1)* %109, align 4
  %111 = mul i64 %57, %87
  %112 = add i64 %111, %85
  %113 = getelementptr inbounds float, float addrspace(1)* %58, i64 %112
  %114 = load float, float addrspace(1)* %113, align 4
  %115 = mul i64 %91, %60
  %116 = add i64 %93, %115
  %117 = getelementptr inbounds float, float addrspace(1)* %61, i64 %116
  %118 = load float, float addrspace(1)* %117, align 4
  %119 = add i64 %115, %85
  %120 = getelementptr inbounds float, float addrspace(1)* %61, i64 %119
  %121 = load float, float addrspace(1)* %120, align 4
  %122 = mul i64 %60, %87
  %123 = add i64 %122, %85
  %124 = getelementptr inbounds float, float addrspace(1)* %61, i64 %123
  %125 = load float, float addrspace(1)* %124, align 4
  br label %L1141.i

L1141.i:                                          ; preds = %L420.i, %__nv_fabsf.exit79
  %value_phi6.i = phi float [ %125, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi7.i = phi float [ %121, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi8.i = phi float [ %118, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi9.i = phi float [ %114, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi10.i = phi float [ %110, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi11.i = phi float [ %107, %L420.i ], [ 0.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi12.i = phi float [ %103, %L420.i ], [ 1.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi13.i = phi float [ %99, %L420.i ], [ 1.000000e+00, %__nv_fabsf.exit79 ]
  %value_phi14.i = phi float [ %96, %L420.i ], [ 1.000000e+00, %__nv_fabsf.exit79 ]
  %126 = sitofp i64 %85 to float
  %127 = fsub float %84, %126
  %128 = sitofp i64 %87 to float
  %129 = fsub float %86, %128
  %130 = fsub float 1.000000e+00, %127
  %131 = fsub float 1.000000e+00, %129
  %132 = fmul float %130, %value_phi14.i
  %133 = fmul float %131, %132
  %134 = fmul float %127, %value_phi13.i
  %135 = fmul float %131, %134
  %136 = fmul float %130, %value_phi13.i
  %137 = fmul float %129, %136
  %138 = fmul float %127, %value_phi12.i
  %139 = fmul float %129, %138
  %140 = fadd float %135, %133
  %141 = fadd float %137, %140
  %142 = fadd float %139, %141
  %143 = fmul float %130, %value_phi11.i
  %144 = fmul float %131, %143
  %145 = fmul float %127, %value_phi10.i
  %146 = fmul float %131, %145
  %147 = fmul float %130, %value_phi10.i
  %148 = fmul float %129, %147
  %149 = fmul float %127, %value_phi9.i
  %150 = fmul float %129, %149
  %151 = fadd float %146, %144
  %152 = fadd float %148, %151
  %153 = fadd float %150, %152
  %154 = fmul float %130, %value_phi8.i
  %155 = fmul float %131, %154
  %156 = fmul float %127, %value_phi7.i
  %157 = fmul float %131, %156
  %158 = fmul float %130, %value_phi7.i
  %159 = fmul float %129, %158
  %160 = fmul float %127, %value_phi6.i
  %161 = fmul float %129, %160
  %162 = fadd float %157, %155
  %163 = fadd float %159, %162
  %164 = fadd float %161, %163
  %165 = icmp ne i64 %68, -9223372036854775808
  %166 = or i1 %62, %165
  %167 = and i1 %63, %166
  br i1 %167, label %pass16.i, label %fail15.i

L1277.i:                                          ; preds = %pass16.i, %L1277.i
  %value_phi19.i = phi float [ %225, %pass16.i ], [ %173, %L1277.i ]
  %168 = fsub float %value_phi19.i, %210
  %169 = bitcast float %value_phi19.i to i32
  %170 = bitcast float %168 to i32
  %171 = cmpxchg i32 addrspace(1)* %226, i32 %169, i32 %170 acq_rel acquire
  %172 = extractvalue { i32, i1 } %171, 0
  %173 = bitcast i32 %172 to float
  %174 = fcmp une float %value_phi19.i, %173
  br i1 %174, label %L1277.i, label %L1288.i

L1288.i:                                          ; preds = %L1277.i
  %175 = fdiv float %211, %142
  %176 = add i64 %213, %67
  %177 = shl i64 %176, 2
  %178 = add i64 %177, -4
  %179 = getelementptr i8, i8 addrspace(1)* %.fca.1.extract60, i64 %178
  %ptr.i59.i = bitcast i8 addrspace(1)* %179 to float addrspace(1)*
  %180 = atomicrmw fadd float addrspace(1)* %ptr.i59.i, float %175 seq_cst
  %181 = add i64 %216, %67
  %182 = shl i64 %181, 2
  %183 = add i64 %182, -4
  %184 = getelementptr i8, i8 addrspace(1)* %.fca.1.extract60, i64 %183
  %185 = bitcast i8 addrspace(1)* %184 to float addrspace(1)*
  %186 = load float, float addrspace(1)* %185, align 1
  %187 = bitcast i8 addrspace(1)* %184 to i32 addrspace(1)*
  br label %L1336.i

L1336.i:                                          ; preds = %L1336.i, %L1288.i
  %value_phi20.i = phi float [ %186, %L1288.i ], [ %193, %L1336.i ]
  %188 = fsub float %value_phi20.i, %175
  %189 = bitcast float %value_phi20.i to i32
  %190 = bitcast float %188 to i32
  %191 = cmpxchg i32 addrspace(1)* %187, i32 %189, i32 %190 acq_rel acquire
  %192 = extractvalue { i32, i1 } %191, 0
  %193 = bitcast i32 %192 to float
  %194 = fcmp une float %value_phi20.i, %193
  br i1 %194, label %L1336.i, label %L1347.i

L1347.i:                                          ; preds = %L1336.i
  %.not.not81 = icmp eq i64 %value_phi3.i, %49
  %195 = add i64 %value_phi3.i, %48
  br i1 %.not.not81, label %_Z32julia_Fdrift_CUDA_kernel_xy_894113CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit, label %L169.i

fail15.i:                                         ; preds = %L1141.i
  call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception9 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  br label %pass16.i

pass16.i:                                         ; preds = %fail15.i, %L1141.i
  %196 = sdiv i64 %68, %35
  %197 = mul i64 %196, %35
  %198 = sub i64 %value_phi3.i, %197
  %199 = add i64 %198, %2
  %.neg = mul i64 %199, -8
  %200 = add i64 %.neg82, %.neg
  %201 = sitofp i64 %200 to double
  %202 = call double @llvm.sqrt.f64(double %201) #4
  %203 = fsub double %65, %202
  %204 = fmul double %203, 5.000000e-01
  %205 = fptosi double %204 to i64
  %206 = add i64 %205, -1
  %207 = mul i64 %206, %205
  %208 = sdiv i64 %207, 2
  %209 = fmul float %153, 2.000000e+00
  %210 = fdiv float %209, %142
  %211 = fmul float %164, 2.000000e+00
  %212 = mul i64 %196, %2
  %213 = add i64 %212, %205
  %reass.add = sub i64 %196, %206
  %reass.mul = mul i64 %reass.add, %2
  %214 = add i64 %198, %205
  %215 = add i64 %214, %208
  %216 = add i64 %215, %reass.mul
  %217 = shl i64 %213, 2
  %218 = add i64 %217, -4
  %219 = getelementptr i8, i8 addrspace(1)* %.fca.1.extract60, i64 %218
  %ptr.i.i = bitcast i8 addrspace(1)* %219 to float addrspace(1)*
  %220 = atomicrmw fadd float addrspace(1)* %ptr.i.i, float %210 seq_cst
  %221 = shl i64 %216, 2
  %222 = add i64 %221, -4
  %223 = getelementptr i8, i8 addrspace(1)* %.fca.1.extract60, i64 %222
  %224 = bitcast i8 addrspace(1)* %223 to float addrspace(1)*
  %225 = load float, float addrspace(1)* %224, align 1
  %226 = bitcast i8 addrspace(1)* %223 to i32 addrspace(1)*
  br label %L1277.i

_Z32julia_Fdrift_CUDA_kernel_xy_894113CuDeviceArrayI7Float32Li2ELi1EES_IS0_Li2ELi1EE5Int64S1_S0_S1_S_IS0_Li1ELi1EES_IS0_Li1ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EES_IS0_Li2ELi1EE.inner.exit: ; preds = %L1347.i, %L78.i
  ret void
}

aha… then shall I understand that my impression was wrong, in the sense that nothing is computed if not written to the array? And therefore that there is nothing else to optimize here, considering the function MUST write to the array Fk?

Correct.

Why would that be the case? The current implementation can always be suboptimal.

1 Like

On my system, for example, the original kernel takes 600us. Using the occupancy API to compute a better launch configuration (since you’re using a grid-stride loop a nyway) brings that down to 550 or so. But removing @atomic makes it drop to 190us, which means you can use a better way to accumulate that data. Since the written values aren’t read by any other thread, you could maybe accumulate into shared memory, or use warp shuffle, but that depends on how the indices are computed.

you’re surely right -though I don’t know how to do that. Is there ant doc/tutorial where I can learn how to use shared memory or warp shuffle in julia CUDA?
Sometimes I find it a little bit frustrating that I can not find good resources where to learn all these things from :confused:
Thanks,
Ferran.

It’s a pattern that happens a lot, e.g., with reductions: Chapter 39. Parallel Prefix Sum (Scan) with CUDA and Faster Parallel Reductions on Kepler | NVIDIA Developer Blog. Generally you want to avoid global memory accesses as you’re doing for every thread. You can buffer results in shared memory (which is local to the block) or even communicate directly between threads in a warp (groups of 32 threads), but that depends on your algorithm and whether it’s even possible for threads to cooperate like that.

3 Likes