Hello,
I am new to Julia and I wanted to use Float32 while debugging my code, expecting it to be faster since my GPU delivers only 1/64 of its Float32 performance in Float64.
I was surprised to see that Float32 was almost as slow as Float64, and I would like to know what I am doing wrong.
Here is one kernel I used to benchmark the performance difference. It computes a 2D velocity field from a force-balance equation.
```julia
using CUDA

block_dim = (16, 16)
grid_dim  = (63, 63)
const N::Int32 = 16 * 63
```
```julia
# Δ2 is a global constant (its definition is not shown here) acting as the
# grid-spacing factor in the finite-difference stencil.
function kernel_comp_v!(vn, v, F)
    # Indexing (i, j), with periodic wrap-around for the neighbor indices
    i::Int32 = (blockIdx().x - 1f0) * blockDim().x + threadIdx().x
    j::Int32 = (blockIdx().y - 1f0) * blockDim().y + threadIdx().y
    i_::Int32 = i == 1f0 ? N : i - 1f0; ip::Int32 = i == N ? 1f0 : i + 1f0
    jm::Int32 = j == 1f0 ? N : j - 1f0; jp::Int32 = j == N ? 1f0 : j + 1f0
    @inbounds begin
        vn[i,j,1] = 1/(1+6/Δ2)*(F[i,j,1] + ( 2*(v[i_,j,1]+v[ip,j,1]) + v[i,jm,1]+v[i,jp,1] + 0.25*(v[ip,jp,2]+v[i_,jm,2]-v[i_,jp,2]-v[ip,jm,2]) )/Δ2)
        vn[i,j,2] = 1/(1+6/Δ2)*(F[i,j,2] + ( 2*(v[i,jm,2]+v[i,jp,2]) + v[i_,j,2]+v[ip,j,2] + 0.25*(v[ip,jp,1]+v[i_,jm,1]-v[i_,jp,1]-v[ip,jm,1]) )/Δ2)
    end
    return nothing
end
```
For Float64 I do:

```julia
using BenchmarkTools

v      = CUDA.zeros(Float64, N, N, 2)
v_temp = CUDA.zeros(Float64, N, N, 2)
F      = CUDA.zeros(Float64, N, N, 2)
F[:,:,1] .= 1.0

comp_v! = @cuda launch=false kernel_comp_v!(v_temp, v, F)
@benchmark CUDA.@sync for _ = 1:10
    comp_v!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
    comp_v!($v, $v_temp, $F; threads = block_dim, blocks = grid_dim)
end
```
For Float32 I just replace every 64 with 32 after restarting Julia.
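Concretely, the Float32 setup becomes (same code as above, only the element type changed):

```julia
v      = CUDA.zeros(Float32, N, N, 2)
v_temp = CUDA.zeros(Float32, N, N, 2)
F      = CUDA.zeros(Float32, N, N, 2)
F[:,:,1] .= 1.0   # the Float64 literal is converted to Float32 on assignment
```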
For both Float64 and Float32 I get approximately 9.5 ± 0.5 ms.
I really don’t understand what I am doing wrong.
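If it helps, I believe I can also dump the compiled PTX to check which precision the kernel actually computes in (a sketch, assuming the Float32 arrays from above are in scope):

```julia
# Print the PTX generated for this launch; any .f64 instructions in the
# output would mean double-precision arithmetic sneaks into the
# supposedly Float32 version of the kernel.
CUDA.@device_code_ptx @cuda threads=block_dim blocks=grid_dim kernel_comp_v!(v_temp, v, F)
```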
Thank you in advance for your help.
Best regards.