I built a new PC and I’ve been toying around with some code just to see how it performs and I’m trying to understand why the following code (from the CUDA.jl docs) runs slower on my GPU than on my CPU (I have a Ryzen 9 3950x CPU (16 core/32 thread) and an RTX 2080 Super GPU):
using BenchmarkTools
using CuArrays
using Test
N = 2^21
x = fill(1.0f0, N) # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N) # a vector filled with 2.0
y .+= x
function sequential_add!(y, x)
for i in eachindex(y, x)
@inbounds y[i] += x[i]
end
return nothing
end
fill!(y, 2)
sequential_add!(y, x)
@test all(y .== 3.0f0)
function parallel_add!(y, x)
Threads.@threads for i in eachindex(y, x)
@inbounds y[i] += x[i]
end
return nothing
end
fill!(y, 2)
parallel_add!(y, x)
@test all(y .== 3.0f0)
# Parallelizaton on the GPU
x_d = CuArrays.fill(1.0f0, N) # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N) # a vector stored on the GPU filled with 2.0
y_d .+= x_d
@test all(Array(y_d) .== 3.0f0)
function add_broadcast!(y, x)
CuArrays.@sync y .+= x
return
end
The results are here:
julia> @btime sequential_add!($y, $x)
254.301 μs (0 allocations: 0 bytes)
julia> @btime parallel_add!($y, $x)
44.499 μs (114 allocations: 13.67 KiB)
julia> @btime add_broadcast!($y_d, $x_d)
106.000 μs (56 allocations: 2.22 KiB)
As you can see, the CPU crushes the GPU with this computation (I love my new CPU )
Lastly, for a 16 core, 32 thread CPU, is it okay to set JULIA_NUM_THREADS
to 32, or should it equal the number of physical cores? I currently have it set at 16.