I built a new PC and have been toying around with some code to see how it performs. I'm trying to understand why the following code (from the CUDA.jl docs) runs slower on my GPU than on my CPU. My hardware: a Ryzen 9 3950X CPU (16 cores/32 threads) and an RTX 2080 Super GPU:
```julia
using BenchmarkTools
using CuArrays
using Test

N = 2^21
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0

y .+= x

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
sequential_add!(y, x)
@test all(y .== 3.0f0)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
parallel_add!(y, x)
@test all(y .== 3.0f0)

# Parallelization on the GPU
x_d = CuArrays.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0

y_d .+= x_d
@test all(Array(y_d) .== 3.0f0)

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end
```
Here are the results:
```julia
julia> @btime sequential_add!($y, $x)
  254.301 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  44.499 μs (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  106.000 μs (56 allocations: 2.22 KiB)
```
As you can see, the CPU crushes the GPU on this computation (I love my new CPU!).
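My guess is that this broadcast is memory-bound and dominated by kernel-launch overhead at this array size, so I sketched a quick sweep over N to see where (or whether) the GPU overtakes the CPU. This reuses `sequential_add!` and `add_broadcast!` as defined above and `@belapsed` from BenchmarkTools; the sizes are just ones I picked:

```julia
# Rough sketch: compare CPU vs. GPU timings at a few array sizes.
# Assumes the function definitions from the snippet above are loaded.
for n in (2^16, 2^21, 2^24)
    xc, yc = fill(1.0f0, n), fill(2.0f0, n)
    xg, yg = CuArrays.fill(1.0f0, n), CuArrays.fill(2.0f0, n)
    t_cpu = @belapsed sequential_add!($yc, $xc)  # seconds
    t_gpu = @belapsed add_broadcast!($yg, $xg)   # seconds, includes sync
    println("N = $n: CPU $(round(t_cpu * 1e6, digits=1)) μs, ",
            "GPU $(round(t_gpu * 1e6, digits=1)) μs")
end
```

I haven't run this exact loop yet, so treat it as a sketch rather than results.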
Lastly, for a 16-core/32-thread CPU, is it okay to set `JULIA_NUM_THREADS` to 32, or should it equal the number of physical cores? I currently have it set to 16.
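For reference, this is how I'm confirming what the session actually picked up (as I understand it, `JULIA_NUM_THREADS` has to be set before Julia starts and can't be changed at runtime):

```julia
# Report how many threads this Julia session was started with.
println(Threads.nthreads())
# The env var must be set before launching, e.g. from the shell:
#   JULIA_NUM_THREADS=32 julia
```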