Parallelization on GPU slower than on CPU...?

I built a new PC and have been toying around with some code to see how it performs. I’m trying to understand why the following code (from the CUDA.jl docs) runs slower on my GPU than on my CPU. I have a Ryzen 9 3950X CPU (16 cores/32 threads) and an RTX 2080 Super GPU:

using BenchmarkTools
using CuArrays
using Test

N = 2^21
x = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
y = fill(2.0f0, N)  # a vector filled with 2.0
y .+= x   

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
sequential_add!(y, x)
@test all(y .== 3.0f0)

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y, 2)
parallel_add!(y, x)
@test all(y .== 3.0f0)

# Parallelization on the GPU
x_d = CuArrays.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CuArrays.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0
y_d .+= x_d
@test all(Array(y_d) .== 3.0f0)

function add_broadcast!(y, x)
    CuArrays.@sync y .+= x
    return
end

The results are here:

julia> @btime sequential_add!($y, $x)
  254.301 μs (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  44.499 μs (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  106.000 μs (56 allocations: 2.22 KiB)

As you can see, the CPU crushes the GPU with this computation (I love my new CPU :smiling_face_with_three_hearts:)

Lastly, for a 16 core, 32 thread CPU, is it okay to set JULIA_NUM_THREADS to 32, or should it equal the number of physical cores? I currently have it set at 16.

Are the results the same if you further increase the length of the vectors?

I would keep it at 16. The hyperthreads do not have their own cache memory and would mostly compete for resources with the native threads.
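
To confirm what a session is actually running with, a minimal check along these lines may help. It only uses `Threads.nthreads()` and `Sys.CPU_THREADS` from Base; the variable names are just for illustration.

```julia
# Threads this Julia session was started with (controlled by JULIA_NUM_THREADS)
julia_threads = Threads.nthreads()

# Logical CPUs reported by the system; on an SMT machine this counts hyperthreads too
logical_cpus = Sys.CPU_THREADS

println("Julia threads: ", julia_threads)
println("Logical CPUs:  ", logical_cpus)
```

On a 16-core/32-thread CPU the second number would be expected to be 32, so running with 16 Julia threads corresponds to roughly one thread per physical core.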

Are the results the same if you further increase the length of the vectors?

I tried with N = 2^20, N = 2^21, N = 2^22 and N = 2^23 and parallel_add! was faster than add_broadcast!. However, at N = 2^27, add_broadcast! is much faster:

julia> @btime sequential_add!($y, $x)
  60.774 ms (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)
  57.521 ms (114 allocations: 13.67 KiB)

julia> @btime add_broadcast!($y_d, $x_d)
  3.745 ms (56 allocations: 2.22 KiB)

So, to decide whether or not it’s worth doing something on the GPU, is the best way trial-and-error, or is there some sort of rule of thumb to go by?

Thanks :grinning:
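
One rough way to turn the two add_broadcast! timings above into a rule of thumb is to assume the GPU time is a fixed per-call cost plus a cost per element, and solve for both from the two data points. This is only a sketch: the linear model and the 12 bytes of traffic per element (read x[i], read y[i], write y[i]) are assumptions, not measurements.

```julia
# add_broadcast! timings reported above, in microseconds
N1, t1 = 2^21, 106.0
N2, t2 = 2^27, 3745.0

# Assumed model: t(N) = overhead + N * per_element
per_element = (t2 - t1) / (N2 - N1)   # μs per element
overhead    = t1 - per_element * N1   # μs of fixed cost per call

# Assumed traffic per element: two Float32 reads and one Float32 write
bytes_per_element = 3 * sizeof(Float32)
gpu_bandwidth_GBps = bytes_per_element / per_element / 1e3   # bytes/μs -> GB/s

println("fixed overhead per call ≈ ", round(overhead, digits = 1), " μs")
println("effective GPU bandwidth ≈ ", round(gpu_bandwidth_GBps, digits = 1), " GB/s")
```

Under these assumptions the fixed cost comes out at about 48 μs, which is already more than the 44.499 μs that parallel_add! needs for the whole N = 2^21 problem. That would explain why the GPU only wins once the arrays are large enough to amortize it.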

I can imagine that it varies a lot with what you do with the array. Try exp and I bet that the GPU will be faster much earlier.
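
A CPU-side sketch of that experiment, mirroring parallel_add! but doing more arithmetic per element. The name parallel_exp! is hypothetical and not from the docs; the GPU counterpart would presumably be the same broadcast pattern as add_broadcast! with exp applied elementwise, which is not shown here.

```julia
# Same loop shape as parallel_add!, but each element now costs an exp call
# instead of a single addition. parallel_exp! is a made-up name for this sketch.
function parallel_exp!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] = exp(x[i])
    end
    return nothing
end

N = 2^21
x = fill(1.0f0, N)   # input vector (Float32)
y = similar(x)       # output vector, overwritten below
parallel_exp!(y, x)
```

Timing this with @btime parallel_exp!($y, $x) next to the addition benchmark should show how much the extra work per element shifts the balance away from memory traffic.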

Those are some impressive numbers on the 3950X!

Note that even though the two arrays only take up 16 MiB (2^21 * 4 * 2 / 2^20 = 2^4), the computation is memory bound.

julia> N = 2^21
2097152

julia> flops = 10^6 * N / 44.499
4.712807029371446e10

I don’t know what clock speed your CPU runs at all-core, so I’ll pick 4 GHz:

julia> Hz = 4e9; fma_per_clock = 2; flop_per_fma = 16; cores = 16;

julia> Hz * fma_per_clock * flop_per_fma * cores
2.048e12

julia> ans / flops
43.4560546875

Your CPU was mostly sitting, waiting for data. For every nanosecond it spent computing, there were 40 doing nothing.
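
The same timings can also be read as an effective memory bandwidth. The sketch below assumes each element moves 12 bytes (two Float32 reads and one write) and ignores any extra cache traffic, so treat the results as rough estimates only.

```julia
# Bytes moved by y[i] += x[i] over N Float32 elements (assumed: 2 reads + 1 write)
bytes_moved(N) = 3 * N * sizeof(Float32)

# parallel_add! timings reported earlier in the thread, converted to seconds
t_small = 44.499e-6   # N = 2^21
t_large = 57.521e-3   # N = 2^27

bw_small = bytes_moved(2^21) / t_small / 1e9   # GB/s
bw_large = bytes_moved(2^27) / t_large / 1e9   # GB/s

println("N = 2^21: ≈ ", round(bw_small, digits = 1), " GB/s")
println("N = 2^27: ≈ ", round(bw_large, digits = 1), " GB/s")
```

The small case works out to several hundred GB/s while the large case drops to a few tens of GB/s, which is consistent with the small arrays being served from cache and the large ones from RAM.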

For comparison, on my 10980XE, my sequential and parallel times were 705 and 58 microseconds.
Thus, my numbers are

julia> Hz = 4.1e9; fma_per_clock = 2; flop_per_fma = 32; cores = 18;

julia> Hz * fma_per_clock * flop_per_fma * cores
4.7232e12

julia> ans / (10^6 * N / 58)
130.62744140625

Yikes. My ratio was about 130.

I don’t know much about GPU computing, but I bet you couldn’t bring its number-crunching power to bear. Longer vectors would just make the memory problems worse.

I also don’t know enough yet about memory to say anything about TLB misses vs memory bandwidth, but I’ll start looking into that sort of thing one day.

For memory bound operations, memory performance dominates. Regardless of the reason, the Ryzen 3950X looks amazing here.

For memory bound operations, memory performance dominates.

I kept this in mind for my build and went with DDR4-3600 RAM as well as an M.2-2280 NVME SSD :grin:. I also got inspired to start a thread for showing off Julia performance on PCs that people build/have so check it out! Thanks so much for your response!!

But because both arrays together are still only 16 MiB, RAM and SSD speed don’t matter for this specific benchmark.
The 10980XE has 18 MiB of shared L3 cache, while the 3950X has 16 MiB of L3 cache per 4 cores (for 64 MiB total).
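
As a quick sanity check on the cache argument, the working set of the benchmark can be compared with the L3 sizes quoted above. The helper working_set and the variable names are made up for this sketch, and it ignores that the AMD cache is split into 16 MiB slices.

```julia
const MiB = 2^20

# Size of x and y together, in MiB, for Float32 vectors of length N
working_set(N) = 2 * N * sizeof(Float32) / MiB

l3_intel_shared = 18   # MiB, figure quoted above for the 10980XE
l3_amd_total    = 64   # MiB total on the AMD chip, in 16 MiB slices per 4 cores

for N in (2^21, 2^27)
    ws = working_set(N)
    println("N = ", N, ": ", ws, " MiB",
            " | fits in 18 MiB: ", ws <= l3_intel_shared,
            " | fits in 64 MiB: ", ws <= l3_amd_total)
end
```

At N = 2^21 the two vectors fit in L3 on both machines, whereas at N = 2^27 they are far larger than either cache, so that run has to go out to RAM.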

If I recall correctly, on Zen 2, the clock of much of the on-CPU memory matches the RAM clock up to 3600 MHz, meaning you may have those speeds set much higher than I do.
I didn’t overclock/adjust “uncore” performance at all, and have no idea how much ground I can gain from that. Probably worth at least looking at, but whatever I do it’ll probably be mild since I don’t really want to risk crashing the computer. I wouldn’t expect to gain much ground on your benchmark performance here.

FWIW, I’m also on an M.2 SSD, but I don’t remember which at the moment, and DDR4-3200 RAM (14 CAS latency, IIRC).

You should try some more compute-heavy benchmarks in that thread. I’d add matmul, at least ;).

From what I gather reading discussions about this, the concept of a clock frequency is pretty fluid in late Ryzens, and the CPU monitors itself to keep its performance close to optimal. It is of course possible that one can improve on the defaults, but the gains seem to be rather small.

Uncore refers to:

Uncore functions include QPI controllers, L3 cache, snoop agent pipeline, on-die memory controller, and Thunderbolt controller

This (L3 cache) is probably the most important (hardware) capability being benchmarked here.

The core, by contrast, includes the execution units as well as the L1 and L2 caches.

Apparently “uncore” is an Intel term, so AMD may do things differently.
I believe that the infinity fabric clock matches the RAM clock rate up to 3600 MHz (which is an overclock; the “maximum” is 3200). After this the infinity fabric slows down to 2-to-1, making 3600 “optimal”.
However, I don’t know what all infinity fabric entails – whether it includes L3 cache performance.

I also much prefer AMD’s intelligent clock-speed algorithms.

My post is a bit stale at this point, but I did an example to understand this same question:

What (I believe) I demonstrated was that for smaller problems, the data transfer time eats away at the potential GPU speedup (as seen by the horizontal line up to a 1000x1000 matrix). Once your problem gets larger, the GPU starts to shine.

@randyzwitch Really nice post, thank you.
