Parallelization on GPU slower than on CPU...?

Those are some impressive numbers on the 3950X!

Note that even though the two arrays take up only 16 MiB (2^21 elements × 4 bytes × 2 arrays / 2^20 bytes per MiB = 2^4 MiB), the computation is memory bound.
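
A quick sanity check on that footprint (assuming two Float32 vectors of length 2^21, which is what the 4 bytes per element implies):

julia> 2 * sizeof(Float32) * 2^21 / 2^20  # two arrays, 4 bytes per element, in MiB
16.0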

julia> N = 2^21
2097152

julia> flops = 10^6 * N / 44.499  # one flop per element over the 44.499 µs (parallel) timing
4.712807029371446e10

I don’t know what clock speed your CPU runs at under an all-core load, so I’ll pick 4 GHz:

julia> Hz = 4e9; fma_per_clock = 2; flop_per_fma = 16; cores = 16;  # 2 × 256-bit FMAs/clock, 8 Float32 × 2 flops each

julia> Hz * fma_per_clock * flop_per_fma * cores
2.048e12

julia> ans / flops
43.4560546875

Your CPU was mostly sitting idle, waiting for data. For every nanosecond it spent computing, it spent more than 40 doing nothing.
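
To put that in memory terms, here’s a rough guess, assuming the kernel reads one array and writes the other exactly once per evaluation (8 bytes of traffic per element):

julia> round(8 * 2^21 / 44.499e-6 / 1e9, digits = 1)  # assumed bytes moved / time, in GB/s
377.0

If that guess is close, hundreds of GB/s is more than DRAM typically sustains, so the 16 MiB working set was presumably being served largely out of cache, and the cores still outran it.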

For comparison, on my 10980XE, my sequential and parallel times were 705 and 58 microseconds, a parallel speedup of roughly 12× on 18 cores.
Thus, my numbers are

julia> Hz = 4.1e9; fma_per_clock = 2; flop_per_fma = 32; cores = 18;  # 2 × 512-bit FMAs/clock, 16 Float32 × 2 flops each

julia> Hz * fma_per_clock * flop_per_fma * cores
4.7232e12

julia> ans / (10^6 * N / 58)  # peak flops / achieved flops at 58 µs
130.62744140625

Yikes. My ratio was about 130.
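
The same traffic estimate for my 58 µs time, under the same one-read-one-write assumption:

julia> round(8 * 2^21 / 58e-6 / 1e9, digits = 1)  # assumed bytes moved / time, in GB/s
289.3

Both machines were moving a few hundred GB/s and still leaving most of their FMA throughput idle.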

I don’t know much about GPU computing, but I bet you couldn’t bring its number-crunching power to bear. Longer vectors would just make the memory problems worse.

I also don’t know enough yet about memory to say anything about TLB misses vs. memory bandwidth, but I’ll start looking into that sort of thing one day.

For memory-bound operations, memory performance dominates, not peak flops. Regardless of the reason, the Ryzen 3950X looks amazing here.
