Parallelization on GPU slower than on CPU...?

Those are some impressive numbers on the 3950X!

Note that even though the two arrays take up only 16 MiB (2^21 elements × 4 bytes × 2 arrays / 2^20 bytes per MiB = 2^4 MiB), the computation is memory bound.
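
A quick sanity check on that footprint (assuming two Float32 vectors of length 2^21, which is what the 4 bytes per element implies):

julia> 2 * sizeof(Float32) * 2^21 / 2^20  # two arrays, 4 bytes per element, in MiB
16.0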

julia> N = 2^21
2097152

julia> flops = 10^6 * N / 44.499  # one flop per element over the 44.499 µs (parallel) timing
4.712807029371446e10

I don’t know what clock speed your CPU runs at under an all-core load, so I’ll pick 4 GHz:

julia> Hz = 4e9; fma_per_clock = 2; flop_per_fma = 16; cores = 16;  # 2 × 256-bit FMAs/clock, 8 Float32 × 2 flops each

julia> Hz * fma_per_clock * flop_per_fma * cores
2.048e12

julia> ans / flops
43.4560546875

Your CPU was mostly sitting idle, waiting for data. For every nanosecond it spent computing, it spent more than 40 doing nothing.
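
To put that in memory terms, here’s a rough guess, assuming the kernel reads one array and writes the other exactly once per evaluation (8 bytes of traffic per element):

julia> round(8 * 2^21 / 44.499e-6 / 1e9, digits = 1)  # assumed bytes moved / time, in GB/s
377.0

If that guess is close, hundreds of GB/s is more than DRAM typically sustains, so the 16 MiB working set was presumably being served largely out of cache, and the cores still outran it.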

For comparison, on my 10980XE, my sequential and parallel times were 705 and 58 microseconds, a parallel speedup of roughly 12× on 18 cores.
Thus, my numbers are

julia> Hz = 4.1e9; fma_per_clock = 2; flop_per_fma = 32; cores = 18;  # 2 × 512-bit FMAs/clock, 16 Float32 × 2 flops each

julia> Hz * fma_per_clock * flop_per_fma * cores
4.7232e12

julia> ans / (10^6 * N / 58)  # peak flops / achieved flops at 58 µs
130.62744140625

Yikes. My ratio was about 130.
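
The same traffic estimate for my 58 µs time, under the same one-read-one-write assumption:

julia> round(8 * 2^21 / 58e-6 / 1e9, digits = 1)  # assumed bytes moved / time, in GB/s
289.3

Both machines were moving a few hundred GB/s and still leaving most of their FMA throughput idle.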

I don’t know much about GPU computing, but I bet you couldn’t bring its number-crunching power to bear. Longer vectors would just make the memory problems worse.

I also don’t know enough yet about memory to say anything about TLB misses vs. memory bandwidth, but I’ll start looking into that sort of thing one day.

For memory-bound operations, memory performance dominates, not peak flops. Regardless of the reason, the Ryzen 3950X looks amazing here.
