Those are some impressive numbers on the 3950X!
Note that even though the two arrays only take up 16 MiB (2^21 * 4 * 2 / 2^20 = 2^4), the computation is memory bound.
julia> N = 2^21
2097152
julia> flops = 10^6 * N / 44.499
4.712807029371446e10
I don’t know what clock speed your CPU runs at all-core, so I’ll pick 4 GHz:
julia> Hz = 4e9; fma_per_clock = 2; flop_per_fma = 16; cores = 16;
julia> Hz * fma_per_clock * flop_per_fma * cores
2.048e12
julia> ans / flops
43.4560546875
Your CPU was mostly sitting idle, waiting for data. For every nanosecond it spent computing, it spent more than 40 doing nothing.
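To put a rough number on the data rate that timing implies, here is a minimal sketch. It assumes each of the two arrays is traversed exactly once and that the elements are 4 bytes wide (as the 16 MiB figure above suggests); the variable names are just for illustration.

```julia
# Rough effective data rate for the parallel run.
# Assumptions: each element is touched exactly once, and elements are 4 bytes.
N = 2^21                 # elements per array
bytes_per_element = 4    # assumed element size in bytes
arrays = 2               # number of arrays
time_us = 44.499         # parallel time in microseconds, as used above

bytes_moved = N * bytes_per_element * arrays        # 2^24 bytes = 16 MiB
bytes_per_second = bytes_moved / (time_us * 1e-6)   # convert microseconds to seconds
gb_per_second = bytes_per_second / 1e9              # roughly 377 GB/s
```

If the kernel also writes results back, the real traffic would be higher than this estimate.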
For comparison, on my 10980XE, my sequential and parallel times were 705 and 58 microseconds.
Thus, my numbers are
julia> Hz = 4.1e9; fma_per_clock = 2; flop_per_fma = 32; cores = 18;
julia> Hz * fma_per_clock * flop_per_fma * cores
4.7232e12
julia> ans / (10^6 * N / 58)
130.62744140625
Yikes. My ratio was about 130.
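The two calculations above follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: `peak_to_achieved` is a made-up name, and like the arithmetic above it counts one flop per element and takes the time in microseconds.

```julia
# Hypothetical helper: ratio of theoretical peak FLOPS to achieved FLOPS.
# Assumes one flop per element and a run time given in microseconds.
function peak_to_achieved(; Hz, fma_per_clock, flop_per_fma, cores, N, time_us)
    peak = Hz * fma_per_clock * flop_per_fma * cores   # theoretical peak FLOPS
    achieved = 10^6 * N / time_us                      # measured FLOPS
    return peak / achieved
end

# The 16-core run: about 43.5
r_16core = peak_to_achieved(Hz = 4e9, fma_per_clock = 2, flop_per_fma = 16,
                            cores = 16, N = 2^21, time_us = 44.499)

# The 10980XE run: about 130.6
r_10980xe = peak_to_achieved(Hz = 4.1e9, fma_per_clock = 2, flop_per_fma = 32,
                             cores = 18, N = 2^21, time_us = 58)
```

Plugging in your actual all-core clock speed instead of my 4 GHz guess only needs a change to `Hz`.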
I don’t know much about GPU computing, but I bet you couldn’t bring its number-crunching power to bear. Longer vectors would just make the memory problems worse.
I also don’t know enough yet about memory to say anything about TLB misses vs memory bandwidth, but I’ll start looking into that sort of thing one day.
For memory bound operations, memory performance dominates. Regardless of the reason, the Ryzen 3950X looks amazing here.