How much faster is GPU compare to CPU


#1

Hi, I heard the amazing things about GPU and how much faster it can beat CPU. But I don’t know what kind of speed up is expected. So I ran a matrix multiplication to do the comparison.

I ran my experiment on a server that has a Intel Gold 6148 CPU, which has 20 cores at 2.40 GHz frequency, and a GPU of NVIDIA V100 16GB memory. For CPU I used OpenBLAS library, and CuBLAS for the GPU. I think both libraries are highly optimized and should be a fair comparison. The OpenBLAS thread count default on this server is 16 threads. I think compare GPU to single thread CPU is cheating. To make comparison more fair, I also included data transfer time to and from GPU in the gpu_time, because this is the overhead of using GPU.

I ran the double precision gemm. and the best result for speed up is 5 times (cpu_time/gpu_time) for a matrix of size 30k*16k( this is the biggest size of matrix I can fit into a 16GB memory of GPU). Most of the time, there is no speed up at all. I often hear people get 20-100 times speed up. So this result is a little disappointing. Also I think matrix multiplication is a very GPU friendly calculation because of it’s highly paralellizable.

So what is your result for speed up? and what is the application you have. Is it better to run a more complex function on GPU than a simple matrix multiplication?

Thank you!


#2

I think the crux of the issue is that you included transfer time.
If you do have to transfer data to and from the GPU, I think it’s really unlikely for that to be worth it.
However, if you can keep your data on the GPU, doing the vast majority of your calculations there, then it is worthwhile.

Training neural networks is an ideal example of that.

Personally, I have struggled to make good use of GPUs. I still really want to sit down sometime and actually learn how to optimize them, or simply write code that can run on them and stay there.
(This is why I am such a big fan of AVX-512. While it still doesn’t come close to GPU-speed/$, it is a big step up from avx2, and something I’ve actually been capable of using.)


#3

A GPU is not faster than a CPU. In fact, it’s about an order of magnitude slower. However, you get about 3000 cores. But these cores are not able to act independently, so they essentially all have to do the same calculations in lock step. Additionally, there is a data transfer cost. This means that, if you want to do 3000 of the same simple calculation all at the same time, GPUs are great, otherwise they are awful. It turns out that linear algebra does exactly the kinds of operations that are highly GPU parallel, so matrix multiplication is pretty much the canonical use case for GPUs. Broadcasted operations are another use case, but only if what’s broadcasted is “sufficiently expensive”. Usually it’s this domain where you see the whopping 150x speedups by custom writing a kernel for some mathematical problem and calling it on 3000 parameters at a time. Matrix multiply can do well, but it has a heavy data transfer cost that can offset a lot of the benefits if the matrix is only multiplied once. If you store things on the GPU and keep re-using them in subsequent operations in some iterative algorithm, you will see more of a benefit.

Note that GPUs typically are very slow at 64-bit floating point operations but much faster at 32-bit floating point operations, so you should try to use fp32 on most GPUs (other than the Tesla line, but even then fp32 is still faster).


#4

I think another interesting criteria here is to see how much it costs to buy a 16 core 2.40 GHz CPU, and then compare that to the same money-worth of GPU resources.


#5

Although the Tesla line, like the v100, are only about twice as fast for single precision, which is the same difference you’ll see on most CPUs:

julia> A64 = randn(5000,5000);

julia> B64 = randn(5000,5000);

julia> C64 = randn(5000,5000);

julia> A32 = randn(Float32, 5000,5000);

julia> B32 = randn(Float32, 5000,5000);

julia> C32 = randn(Float32, 5000,5000);

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark mul!($C64, $A64, $B64)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     280.856 ms (0.00% GC)
  median time:      281.497 ms (0.00% GC)
  mean time:        281.964 ms (0.00% GC)
  maximum time:     285.681 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1

julia> @benchmark mul!($C32, $A32, $B32)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     146.662 ms (0.00% GC)
  median time:      147.056 ms (0.00% GC)
  mean time:        152.194 ms (0.00% GC)
  maximum time:     221.797 ms (0.00% GC)
  --------------
  samples:          34
  evals/sample:     1

According to Wikipedia, the v100 comes in two varieties, that get 14899/7450 and 14028/7014 single/double precision fma gflops, or about a 2/1 ratio, which isn’t too far off from the 1.9x we see in CPU matrix multiplication.


#6

Considering that:

  • you have enough //ism (several thousand of threads)
  • you deal with supported FP accuracy on your GPU (let’s say FP32)
  • you can put most of your computation on the GPU and get rid of the CPU-GPU transfer
  • The GPU memory size is large enough for your problem
  • You computation is not dominated by diverging algorithms (with a lot a different code branch for threads)

then you have to characterize your problem according the so called arithmetic intensity of your algorithm (which measures the number of operations for each memory transfer).

For an elementary algorithm you can be either

  • compute bound : (e.g. gemm where you have N^3 ops vs N^2 data)
  • memory bound : most of classical N.A. algorithms

Then the possible speedup on the GPU is easy to predict considering either the bandwidth ratio or the peak perf ratio. Typically:

  • memory bound : x5-x10 for the bandwidth (600GB/s on the GPU VRAM and 50 GB/s on the CPU RAM)
  • compute bound : x100 (10 FP32 TFlops on the GPU compared to 100 FP32 GFlops on the CPU)

For more details on arithmetic intensity see for example : https://en.wikipedia.org/wiki/Roofline_model


#7

I think the consensus here is that an algorithm that keeps the data in GPU as long as possible, and highly parallel and minimal divergence is a good candidate for GPU. I agree the latter two, but I was not aware of the first one. The CPU I used cost about $3000 and the V100 cost about $7000, so just considering the cost, GPU did slightly better, but if you factor in the cost of labor to do the GPU computing, I wouldn’t say that GPU is cheap. (Of course, with Julia’s GPU package, the developing cost is lower, but there are still expertise required to identify the part of code that can be done on GPU).


#8

Do you have code on Github that uses this functionality? I am interested to see how much extra effort is involved to use AVX-512, and how much speed up you can have.


#9

Thank you for the reply. Can you elaborate on the details on how to calculate the bandwidth ratio and peak performance ratio? I always think that it is difficult to predict unless you run it with actual GPU, but this prediction would be very helpful on deciding whether to buy certain hardware.


#10

Do you already know if your algorithm is memory or compute bound ? On Intel CPU the recent versions of vtune can give you a direct diagnostic (and also a point in the roofline model). If you are memory bound (by far the most common case in my experience) then the AVX-512 will not help.
If you spent your time in dgemm for large matrices your measurements seem to be valid since the CPU you mentioned (6148) is given for more than 1TFlops and the V100 at 7.8 TFlops. Clearly, for this class of processors the very pricey GPU like V100 did not improve very much compared to previous generations (it improves a lot for less embarrassingly // computations and for half precision).
From you initial post it is not clear if you have a specific application in mind or if you want to buy a machine for general purpose and multi-users.
In my experience, if you develop your own computation kernels, it is more easy to approach peak performance (both for bandwidth and compute) on the GPU (when the algorithms meets all the criteria) than on CPU (mainly because the shared memory can be explicitly manipulated which is not the case with CPU caches).
Again, apart from very special cases like library gemm implementation (linpack family), the vast majority of large parallel codes that I have observed where very far from the compute peak performance (<10% and usually <1%).
Not that these codes usually exhibit a very nice scalability with the processor number. Note also that the scalability tends to decrease when the performance per processor increases :wink:
The HPCG benchmark is far more representative of real life computations. In these “common” cases the bandwidth (and latency) of the machine is the dominant feature.

My personal conclusion about this HPC business is that, at the very end, there is generally a lot more potential performance to be gained in improving the algorithms and maths, than to hunt the very last drops of computing perfs. That is why Julia is so important :wink:


#11

I am only doing a research to better understand GPU performance. I have not put my own code in GPU to test. Right now I use a server, and sometimes I have to wait for other people’s job to finish before my job is ran. So I am also considering buying a hardware for our internal use. So even the expensive V100 in the server, working on a highly parallelizable problem, is only 5 times faster. That was a little disappointing. So to conclude, writing a costume kernel to utilize shared memory and keep computation in GPU as much as possible, can help with GPU performance.

Also thanks everybody for helping me understand the GPU business.


#12

Note that the V100 is even slower than the GeForce line on single precision (just a few hundred dollars). Tesla is wildly overpriced only for it’s EEC memory and double precision support. If you can do single precision then much much cheaper cards are fine.


#13

Which GeForce GPU are you talking about? The V100 has 30/15/7.5 TFLOPs for FP16/32/64, while eg. the recently released, top-of-the-line 2080Ti GeForce (which already costs €1300) only has 26/13/0.4 TFLOPs… Not to speak about the Volta hardware features that until recently were only available on the V100.


#14

The prominent V100 feature it’s tensor cores and DNN applications. For applications that were already efficient with Pascal architectures, I suppose that the price increase is more difficult to accept.

Back to the initial question, I forgot to mention the approximate hard coded maths functions (exp sin sqrt…) that can lead to spectacular speed ups compared to IEEE soft implementations.

Last, I speculate that the new RTX cards could have new spectacular scientific applications in Monte Carlo Transport, or places where ray tracing is important.


#15

I’m still talking Tesla K40 vs GeForce 980. I guess that part changed. But still, top end GeForce is 26 TFLOPS vs 30 FLOPS.


#16
julia> peakflops() / 1024^3

gives me ~30 gflops per Cpu core, for a total of 120 (in theory)
The GPU here is integrated, but still has 400 shader cores rated for 480 gflop

It also uses the main system ram so it might be interesting to benchmark on more intense memory-swapping tasks


#17

Hi!

Often, it takes no extra effort at all.

function dot_product(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    @inbounds for i in eachindex(x,y)
        out += x[i] * y[i]
    end
    out
end

x, y = randn(256), randn(256);

On a computer with avx512:

julia> @code_native dot_product(x, y)
	.text
; Function dot_product {
; Location: REPL[4]:3
; Function eachindex; {
; Location: abstractarray.jl:207
; Function eachindex; {
; Location: abstractarray.jl:217
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: REPL[4]:2
	subq	$40, %rsp
	movq	24(%rdi), %rax
;}
; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
	movq	%rax, %rcx
	sarq	$63, %rcx
	andnq	%rax, %rcx, %r8
;}}}}}}}
; Location: abstractarray.jl:218
; Function _all_match_first; {
; Location: abstractarray.jl:224
; Function #89; {
; Location: abstractarray.jl:218
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: array.jl:155
	movq	24(%rsi), %rcx
;}
; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
	movq	%rcx, %rdx
	sarq	$63, %rdx
	andnq	%rcx, %rdx, %rcx
;}}}}}}}}}
; Function _all_match_first; {
; Location: promotion.jl:403
	cmpq	%rcx, %r8
;}
	jne	L300
; Location: abstractarray.jl:217
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
	testq	%rax, %rax
;}}}}}}}}}
	jle	L76
	movq	(%rdi), %rcx
	movq	(%rsi), %rdx
; Location: REPL[4]:3
	cmpq	$32, %r8
	jae	L88
	vxorpd	%xmm0, %xmm0, %xmm0
	movl	$1, %esi
	jmp	L259
L76:
	vxorps	%xmm0, %xmm0, %xmm0
; Location: REPL[4]:6
	addq	$40, %rsp
	vzeroupper
	retq
; Location: REPL[4]:3
L88:
	movabsq	$9223372036854775776, %rdi # imm = 0x7FFFFFFFFFFFFFE0
	andq	%r8, %rdi
	leaq	1(%rdi), %rsi
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%eax, %eax
	vxorpd	%xmm1, %xmm1, %xmm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	nopl	(%rax,%rax)
; Location: REPL[4]:4
; Function getindex; {
; Location: array.jl:739
L128:
	vmovupd	(%rdx,%rax,8), %zmm4
	vmovupd	64(%rdx,%rax,8), %zmm5
	vmovupd	128(%rdx,%rax,8), %zmm6
	vmovupd	192(%rdx,%rax,8), %zmm7
;}
; Function +; {
; Location: float.jl:395
	vfmadd231pd	(%rcx,%rax,8), %zmm4, %zmm0
	vfmadd231pd	64(%rcx,%rax,8), %zmm5, %zmm1
	vfmadd231pd	128(%rcx,%rax,8), %zmm6, %zmm2
	vfmadd231pd	192(%rcx,%rax,8), %zmm7, %zmm3
	addq	$32, %rax
	cmpq	%rax, %rdi
	jne	L128
	vaddpd	%zmm0, %zmm1, %zmm0
	vaddpd	%zmm0, %zmm2, %zmm0
	vaddpd	%zmm0, %zmm3, %zmm0
	vextractf64x4	$1, %zmm0, %ymm1
	vaddpd	%zmm1, %zmm0, %zmm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%zmm1, %zmm0, %zmm0
	vpermilpd	$1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
	vaddpd	%zmm1, %zmm0, %zmm0
	cmpq	%rdi, %r8
;}
; Location: REPL[4]:3
	je	L292
L259:
	addq	$-1, %rsi
	nopw	(%rax,%rax)
; Location: REPL[4]:4
; Function getindex; {
; Location: array.jl:739
L272:
	vmovsd	(%rdx,%rsi,8), %xmm1    # xmm1 = mem[0],zero
;}
; Function +; {
; Location: float.jl:395
	vfmadd231sd	(%rcx,%rsi,8), %xmm1, %xmm0
;}
; Function iterate; {
; Location: range.jl:591
; Function ==; {
; Location: promotion.jl:403
	addq	$1, %rsi
	cmpq	%rsi, %r8
;}}
	jne	L272
; Location: REPL[4]:6
L292:
	addq	$40, %rsp
	vzeroupper
	retq
; Location: REPL[4]:3
; Function eachindex; {
; Location: abstractarray.jl:207
; Function eachindex; {
; Location: abstractarray.jl:218
L300:
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 8(%rsp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 16(%rsp)
	movq	%rdi, 24(%rsp)
	movq	%rsi, 32(%rsp)
	movabsq	$jl_invoke, %rax
	movabsq	$140450899053840, %rdi  # imm = 0x7FBD45F24D10
	leaq	8(%rsp), %rsi
	movl	$4, %edx
	callq	*%rax
	ud2
	nopw	%cs:(%rax,%rax)
;}}}

All the zmm registers you are 512 bit registers. ymm is 256, and xmm is 128.
The main loop body is:

L128:
	vmovupd	(%rdx,%rax,8), %zmm4
	vmovupd	64(%rdx,%rax,8), %zmm5
	vmovupd	128(%rdx,%rax,8), %zmm6
	vmovupd	192(%rdx,%rax,8), %zmm7
;}
; Function +; {
; Location: float.jl:395
	vfmadd231pd	(%rcx,%rax,8), %zmm4, %zmm0
	vfmadd231pd	64(%rcx,%rax,8), %zmm5, %zmm1
	vfmadd231pd	128(%rcx,%rax,8), %zmm6, %zmm2
	vfmadd231pd	192(%rcx,%rax,8), %zmm7, %zmm3
	addq	$32, %rax
	cmpq	%rax, %rdi
	jne	L128

L128 4 registers from memory (a total of 32 doubles) of one array, and then it uses fused multiply-add instructions (vfmadd) to multiply numbers from another region of memory (the other array) with the just loaded values, and add them to accumulated dot products.
Besides this making the dot product dramatically faster, you can also compare this with pairwise summation – it should also be more accurate than the naive algorithm.

It is fast:

julia> using BenchmarkTools

julia> @btime dot_product($x, $y)
  12.979 ns (0 allocations: 0 bytes)
4.738430453861962

julia> @btime $x' * $y
  25.089 ns (0 allocations: 0 bytes)
4.738430453861962

For comparison, on my similarly-clocked Ryzen, the minimum time is about 34ns.

If you’d like a little more involved example, I recently been working on a problem that involves a mixture of zero-mean tri-variate T distributions. A Gibbs sampler is straightforward to implement for this problem; here is the function calculating the unnormalized conditional probabilities of group membership:

using Random, BenchmarkTools, SpecialFunctions
function update_individual_probs_v1!(mt::MersenneTwister, probabilities::AbstractMatrix{T},  baseπ,
                                    Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
   
    
    @inbounds for g ∈ 1:NG
        Li11 = Li[1,g]
        Li21 = Li[2,g]
        Li31 = Li[3,g]
        Li22 = Li[4,g]
        Li32 = Li[5,g]
        Li33 = Li[6,g]
        νfactor = (ν[g] - 2) / ν[g]
        exponent = T(-1.5) - T(0.5) * ν[g]
        base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) +
                    lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
        for i ∈ 1:size(probabilities,1)
            lx₁ = Li11 * x[i,1]
            lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
            lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
            probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
        end
    end
end
function update_individual_probs_v1!(probabilities::AbstractMatrix{T},  baseπ,
                                    Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
   update_individual_probs_v1!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG)) 
end

Li is the inverse of the Cholesky factor of the covariance matrix. So we’re calculationg (Li * x)’(Li * x), for each of NG groups. Creating a simple fake data set:

T = Float32
NG = 6;
N = 1024;
X = randn(T, N,3);
probabilities = Matrix{T}(undef, N, NG);
Li = rand(T, 6,NG);
ν = T(N / NG) + 4 .+ rand(T, NG);
baseπ = rand(T, NG);
baseπ ./= sum(baseπ);

This yields:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     118.813 μs (0.00% GC)
  median time:      121.939 μs (0.00% GC)
  mean time:        127.839 μs (0.00% GC)
  maximum time:     195.095 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The @simd macro does not help.

However, I wrote a macro in SLEEFwrap that (with a few assumptions) can vectorize loops, including those that contain special functions (via SLEEF):

using SIMDPirates, SLEEFwrap
using SLEEFwrap: @restrict_simd

@generated function update_individual_probs!(mt::MersenneTwister, probabilities::AbstractMatrix{T}, baseπ::AbstractVector{T}, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    quote
        @inbounds for g ∈ 1:NG
            Li11 = Li[1,g]
            Li21 = Li[2,g]
            Li31 = Li[3,g]
            Li22 = Li[4,g]
            Li32 = Li[5,g]
            Li33 = Li[6,g]
            νfactor = (ν[g] - 2) / ν[g]
            exponent = T(-1.5) - T(0.5) * ν[g]
            base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) +
                    lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
            @restrict_simd $T for i ∈ 1:size(probabilities,1)
                lx₁ = Li11 * x[i,1]
                lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
                lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
                probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
            end
        end
    end
end
function update_individual_probs!(probabilities::AbstractMatrix{T}, baseπ,
                                    Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    update_individual_probs!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG))
end

This yields:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.363 μs (0.00% GC)
  median time:      7.867 μs (0.00% GC)
  mean time:        7.990 μs (0.00% GC)
  maximum time:     16.006 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

For some reason the times are about 20 microseconds slowed from the REPL than from IJulia. I showed IJulia times above.
EDIT: Nevermind on that – the runtime is data-dependent. In the REPL, I was using the fake data I generated above. In IJulia, I was using more realistic fake data and processing it as I normally would before getting to that step.