NOTE FROM THE FUTURE: @roflmaostc noted that I forgot to synchronize the GPU call, so the benchmark below times only the kernel launch. They have added the synchronization (`CUDA.@sync $xGPU + $yGPU`) and provided the correct benchmarking code and plot here.
==============================================================
A quick and dirty CPU vs GPU benchmark, run on the (free!) “T4 GPU” runtime. I stopped at 10^8 elements because 10^9 exceeded either the 12.7 GB of RAM or the 15.0 GB of VRAM.
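To see why 10^9 does not fit, here is a rough back-of-the-envelope estimate of the input sizes. It is only a sketch: `footprint_gb` is a hypothetical helper (not part of the original post), it counts decimal GB (10^9 bytes), and it assumes that `cu` converts `Float64` input to `Float32`, which as far as I know is its default behaviour.

```julia
# Hypothetical helper: memory needed for `arrays` vectors of `n` elements of type T,
# in decimal GB (1 GB = 10^9 bytes).
footprint_gb(T::Type, n::Integer; arrays::Integer = 2) = arrays * n * sizeof(T) / 1e9

for p in 8:9
    n = 10^p
    println("n = 10^$p: ",
            footprint_gb(Float64, n), " GB for x and y on the CPU (Float64), ",
            footprint_gb(Float32, n), " GB for x and y on the GPU (assuming Float32)")
end
```

Under these assumptions the two CPU vectors alone need 16 GB at n = 10^9, which is already more than the 12.7 GB of RAM, and the result of `x + y` needs a third array on top of that.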
```julia
using Pkg
Pkg.add("CUDA")
Pkg.add("BenchmarkTools")
Pkg.add("Plots")  # Plots is loaded below, so install it as well
using Plots, CUDA, BenchmarkTools

pMax = 8
powerVector = 1:pMax
# @belapsed returns seconds as Float64; Float16 cannot represent times this small accurately
timeVectorCPU = Vector{Float64}(undef, pMax)
timeVectorGPU = Vector{Float64}(undef, pMax)

for p in powerVector
    n = 10^p
    xCPU, yCPU = (ones(n), ones(n))
    xGPU, yGPU = (cu(xCPU), cu(yCPU))
    timeVectorCPU[p] = @belapsed $xCPU + $yCPU
    # No CUDA.@sync here, so this times only the kernel launch (see the note at the top)
    timeVectorGPU[p] = @belapsed $xGPU + $yGPU
end

timeVectorCPU |> display
timeVectorGPU |> display

plot(
    10 .^ powerVector,
    [timeVectorCPU timeVectorGPU],
    label = ["CPU" "GPU"],
    title = "CPU vs GPU",
    xscale = :log10,
    yscale = :log10,
    ylabel = "Elapsed time [s]",
    fmt = :png
)
```
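Following the note at the top, a synchronized measurement for a single size could look roughly like the sketch below. It is not @roflmaostc's code, just a minimal illustration. Two choices here are my own assumptions: the fixed size `n = 10^6`, and using `Float32` on both sides, because `cu` (as far as I know) converts `Float64` to `Float32`, which would otherwise make the CPU and GPU add different element types. The `CUDA.functional()` guard only skips the GPU part when no usable GPU is present.

```julia
using CUDA, BenchmarkTools

n = 10^6  # illustrative size, not taken from the original benchmark

# Same element type on both sides so the comparison is like for like
xCPU, yCPU = ones(Float32, n), ones(Float32, n)
tCPU = @belapsed $xCPU + $yCPU
println("CPU: ", tCPU, " s")

if CUDA.functional()
    xGPU, yGPU = CuArray(xCPU), CuArray(yCPU)
    # CUDA.@sync blocks until the GPU has finished the addition,
    # so the measured time covers the computation and not just the kernel launch
    tGPU = @belapsed CUDA.@sync $xGPU + $yGPU
    println("GPU (synchronized): ", tGPU, " s")
else
    println("No usable GPU found; skipping the GPU measurement.")
end
```

Putting the same `CUDA.@sync` in front of the GPU expression inside the loop above should give the synchronized timings for every size.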