Hey,
I ran into an interesting phenomenon. Can anyone explain what is going on?
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
Results:
515.628 μs (5600 allocations: 276.56 KiB)
270.228 ms (11200 allocations: 553.13 KiB)
Instead of linear scaling (roughly 1 ms expected versus the measured 270 ms), there is a huge drop in speed of more than 250x.
I am working on an NVIDIA GTX 1050 4 GB video card.
What causes this speed drop, and how can I avoid it?
I reran it on a 1080 Ti and had to expand the test:
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit50(x::AbstractArray, y::AbstractArray) = for i=1:50 x .+= x .* y end
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
arit400(x::AbstractArray, y::AbstractArray) = for i=1:400 x .+= x .* y end
arit800(x::AbstractArray, y::AbstractArray) = for i=1:800 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit50($x_cu, $y_cu)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
@btime arit400($x_cu, $y_cu)
@btime arit800($x_cu, $y_cu)
My results:
216.893 μs (2800 allocations: 138.28 KiB)
373.034 μs (5600 allocations: 276.56 KiB)
744.082 μs (11200 allocations: 553.13 KiB)
3.372 ms (22400 allocations: 1.08 MiB)
232.840 ms (44800 allocations: 2.16 MiB)
This speed drop is pretty interesting. (It wouldn't be fun to run into this in production on a big dataset.)
I don't think this is a caching effect from the for loop.
I will follow this thread, as it sounds really interesting; we could probably open an issue on the CUDA.jl GitHub…
You need to do @btime CUDA.@sync stuff...
otherwise you're mainly measuring the time it takes to launch the kernels, not the time it takes them to finish computing. When I do that, I get timings that scale linearly with the length of the for loop, as expected.
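For example (a minimal sketch, reusing the arit functions and arrays defined above):

@btime CUDA.@sync arit100($x_cu, $y_cu)
@btime CUDA.@sync arit200($x_cu, $y_cu)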
In this case, it scales linearly for me:
46.216 ms (2810 allocations: 138.45 KiB)
93.765 ms (5610 allocations: 276.73 KiB)
189.303 ms (11210 allocations: 553.30 KiB)
379.951 ms (22410 allocations: 1.08 MiB)
764.581 ms (44810 allocations: 2.16 MiB)
Thanks!
I am not sure how to calculate it, but in this case the throughput would be something like 30_000_000 * 800 * 2 / 0.764581 ≈ 62 GFLOPS on the 1080 Ti.
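As a sanity check on that arithmetic (a sketch, assuming one multiply and one add per element per iteration):

elements   = 30_000_000                 # array length
iterations = 800                        # loop count in arit800
flops      = elements * iterations * 2  # one mul + one add per element
flops / 0.764581 / 1e9                  # ≈ 62.8 GFLOPS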
Sorry, indeed!
Thank you for the tips and the measurement.
The exact timings with CUDA.@sync:
182.634 ms (2810 allocations: 138.45 KiB)
362.649 ms (5610 allocations: 276.73 KiB)
726.573 ms (11210 allocations: 553.30 KiB)
1.581 s (22410 allocations: 1.08 MiB)
3.050 s (44810 allocations: 2.16 MiB)
1050 Ti here.
Anyway, what could explain the speed drop in the original case, where only the kernel launches were being measured?