CUDA Speed drop

Hey,

I ran into an interesting phenomenon; can anyone explain what is going on?

using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA

# each call applies x .+= x .* y in place N times (one multiply and one add per element per iteration)
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end

x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)

Results:

  515.628 μs (5600 allocations: 276.56 KiB)
  270.228 ms (11200 allocations: 553.13 KiB)

Instead of linear scaling (doubling the iterations should take roughly 1 ms, but it takes 270 ms) there is a huge drop in speed. :frowning:
I am working on an NVIDIA GTX 1050 4 GB video card.
What causes this speed drop, and how can I avoid it?

I reran it on a 1080 Ti and had to expand the test:

using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA

arit50(x::AbstractArray, y::AbstractArray) = for i=1:50 x .+= x .* y end
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
arit400(x::AbstractArray, y::AbstractArray) = for i=1:400 x .+= x .* y end
arit800(x::AbstractArray, y::AbstractArray) = for i=1:800 x .+= x .* y end

x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)

@btime arit50($x_cu, $y_cu)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
@btime arit400($x_cu, $y_cu)
@btime arit800($x_cu, $y_cu)

My results:

  216.893 μs (2800 allocations: 138.28 KiB)
  373.034 μs (5600 allocations: 276.56 KiB)
  744.082 μs (11200 allocations: 553.13 KiB)
  3.372 ms (22400 allocations: 1.08 MiB)
  232.840 ms (44800 allocations: 2.16 MiB)

This speed drop is pretty interesting. (It wouldn't be fun to run into this in production on a big dataset. :smiley: )
I don't think this is caching caused by the for loop.

I will follow this thread as it sounds really interesting, and it is probably worth opening an issue on the CUDA.jl GitHub…

You need to do @btime CUDA.@sync stuff..., otherwise you're mainly measuring the time it takes to launch the kernels, not the time they take to compute. When I do that, I get timings that scale linearly with the length of the for loop, as expected.
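
For example, a minimal sketch of the synchronized benchmark (assuming the arit* functions and the x_cu / y_cu arrays defined above):

@btime CUDA.@sync arit100($x_cu, $y_cu)  # blocks until the GPU has finished all 100 kernels
@btime CUDA.@sync arit200($x_cu, $y_cu)  # should now take roughly twice as long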


In this case, it scales linearly for me:

  46.216 ms (2810 allocations: 138.45 KiB)
  93.765 ms (5610 allocations: 276.73 KiB)
  189.303 ms (11210 allocations: 553.30 KiB)
  379.951 ms (22410 allocations: 1.08 MiB)
  764.581 ms (44810 allocations: 2.16 MiB)

Thanks!
I am not sure how to calculate it, but in this case the throughput would be something like 30_000_000 * 800 * 2 / 0.764581 ≈ 62 GFLOPS on the 1080 Ti.
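
A rough sketch of that arithmetic (counting one multiply and one add per element per iteration, and using the 764.581 ms timing from the 800-iteration run above):

elements   = 30_000_000
iterations = 800
flops      = 2 * elements * iterations   # one mul + one add per element per iteration
time_s     = 0.764581                    # 800-iteration timing from the run above
flops / time_s / 1e9                     # ≈ 62.8 GFLOPS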

Sorry, indeed!

Thank you for the tips and the measurement.

The exact timings with CUDA.@sync:

  182.634 ms (2810 allocations: 138.45 KiB)
  362.649 ms (5610 allocations: 276.73 KiB)
  726.573 ms (11210 allocations: 553.30 KiB)
  1.581 s (22410 allocations: 1.08 MiB)
  3.050 s (44810 allocations: 2.16 MiB)

1050 Ti here.

Anyway, what could explain the speed drop in the original case, where only the kernel launches were being measured?