Hey,
I ran into an interesting phenomenon. Can anyone explain what is going on?
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
Results:
515.628 μs (5600 allocations: 276.56 KiB)
270.228 ms (11200 allocations: 553.13 KiB)
Instead of linear scaling (roughly 1 ms expected versus the measured 270 ms), there is a huge drop in speed of more than 250x.
I am working on an NVIDIA GTX 1050 4 GB video card.
What causes this speed drop, and how can I avoid it?
I reran it on a 1080 Ti and had to expand the test:
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit50(x::AbstractArray, y::AbstractArray) = for i=1:50 x .+= x .* y end
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
arit400(x::AbstractArray, y::AbstractArray) = for i=1:400 x .+= x .* y end
arit800(x::AbstractArray, y::AbstractArray) = for i=1:800 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit50($x_cu, $y_cu)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
@btime arit400($x_cu, $y_cu)
@btime arit800($x_cu, $y_cu)
My results:
216.893 μs (2800 allocations: 138.28 KiB)
373.034 μs (5600 allocations: 276.56 KiB)
744.082 μs (11200 allocations: 553.13 KiB)
3.372 ms (22400 allocations: 1.08 MiB)
232.840 ms (44800 allocations: 2.16 MiB)
This speed drop is pretty interesting. (It wouldn't be fun to run into this in production on a big dataset.)
I don't think this is a caching effect from the for loop.
I will follow this thread, as it sounds really interesting; we could probably open an issue on the CUDA.jl GitHub…
You need to do @btime CUDA.@sync stuff...
otherwise you're mainly measuring the time it takes to launch the kernels, not the time it takes them to finish computing. When I do that, I get timings that scale linearly with the length of the for loop, as expected.
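For example (a minimal sketch, reusing the arit functions and arrays defined above):

@btime CUDA.@sync arit100($x_cu, $y_cu)
@btime CUDA.@sync arit200($x_cu, $y_cu)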
In this case, it scales linearly for me:
46.216 ms (2810 allocations: 138.45 KiB)
93.765 ms (5610 allocations: 276.73 KiB)
189.303 ms (11210 allocations: 553.30 KiB)
379.951 ms (22410 allocations: 1.08 MiB)
764.581 ms (44810 allocations: 2.16 MiB)
Thanks!
I am not sure how to calculate it, but in this case the throughput would be something like 30_000_000 * 800 * 2 / 0.764581 ≈ 62 GFLOPS on the 1080 Ti.
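As a sanity check on that arithmetic (a sketch, assuming one multiply and one add per element per iteration):

elements   = 30_000_000                 # array length
iterations = 800                        # loop count in arit800
flops      = elements * iterations * 2  # one mul + one add per element
flops / 0.764581 / 1e9                  # ≈ 62.8 GFLOPS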
Sorry, indeed!
Thank you for the tips and the measurement.
The exact timings with CUDA.@sync:
182.634 ms (2810 allocations: 138.45 KiB)
362.649 ms (5610 allocations: 276.73 KiB)
726.573 ms (11210 allocations: 553.30 KiB)
1.581 s (22410 allocations: 1.08 MiB)
3.050 s (44810 allocations: 2.16 MiB)
1050 Ti here.
Anyway, what could explain the speed drop in the original case, where only the kernel launches were being measured?