I found this interesting phenomenon; can anyone explain the situation?
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
Instead of linear scaling there is a huge ~50x slowdown: arit100 takes about 5 ms while arit200 takes about 270 ms, where I would expect roughly a 2x difference.
I am working on an NVIDIA GTX 1050 4GB video card.
What causes this speed drop, and how can I avoid it?
I reran it on a 1080 Ti and had to expand the test:
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2.50
using CUDA
arit50(x::AbstractArray, y::AbstractArray) = for i=1:50 x .+= x .* y end
arit100(x::AbstractArray, y::AbstractArray) = for i=1:100 x .+= x .* y end
arit200(x::AbstractArray, y::AbstractArray) = for i=1:200 x .+= x .* y end
arit400(x::AbstractArray, y::AbstractArray) = for i=1:400 x .+= x .* y end
arit800(x::AbstractArray, y::AbstractArray) = for i=1:800 x .+= x .* y end
x_cu = CUDA.randn(Float32,30_000_000)
y_cu = CUDA.randn(Float32,30_000_000)
@btime arit50($x_cu, $y_cu)
@btime arit100($x_cu, $y_cu)
@btime arit200($x_cu, $y_cu)
@btime arit400($x_cu, $y_cu)
@btime arit800($x_cu, $y_cu)
This speed drop is pretty interesting. (It wouldn't be fun to run into this in production on a big dataset.)
I don't think this is caching caused by the for loop.
I will follow this thread, as it sounds really interesting; we could probably open an issue on the CUDA.jl GitHub…
You need to do @btime CUDA.@sync stuff...; otherwise you're mainly measuring the time it takes to launch the kernels, not for them to compute. GPU kernel launches are asynchronous, so the broadcasts return to the CPU as soon as the work is queued, and the timing only reflects real compute once the queue forces the CPU to wait. When I do that, I get timings that scale linearly with the length of the for loop, as expected.
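For reference, a synced version of the benchmark might look like this (a minimal sketch reusing the arrays and functions defined above):
using BenchmarkTools
using CUDA
# CUDA.@sync blocks until all queued GPU work has finished, so @btime
# measures the actual computation rather than just the kernel launches.
@btime CUDA.@sync arit100($x_cu, $y_cu)
@btime CUDA.@sync arit200($x_cu, $y_cu)
With the synchronization in place, arit200 should take roughly twice as long as arit100.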