How to loop in a script properly? (Perhaps CUDA-specific)

I have some code that is slow on the first run. For example, my script looks like this:

function loop(x)
    # something very complicated.
end

function main()
    t = time()
    x = rand(100)
    for i in 1:3
        loop(x)
        println(time() - t)
        t = time()
    end
end

main()

I found that the first run of main() is very slow, even after the loop has moved past the first call of loop():

56.61500000953674
43.842000007629395
47.085999965667725

but if I rerun main() in the same REPL, it's fast:

1.872999906539917
1.8619999885559082
1.7840001583099365

But since the main function contains the loop, and the whole first run is very slow (every iteration, not just the first), I can't simply run it once to warm up.
How do I write a loop in a Julia script properly, so that I can just run julia --project=./ script.jl and it runs as fast as possible (ideally only the first iteration of the loop is slow)?
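The usual advice for hiding compilation latency is a warm-up call on a small input before the timed loop. This is only a sketch (it assumes loop() accepts any array of the same element type, and it only helps when the slowness really is compilation):

```julia
function loop(x)
    # something very complicated.
end

function main()
    x = rand(100)
    loop(rand(1))  # warm-up call on a tiny input, so compilation happens before timing
    for i in 1:3
        t = time()
        loop(x)
        println(time() - t)
    end
end

main()
```

If every iteration of the first run is slow, as reported above, a single warm-up call like this won't be enough, which suggests something beyond ordinary compilation is going on.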

function loop(x)
    # something very complicated.
end

t = time()
x = rand(100)
for i in 1:3
    loop(x)
    println(time() - t)
    global t = time()  # `global` is needed here when this runs as a script
end

how about this?

But then it leaks more variables into the global scope, and using global variables in a loop is discouraged in Julia, isn't it?

By the way, I tried running it, and it shows:

17.370999813079834
1.753000020980835
1.6850001811981201

Even the first run is much faster than putting it in main().
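If leaking variables into global scope is the concern, one way to keep the code top-level without introducing globals is a `let` block (a sketch; `loop` here is a stand-in for the real function):

```julia
function loop(x)
    # something very complicated.
end

let
    t = time()
    x = rand(100)  # t and x are locals of the let block, not globals
    for i in 1:3
        loop(x)
        println(time() - t)
        t = time()  # assigns the enclosing local, so no `global` annotation needed
    end
end
```

This keeps the top-level timing behavior while avoiding the non-const-global performance pitfall.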

Just out of curiosity: is loop(x) self-contained, or does it use global variables?

It uses variables defined in main(), so no global variables.

Another question: there must be many other functions involving loops, so in those functions, is the whole loop slow on the first run of the function?
And because the time is consistently over 40 s, it doesn't look like compile time to me…
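One way to check whether the time really is compilation: on Julia 1.6 and later, `@time` also reports the percentage of time spent compiling. A minimal sketch, with a trivial stand-in body for `loop`:

```julia
function loop(x)
    # stand-in for the complicated body
    sum(abs2, x)
end

function main()
    x = rand(100)
    for i in 1:3
        @time loop(x)  # on Julia 1.6+, also prints "% compilation time" when nonzero
    end
end

main()
```

If the compilation percentage is low but the first run is still slow, the cost is coming from somewhere else (e.g. GPU-side initialization or kernel compilation).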

Can you explain why putting it into main() could make things so slow?
(56 s vs 17 s, and afterwards 43 s vs 1.8 s)

I tried passing the loop length as a parameter of main(), and strangely, the loop time is now 10 s per iteration and does not speed up to 1 s on the second run…

I tried to write an MWE, but I found it hard to replicate this problem with a simple function. I don't know what causes this: in which cases an extra-slow compilation happens, and in which cases the for loop is slowed down permanently… I checked common problems like type instability and non-const globals, and that doesn't seem to be the case here.

By the way, maybe I should mention that this loop function contains complicated CUDA functions and custom CUDA kernels?

Oh well, then you should ask the CUDA people.

using CUDA
using Statistics

function some_func(x, y)
    a = CUDA.rand(1000)
    b = CUDA.rand(1000)
    c = a .* b
    return c .* x .* y
end

function main()
    result = []
    count = 0
    for i in 1:3
        t = time()
        for j in 0:20
            for k in 1:20
                count += 1
                for m in 1:20
                    a = CUDA.rand(1000)
                    b = CUDA.rand(1000)
                    c = some_func(a, b)
                    push!(result, mean(c))
                end
            end
        end
        println(time() - t)
    end
end

main()

I successfully replicated the problem with an MWE. The timings are:
First run:

42.087000131607056
33.35199999809265
34.365999937057495

Second run:

1.1009998321533203
0.8480000495910645
1.4190001487731934

Without main():

10.562000036239624
0.935999870300293
1.316999912261963
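A side note on measuring: GPU operations in CUDA.jl are asynchronous, so wrapping them in plain `time()` calls can measure launch overhead rather than actual execution. `CUDA.@sync` blocks until the GPU has finished, and `CUDA.@elapsed` returns GPU-synchronized seconds. A sketch reusing the MWE's `some_func`:

```julia
using CUDA

function some_func(x, y)
    a = CUDA.rand(1000)
    b = CUDA.rand(1000)
    c = a .* b
    return c .* x .* y
end

a = CUDA.rand(1000)
b = CUDA.rand(1000)
t = CUDA.@elapsed some_func(a, b)  # elapsed seconds, synchronized with the GPU
println(t)
CUDA.@sync some_func(a, b)         # blocks until all GPU work is done
```

This doesn't change the overall finding here (the timing gaps are far too large to be mere asynchrony), but it makes the per-run numbers more trustworthy.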

Changed title to attract CUDA experts.

This may be a regression. On CUDA.jl v3.1.0,

julia> main() # first run
29.039000034332275
1.7300000190734863
1.61899995803833

On v3.3.0,

julia> main() # first run
80.84599995613098
61.09299993515015
63.625

julia> main() # second run
1.5520000457763672
1.7239999771118164
1.375999927520752

xref Performance issue with complicated loops in function · Issue #984 · JuliaGPU/CUDA.jl · GitHub
