Resetting Device

Is there any way to reset the device state, or at least wipe the allocated memory, while keeping the already compiled kernels?
I keep seeing slowdowns during iterated calls to CUDA, and they disappear if I stop and rerun the execution. But as it is expensive to recompile everything, doing that is a waste of time. I still think it’s about how the memory is handled, but it is very difficult to pinpoint how exactly.
I tried running without the allocator just in case, but that’s obviously a no-go. Maybe it is possible to get my hands on host memory directly, as in C++? I know it’s against the rules, but in my particular use case I would like to try it, if it’s possible.

CUDA.jl has device_reset!.

Thanks for the answer, but this doesn’t work with CUDA 11.2 when the allocator is in use.
You have to use another pool, and then things are slowed down by that pool, I believe.

Correct, that’s due to a bug in CUDA. So update your driver :slightly_smiling_face:

I updated my drivers to the latest version and everything worked, but it had no effect on the speed.
Is it possible to create a device array that would not be managed by the GC? I would like to avoid triggering the GC as much as possible.
I was thinking of something like this:

function newCu(N)
    # allocate a raw device buffer, bypassing the pooling allocator
    # (sizeof(Float32) = 4 bytes per element)
    buf = Mem.alloc(CUDA.Mem.DeviceBuffer, sizeof(Float32) * prod(N))
    # wrap it in a CuArray without transferring ownership to the GC
    A = unsafe_wrap(CuArray, convert(CuPtr{Float32}, buf), prod(N); own=false)
    A .= 0f0
    return A, buf
end

But I have to admit that I don’t really understand whether I’m going in the right direction, and I also wonder how to destroy these arrays when they are not needed anymore?

You mentioned a slowdown because of compilation, but now you’re mentioning it’s GC related?

Anyway, disabling the GC is not a magic bullet. And it requires you to do your own memory management, which is going to be very tricky. If you want to go that far, just use CUDA.unsafe_free! to inform CUDA.jl about allocations that can be collected, that should get you pretty far without actually doing your own memory management. Do note that this only drops the allocation’s refcount, so if you have multiple outstanding objects using that buffer – e.g. a view – calling that function on a single instance won’t do anything.
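For example, a minimal sketch of that refcounting behaviour (assuming a contiguous view, which shares the parent’s buffer):

a = CUDA.rand(1024)      # buffer allocated from the pool, refcount 1
v = view(a, 1:512)       # the view shares a's buffer, refcount 2
CUDA.unsafe_free!(a)     # drops the refcount; nothing is freed yet since v is alive
CUDA.unsafe_free!(v)     # refcount reaches zero, the buffer returns to the pool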

I probably was unclear about the slowdown; it is still the same problem I mentioned in “GC hitting hard”.
Changing the initialization lowered the memory pressure, but the slowdown keeps hitting. I’m pretty sure it is related to a memory problem, as it strikes faster when working on bigger games. So I was thinking that maybe resetting the device completely would make the next iteration as fast as the first one, and I was afraid that the functions would then have to be recompiled, which turns out not to be the case.
But resetting didn’t prevent the slowdown from happening, and it had the side effect of making CUDA create a new pool, wasting a lot of time, so it’s definitely not a solution.
So all I’m left with is trying to manage memory manually and seeing if that works.
CUDA.unsafe_free! was not working on unmanaged arrays created with unsafe_wrap, so what I try instead is to free the buffer with CUDA.Mem.free.
As a side note: I know disabling the GC is not magical, but I also know there is a problem when mixing GC pressure and CUDA.jl, and I would like to find a solution, just to see if I can get my implementation as fast as the C++ one. (I also talked to Jonathan Laurent, the creator of AlphaZero.jl, and he told me he also has memory issues.)
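Roughly, the lifecycle I am aiming for looks like this (just a sketch based on the snippet above; n stands for the element count):

buf = Mem.alloc(CUDA.Mem.DeviceBuffer, sizeof(Float32) * n)
A = unsafe_wrap(CuArray, convert(CuPtr{Float32}, buf), n)
A .= 0f0
# ... use A in the hot loop ...
CUDA.Mem.free(buf)   # manual free; A must not be touched afterwards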
Thank you for your help

You’re leaving out crucial details here. If you’re using unsafe_wrap to manage your own allocations, you’re not using the memory pool. As a consequence, allocations will be much more expensive, but also you end up with two competing pools which may negatively impact memory management.

It’s not easy to communicate all the details like that :slight_smile:
To be clear, the original implementation relies only on the safe functionality of CUDA.jl; it does not use any low-level features, only CuArrays, and everything is managed by CUDA.jl. But as it is way slower than the C++ counterpart, it should be possible to do better; that’s what I’m trying to do with the unsafe functions etc., but it is hard…

At a high level, the problem can be stated like this: let f() be a function (calling CUDA and other stuff).
Then

for i in 1:n
   f()
end

gets slower and slower at each iteration, which, before seeing it, I couldn’t have imagined was possible.

But why was it slower? Without knowing that, jumping into unsafe operations, which here result in issues with memory management, doesn’t seem like a good idea. Did you profile the code? Was it a specific array operation that was slow?

Well, I don’t know why it gets slower. As a newbie I was not able to make the profiling tools work, so I’m in the dark. I “profiled” some portions of the code with time(); all I’m sure of is that globally there is a slowdown, and I’m pretty sure it’s related to memory issues. The slowdown can happen in portions of the code that run only on the CPU (I manually timed many portions of the code).
The problem is that I cannot write a simple iteration that replicates the issue; maybe that’s what I should try to do.
Maybe the problem with Julia and CUDA.jl is that you made such powerful tools that anybody can use them, but the profiling tools are still reserved for the clever guys…

Trying to find minimal code that replicates the issue, I found this strange behaviour, which might be the source of the problem.

using CUDA

function test(n)
    L = 32 * 1024

    r = []
    for i in 1:n
        working_batch = CUDA.rand(128, L)
        policy_final = CUDA.zeros(L, 81)

        for j in 1:50
            # first inner loop: GPU allocations only, nothing is kept
            for k in 1:64
                prob = CUDA.rand(L, 81)
                synchronize()
            end
            # copy the device arrays back to the host
            policy, batch = Array(policy_final), Array(working_batch)
            # second inner loop: CPU-side allocations only
            for k in 1:L
                push!(r, policy[k, :])
                if length(r) > 2000000
                    popfirst!(r)
                end
            end
        end
        #GC.gc()
        println(length(r))
    end
end

If you execute test(10), you will see that the memory consumed explodes; it seems the GC is not triggered.

If you comment out the first inner loop, then the GC is triggered and memory usage stays low.

If you comment out the second inner loop and the line starting with policy, batch = ..., then no allocations are made, as intended.

If you force a collection with GC.gc() just before the println, then memory is reclaimed.

I don’t get what is happening, but it seems there is a bad interplay between CUDA and the GC. Or maybe I’m fooling myself again.


I tried various changes, none of which fixed the allocation issue, but you can get a significant speedup by transposing policy_final so that you’re taking memory-contiguous slices with policy[:,k]. As a matter of general practice, it’s ill-advised to initialize empty arrays like r = []:

julia> r = []
Any[]

julia> eltype(r)
Any

For type stability, it’s better to annotate the eltype like r = Vector{Float32}[].
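For comparison, a quick check of the annotated version (outputs from a recent Julia; the alias display may differ by version):

julia> r = Vector{Float32}[]
Vector{Float32}[]

julia> eltype(r)
Vector{Float32} (alias for Array{Float32, 1})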

I just want to mention that these types of issues have been a serious headache for AlphaZero.jl from the very start. To be fair, the situation is much better now than it used to be: I remember a time under CuArrays 1.2 when 90% of training time was spent in GC! But I am still probably leaving a 2x performance factor on the table, just because of bad memory management (AlphaZero.jl may be hit even harder than @Fabrice_Rosay’s implementation, as it performs more allocations for the sake of modularity).

As was discussed in this thread, this is one of the rare places where Python’s ref-counting strategy is actually a great win. And I have never been completely satisfied by the answers provided in the thread I cited, which basically come down to “this is not a big deal in Julia as Julia makes it pretty easy not to allocate when necessary”.

I am still wondering how much of this problem could be solved simply with a better runtime and how much will always come down to having developers eliminate allocations in their code and free resources manually when needed. In the latter case, having powerful tooling to identify memory management issues and fix them strikes me as particularly important.


Thanks for the answer. This part of the loop is not the bottleneck in the real case; it accounts for less than 1% of the time. And actually policy_final is calculated on the GPU, and I think it is faster to access [thread, index] than the reverse order, so that threads in a warp work on nearby memory. The real problem is having to call the GC manually, which is very slow. I don’t understand why the GC can handle things when you remove the first loop, which does nothing related to the second one.

policy_final is calculated on the GPU, but you’re indexing into policy[k,:], which was created on the CPU with policy = Array(policy_final), so Julia’s typical column-major memory ordering will still hold.
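Concretely, the earlier suggestion amounts to something like this (a sketch using the shapes from your test function):

policy_final = CUDA.zeros(81, L)   # transposed: one column per position
policy = Array(policy_final)       # the host copy keeps column-major layout
for k in 1:L
    push!(r, policy[:, k])         # contiguous column slice instead of a strided row
end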

You’re right - these loops shouldn’t have anything to do with one another. As a band-aid, you can run the GC only when memory pressure is high:

if Sys.free_memory() < 0.2 * Sys.total_memory()
    GC.gc()
end

I just found that adding CUDA.unsafe_free!(prob) after prob is used in the first loop makes the allocation problem disappear. I think it points towards the GC pressure coming from the CUDA side.
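That is, the first inner loop becomes:

for k in 1:64
    prob = CUDA.rand(L, 81)
    synchronize()
    CUDA.unsafe_free!(prob)   # hand the buffer back to the pool right away
end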

That’s never the intention. Did you see the documentation at Benchmarking & profiling · CUDA.jl?

But yes, a GC for GPU applications comes at a cost. But as @jonathan-laurent noted above, the situation is already massively better than it used to be. And with some appropriate compiler transformations (essentially doing unsafe_free! for you), we should be able to remove this problem for most users. In the meantime, just use unsafe_free! yourself, essentially communicating some high-level knowledge about array lifetimes to the runtime. Other workarounds shouldn’t be required.

Sir! Yes Sir!
I will add unsafe_free! manually.
To be serious:

  1. [quote=“Fabrice_Rosay, post:11, topic:63997”]
    Maybe the problem with Julia and CUDA.jl is that you made such powerful tools that anybody can use them, but the profiling tools are still reserved for the clever guys…
    [/quote] was kind of ironic and not aimed against CUDA.jl.
  2. That said, it took me 15 days to write a first version of AlphaGPU (around 4 times slower than the C++ counterpart); before trying, I had never written a single line of CUDA, so all credit goes to CUDA.jl.
  3. Another 3 weeks or so to make it more generic and optimize some parts.
  4. Something like 4 weeks to figure out that adding this unsafe_free! gives a 2x gain (which I don’t complain about; it’s just, at least IMHO, counter-intuitive, and maybe it could prove useful to add this example to the documentation section on how to relieve pressure on the GC using unsafe_free!).
  5. And I never managed to make any profiler work, despite reading the corresponding section of the CUDA.jl docs; but again, I never said anything against your work, it is just a fact.

All in all, that made me a little bit surprised by the harsh tone of your answer, but probably I missed the irony; that happens.
“Qui bene amat, bene castigat” (he who loves well, chastises well).

Thanks for your time and your answers

I’m having a hard time understanding which parts of my comment you interpreted as “harsh”, but that obviously wasn’t the intention. I’m just trying to help you out by linking to the documentation in case you missed it, and providing some background on the issue with Julia’s GC / unsafe_free!… I’ll try adding some emoticons in the future :slight_smile:
