AlphaGPU: an AlphaZero implementation wholly on GPU

For those interested, I made a fully CUDA implementation of AlphaZero for Connect 4.
It can train a network to a very strong level in 2 to 3 hours on a single GPU, thanks to being able to play 32000 games in parallel.
It can probably be optimized considerably, as this is the first kernel I ever wrote. You can find everything in GitHub - fabricerosay/AlphaGPU: Alphazero on GPU thanks to CUDA.jl


How does speed change with the number of games? This is even cooler if it can get 80% of the speed when running 8000 concurrent games, since then you could couple it with distributed training for even more speed.

I’m not sure I understand. It can play 32000 games in parallel. Depending on the number of rollouts and the size of the network, one full iteration (32K games, training, and pitting the new network against the old one) takes between about 1 minute (small network of around 500,000 weights, 64 rollouts, training on 2 million samples per iteration) and around 20-30 minutes (millions of weights and 512 rollouts).
The small network and 64 rollouts are enough to train a very good agent, i.e. on par with AlphaZero.jl (7% error on the beginning-hard set, 3-4% on middle-hard with 600 rollouts; but if you compare at fixed time you can make many more rollouts, as the network evaluation is many times faster than a CNN, at the expense of worse generalization).

This is amazing and it is really a testament to the power of CUDA.jl. Have you considered writing a more detailed blog article about it? I think many people would find it a great read.

Now, the real question to me is: what would it take to get a game-agnostic implementation? Would it be possible to implement a generic version of MCTS that runs on GPU? And if so, what would be a GPU-friendly game API for users to write their own environments in? If we can find nice answers to these questions, I would be really interested in implementing a full-GPU mode for AlphaZero.jl.


I’m working on making a more generic implementation, but this one already nearly is.
Everything specific to Connect 4 is:
A struct Position (the state of the game)
A function canPlay(pos,c) returning true if you can play at c
A function play(pos,c) that updates the game state after playing at c
A function isOver(pos) returning a tuple (bool, winner)
Then you have to change a bunch of constants (max length of the games, maxactions, etc., which were not written generically), and it should work.
The only pain is that the struct Position shouldn’t use Array or stuff like that (well, theoretically you could, but then it’s a bit tricky), so the easiest way is to use a bitboard. Ideally your struct should be isbits, but it is possible to make other things work.
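To make the four pieces listed above concrete, here is a minimal sketch of what such an isbits bitboard could look like for Connect 4. Only the names Position, canPlay, play and isOver come from the post; the field names, the 7-bits-per-column layout, the helper functions and the winner encoding (1 or 2 for the player, 0 for none) are assumptions made for illustration, not the actual AlphaGPU code.

```julia
# Assumed layout: 7 columns x 6 rows, stored with 7 bits per column.
# The extra (7th) bit of each column is an always-empty sentinel row.
const WIDTH = 7
const HEIGHT = 6

struct Position
    current::UInt64   # stones of the player to move
    mask::UInt64      # all stones on the board
    moves::Int32      # number of moves played so far
end

Position() = Position(UInt64(0), UInt64(0), Int32(0))

# Columns c are numbered 1..WIDTH.
top_mask(c) = UInt64(1) << ((HEIGHT - 1) + (c - 1) * (HEIGHT + 1))
bottom_mask(c) = UInt64(1) << ((c - 1) * (HEIGHT + 1))

# A column is playable while its top cell is empty.
canPlay(pos::Position, c) = (pos.mask & top_mask(c)) == 0

# Returns a new Position (the struct is immutable, which keeps it isbits).
# Assumes canPlay(pos, c) is true.
function play(pos::Position, c)
    # Adding the bottom bit carries through the filled cells of column c
    # and sets the first empty one.
    newmask = pos.mask | (pos.mask + bottom_mask(c))
    # Switch perspective: the opponent's stones become `current`.
    return Position(pos.current ⊻ pos.mask, newmask, pos.moves + Int32(1))
end

# True if bitboard b contains four aligned stones.
function has_four(b::UInt64)
    # Shifts: vertical, diagonal, horizontal, other diagonal.
    for shift in (1, HEIGHT, HEIGHT + 1, HEIGHT + 2)
        m = b & (b >> shift)
        (m & (m >> (2 * shift))) != 0 && return true
    end
    return false
end

# Returns (over, winner): winner is 1 or 2, or 0 for a draw / unfinished game.
function isOver(pos::Position)
    just_moved = pos.current ⊻ pos.mask   # stones of the player who just played
    if has_four(just_moved)
        return (true, isodd(pos.moves) ? 1 : 2)
    end
    return (pos.moves == WIDTH * HEIGHT, 0)
end
```

Because Position holds only plain integers, isbitstype(Position) is true, which is the property that lets values of this type live in GPU arrays and kernels without extra work.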
The other solution, which could prove better, would be to turn everything into huge arrays. This also works but is not so easy to make generic, though I believe it is far from impossible. I suppose that it could be faster, but ironically my first implementation was faster than this one. Also, I don’t know anything about CUDA, so it’s hard to optimize things. For example, the computation of the policy involves a loop and should be way faster using shared memory; I tried and it turned out slower. Maybe using an RTX 3080 renders this trick useless, or maybe other parts are so badly coded that they are the bottleneck.
The other problem is that it won’t necessarily scale (Boardlaw, the C++/CUDA implementation that I copied, works up to 9x9 for Hex and struggles on 11x11), yet it allows you to test ideas on Connect 4 very quickly, as in around one hour you can produce very good nets, so it could prove useful for researchers.
End of the wall.


Made some changes to AlphaGPU to make it more generic.
Right now I have removed Connect 4 and added Gobang at any size (up to 13x13), with a new bitboard struct that allows you to code your own games.
Also made a few changes to the code, making it slightly faster and able to train bigger games, though it tends to slow down a lot with bigger games.
Also, with time each iteration seems to take longer and longer, which is a problem every Alpha-like implementation seems to have. The CUDA allocator, though way better, seems to struggle a bit when facing fast loops, or Flux does, or both :)


Make sure you’re using CUDA.jl 3.0+ on a CUDA 11.2+ compatible driver (check CUDA.versioninfo()).

I think that is the case:

CUDA toolkit 11.2.2, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.73.1


  • CUBLAS: 11.4.1
  • CURAND: 10.2.3
  • CUFFT: 10.4.1
  • CUSOLVER: 11.1.0
  • CUSPARSE: 11.4.1
  • CUPTI: 14.0.0
  • NVML: 11.0.0+460.73.1
  • CUDNN: 8.20.0 (for CUDA 11.3.0)
    Downloaded artifact: CUTENSOR
  • CUTENSOR: 1.3.0 (for CUDA 11.2.0)


  • Julia: 1.6.0
  • LLVM: 11.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
  • Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
0: GeForce RTX 3080 (sm_86, 275.500 MiB / 9.780 GiB available)

Then it’s probably increasing GC times. You can try calling CUDA.unsafe_free! to explicitly free up data when it’s not used anymore to reduce memory pressure (and then hopefully avoid invocations of the GC).
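As a rough illustration of that suggestion, the sketch below frees a temporary device buffer as soon as it is no longer needed instead of waiting for the GC. It assumes CUDA.jl is installed and a working GPU is present; iteration_step is a hypothetical stand-in for one step of a training loop, not a function from AlphaGPU.

```julia
using CUDA

# Hypothetical stand-in for one iteration: allocate a temporary buffer on the
# GPU, reduce it to a host scalar, then release the buffer right away.
function iteration_step(n::Int)
    buf = CUDA.rand(Float32, n)   # temporary device buffer
    total = sum(buf)              # reduction; the result is a host scalar
    CUDA.unsafe_free!(buf)        # hand the memory back now; buf must not be used afterwards
    return total
end

if CUDA.functional()
    for _ in 1:100
        iteration_step(1_000_000)
    end
    CUDA.memory_status()          # print how much device memory is in use
end
```

The "unsafe" in the name is literal: touching buf after the call is an error, so this only fits buffers whose lifetime is obvious, such as per-iteration temporaries.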

Thanks for the hint about CUDA.unsafe_free!, I will definitely try it.

I updated the AlphaGPU repository: I added 4 in a row again and Reversi 6x6, all with scripts to launch training, making it easier to test.
The most important parameters to tweak are cpuct and the exploration noise; those are game changers for a successful run.
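For readers unfamiliar with these two parameters, here is a small sketch of where they usually enter an AlphaZero-style search. It uses the commonly cited PUCT form and root-noise mixing; whether AlphaGPU uses exactly these expressions is an assumption, and the function and argument names are made up for illustration.

```julia
# PUCT score of one child: value estimate q plus an exploration bonus that
# grows with the prior and shrinks as the child gets visited.
puct(q, prior, n_parent, n_child, cpuct) =
    q + cpuct * prior * sqrt(n_parent) / (1 + n_child)

# Root exploration noise: blend the network priors with a noise vector
# (typically a Dirichlet sample) using a mixing weight eps in [0, 1].
mix_noise(priors, noise, eps) = (1 - eps) .* priors .+ eps .* noise

# Pick the child with the highest PUCT score.
function select_child(qs, priors, visits, cpuct)
    n_parent = sum(visits)
    scores = [puct(qs[i], priors[i], n_parent, visits[i], cpuct) for i in eachindex(qs)]
    return argmax(scores)
end
```

A larger cpuct pushes the search toward moves the network likes but that have few visits, while a larger eps makes self-play games more varied at the cost of noisier training targets.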

WHOLLY s***. Just saw this. wonder how well it works.

Last addition: Hex, any size up to 13x13, because I’m lazy. On a 7x7 board it can beat MoHex when starting, after only 1.5 hours of training. There’s a script to manually test the net with a beautiful HexGrid thanks to Luxor.jl (that probably works only from Atom).
It seems that Hex is way easier than Reversi or Gobang for AlphaZero, which seems surprising to me.