Hello Julia Community!
My name is Andreas Spanopoulos, and I am one of the JSoC-funded students. This past summer, I worked on a simple, hackable, fully documented, full-GPU implementation of AlphaZero in Julia. With the guidance of my mentor Jonathan Laurent, the original author of AlphaZero.jl, we achieved a remarkable 8x speedup over the previous version of AlphaZero.jl! More on our work can be found here.
Some of the key features and advancements of this redesign are:
- Code Quality: We prioritized modularity, readability, and extensibility to ensure that the codebase serves as a robust foundation for future developments and contributors.
- Documentation and Testing: All main components have been thoroughly documented and tested.
- Device Agnosticism: One of the highlights of this project is that AlphaZero, and all its sub-components (like MCTS, environment simulation, and neural network training), can run seamlessly on either CPU or GPU. This feat was achieved without resorting to custom GPU kernels or GPU-specific constructs, thanks to GPUCompiler.
One bottleneck of running parallel MCTS with a GPU is Neural Network (NN) evaluation: environment states that live on the CPU have to be transferred to the GPU before they can be passed through the NN. To tackle this overhead, boardlaw and AlphaGPU run MCTS fully (tree search and NN evaluation) on the GPU, thus avoiding data transfers. The latter has been implemented in Julia, and makes use of custom CUDA kernels and GPU-specific code to achieve its results. Our work draws inspiration from AlphaGPU in the sense that we also implement a full-GPU version of AlphaZero, with an additional focus on code readability and modularity, and without writing custom CUDA kernels.
To avoid writing custom kernels, we relied on:
a) Broadcasting operations on CuArrays (using the dot-notation).
b) map() on a CuArray, which runs the code inside the do-block on the GPU (as long as that code satisfies GPU-friendliness constraints).
The former can be used to run any broadcastable operation on the GPU, while the latter can be used to run tree search or environment simulation code in parallel. This has been made possible by CUDA.jl, which in turn uses GPUCompiler.jl. While crafting custom CUDA kernels with GPU-specific constructs like shared memory could potentially offer further performance gains, most of the code is not bottlenecked in this regard, and doing so would significantly ramp up complexity. The key is minimizing CPU-GPU data transfers, and that’s exactly what the two techniques above let us do.
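To make the two techniques concrete, here is a minimal sketch of the pattern. It is not code from AlphaZero.jl: the toy environment and the names `WalkState`, `advance`, `rollout`, `reward` and `simulate` are hypothetical and only serve the illustration. The example runs on a plain CPU `Array`. The assumption is that the same function would also run on the GPU when given a `CuArray` from CUDA.jl, because the state type is a plain bits type and the mapped code does not allocate.

```julia
# Hypothetical toy environment (not part of AlphaZero.jl).
# The struct is immutable and contains only Int32 fields, so it is a
# plain bits type, which is what GPU arrays require of their elements.
struct WalkState
    position::Int32
    steps::Int32
end

# Pure, allocation-free transition function.
advance(s::WalkState, move::Int32) = WalkState(s.position + move, s.steps + Int32(1))

# Deterministic rollout: +2 on odd steps, -1 on even steps.
function rollout(s::WalkState, depth::Int32)
    for i in Int32(1):depth
        move = isodd(i) ? Int32(2) : Int32(-1)
        s = advance(s, move)
    end
    return s
end

# Scalar reward; it is broadcast over all states below.
reward(s::WalkState, depth::Int32) = Float32(s.position) / Float32(depth)

# Device-agnostic batch simulation: accepts any AbstractVector of states.
function simulate(states::AbstractVector{WalkState}, depth::Int32)
    # Technique b): map with a do-block, one independent rollout per state.
    finals = map(states) do s
        rollout(s, depth)
    end
    # Technique a): broadcasting with the dot-notation.
    rewards = reward.(finals, depth)
    return finals, rewards
end

# CPU usage. Assumption: with CUDA.jl loaded, passing CuArray(states)
# instead of states would execute the same code on the GPU.
states = [WalkState(Int32(p), Int32(0)) for p in 0:3]
finals, rewards = simulate(states, Int32(4))
```

Nothing in `simulate` mentions the device, so the choice between CPU and GPU is made entirely by the type of the array passed in.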
Working with Julia was quite instructive. From its vibrant community to the range of open-source projects and the power of the VSCode extension, the experience has been overwhelmingly positive. If there’s one area for improvement, it’s that error messages in GPU programming via GPUCompiler can sometimes be less than helpful. But, overall, Julia has proved to be incredibly efficient for the task at hand.
I’d like to conclude with a shoutout to the potential of Julia in the realm of Reinforcement Learning (RL). Despite RL not being as popular in the Julia ecosystem as in others, the ability to write device-agnostic code without sacrificing performance showcases Julia’s unique capabilities. Data generation is the bottleneck in many RL applications, and the fact that this issue can be dealt with in Julia without resorting to C/Cython/Numba/JAX makes it a powerhouse.
Lastly, I’d like to thank the Julia team for giving me access to cyclops to run my experiments, as well as for funding this project. The latter means the world to me, as I’ll use those funds to support my postgraduate studies. I’d also like to thank my mentor, Jonathan, for all the knowledge he passed on to me during these 3 months, as well as for his promptness and availability to help me despite his very tight schedule.
If anyone would like to get in touch with me regarding this project, my email and LinkedIn profile can be found on my GitHub. Don’t be afraid to reach out!