I want to learn how to test Julia code, and the best way I can see to do this is to find some untested project and try to write tests for it. I can’t promise anything, as my spare time is quite a fluid thing, but I search for some good project to start with.
Cool. If this interests you and you are willing to join forces, please let me know, here or by direct message. In that case, I think it would be reasonable to ask for some advice on profiling and testing. I have been in touch with people whom I consider to have deep knowledge in those areas, and I am willing to contact them and ask whether they would be interested in providing general advice.
As for books, there is for example “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto, The MIT Press 2020 and “Algorithms for Decision Making” by Mykel J. Kochenderfer, Tim A. Wheeler, Kyle H. Wray, Massachusetts Institute of Technology 2022 (draft). There is also an interesting book titled “From Shortest Paths to Reinforcement Learning” by Paolo Brandimarte, Springer Nature Switzerland AG 2021.
I am not a professional coder; however, if you are interested, I will be happy to provide some additional information about what I have done so far (all of it very preliminary). In general, the AlphaZero training consists of 4 stages and 15 iterations. The first stage of one iteration takes: a) 1h:16m:12s on 2 x Intel Xeon Gold 6348 (56 cores / 112 threads) with 250GB RAM; b) 2h:09m:45s on Ampere Altra Neoverse N1 (160 cores) with 1 TB RAM; and c) 1h:08m:31s on 2 x AMD EPYC 7543 (64 cores / 128 threads) with 2 x NVIDIA V100 and 512GB RAM. It should be noted that the tests were run on Julia v1.6 and, as far as I understand, Neoverse N1 is not fully supported by Julia 1.6. Also, on the V100 machine only one GPU was utilized during training, and its load was only about 30%.
I will be very frank with you: I do not know what @jonathan-laurent’s current understanding is. However, my guess is that AlphaZero.jl could potentially run at least slightly faster (also on CPUs only). Again, I would like to strongly underline that this is only my guess. I would also like to add that my interest in this topic is related to a hobby project that I am working on, with a focus on “Neural network analysis of uncertainty and sensitivity of deterministic and probabilistic models in conjunction with quantum computing approaches to shortest path optimization algorithms based on geometric algebra”.
I guess so. Most people here are very busy 24/7; that is just how things are. You wrote previously about ways in which you could help me in this regard, but I don’t fully understand what you meant. Maybe you can explain it again?
Sure. You wrote that you “search for some good project to start with” in order to “learn testing in Julia by practice”, hence my suggestion of @jonathan-laurent’s AlphaZero.jl.
Sure; however, what are you implying, if I may ask? Are you implying that the machines are not performant enough for the AlphaZero.jl toy problem examples? If that is the case, we can try to connect them together, or maybe even start with a good network fabric from the very beginning. As for the GPUs, again, if that is the case, I admit that you might be right. It’s what I had handy at the time. Significantly better hardware might require signing non-disclosure agreements. If there is such a need, I will take it into consideration. In general I like the idea, and a friend of mine recently suggested the possibility of testing some unreleased hardware. I would have to think it over.
I think it was very simple and straightforward. My understanding is that you are looking to “learn testing in Julia by practice”. I am interested in the AlphaZero.jl package, as I outlined above. I have been taking part in testing AlphaZero.jl and, as I understand it, during those tests there was a suggestion to perform more in-depth tests, and new software was even created for this purpose, particularly for visualizing the timeline of the inference server. For several reasons (mostly my limited time) I put AZ.jl testing aside for a moment; however, I am interested in performing more of those tests. I have made some preparations, including for tests related to distributed computing, MPI and a new package called MPItrampoline. My understanding is that testing @jonathan-laurent’s toy problem examples is fairly demanding and takes quite a bit of time, so if this interests you, your involvement could be useful. Coding is not my area of expertise and, as I understand it, you mentioned the process of learning, hence my suggestion to contact people with deep knowledge of testing and profiling and ask for general advice.
I believe that proposing new tests for existing Julia projects is an excellent way to learn more about the language and contribute to the Julia ecosystem. I would of course welcome any such contribution to my AlphaZero.jl package. Indeed, only two thirds of the codebase are currently covered by tests.
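As a minimal sketch of what a small test contribution might look like, assuming nothing about AlphaZero.jl's actual API: the `normalize_policy` function below is hypothetical and only stands in for whatever function is under test. The `Test` standard library supplies `@testset`, `@test` and `@test_throws`.

```julia
using Test

# Hypothetical function standing in for code under test; NOT part of AlphaZero.jl.
# It rescales non-negative weights so that they sum to one.
function normalize_policy(weights::AbstractVector{<:Real})
    any(w -> w < 0, weights) && throw(ArgumentError("weights must be non-negative"))
    total = sum(weights)
    total > 0 || throw(ArgumentError("weights must not sum to zero"))
    return weights ./ total
end

@testset "normalize_policy" begin
    p = normalize_policy([1, 1, 2])
    @test sum(p) ≈ 1.0                  # result is a probability vector
    @test p ≈ [0.25, 0.25, 0.5]         # proportions are preserved
    @test_throws ArgumentError normalize_policy([0, 0])   # degenerate input
    @test_throws ArgumentError normalize_policy([-1, 2])  # negative weight
end
```

In a package, tests of this kind conventionally live in `test/runtests.jl` and are run with `Pkg.test()`; passing `coverage=true` to `Pkg.test` additionally records line coverage.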
Regarding performance testing (in contrast with correctness testing), there is work to be done on AlphaZero.jl, although right now is probably not the best time to contribute. Indeed, I am already aware of the source of many performance issues (e.g. over-reliance on the GC to free memory / memory contention issues due to having too many non-cooperative threads…) and planning to work on them. Once this work is done, I will be happy to have other people run new performance benchmarks and I’ll provide assistance for interested people to do so.
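For readers unfamiliar with how allocation pressure shows up in measurements, here is a minimal sketch using only Base Julia. It does not reproduce the specific issues mentioned above; the two `sum_squares_*` functions are hypothetical workloads unrelated to AlphaZero.jl. `@allocated` reports the bytes allocated by an expression, and `@timed` returns a named tuple whose fields include `time`, `bytes` and `gctime`.

```julia
# Hypothetical workloads for illustration only; not taken from AlphaZero.jl.

# Materializes a temporary array of squares on every call.
sum_squares_alloc(xs) = sum(xs .^ 2)

# Computes the same value without building the temporary array.
sum_squares_lazy(xs) = sum(x -> x^2, xs)

xs = rand(10^6)

# Call each function once first so that compilation is not measured.
sum_squares_alloc(xs)
sum_squares_lazy(xs)

bytes_alloc = @allocated sum_squares_alloc(xs)
bytes_lazy  = @allocated sum_squares_lazy(xs)
println("allocating version: ", bytes_alloc, " bytes")
println("lazy version:       ", bytes_lazy, " bytes")

# @timed also reports how much of the elapsed time was spent in the GC.
stats = @timed sum_squares_alloc(xs)
println("time: ", stats.time, " s, of which GC: ", stats.gctime, " s")
```

Comparing the two byte counts shows how much garbage the temporary array creates per call; in a long-running, multi-threaded program, that kind of garbage is what the GC later has to collect.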
The following few sentences are absolutely not meant as a complaint about the performance issues of AlphaZero.jl. Indeed, AFAIK, there are other Julia implementations that might be more performant; however, AFAIK, none of them is as complete, also in terms of the documentation provided. I just wanted to point out that the package seems to be a performance (stress) test in its own right, both for the Julia programming language (ecosystem) and for the hardware (whether CPUs or GPUs), which sometimes makes it inaccessible for more advanced substantive / correctness tests or even for basic familiarization. In my opinion, it is a very good case for somebody who wants to “learn testing in Julia by practice” and at the same time be in contact with some cutting-edge / close to cutting-edge scientific developments, hence my suggestion to @KZiemian.
@jonathan-laurent Hey, are you still interested in that profiling and optimization data, or is it no longer relevant? I was planning to run some of those tests; however, after the new information you provided I decided to ask, as it is currently somewhat unclear.
Actually, it would definitely be useful if you could collect a debugging timeline on one of the machines on which you observed disappointing performance and share it as a JSON file. This is the one thing I am most interested in at the moment.
@KZiemian
I know that there is going to be an online developer event on or around Pi Day (March 14) with a machine learning component, which I am planning to attend (tbc). Should you be interested in joining, it would be my pleasure. It’s organized by my home company (I am not affiliated with this company; “home” in the sense that it is where I gained my first professional experience in my early days; the company is cool, and among other things they run cloud data centers with a focus on HPC).
Aren’t you being a little too harsh on yourself? IMO, the code is running really well. I did not expect it to be possible to train the default examples on CPUs. It was a big, positive surprise for me.
Which architecture would be the most interesting for you? I would be most interested in testing on x86. I do not currently have access to the machine with 2 x NVIDIA V100, and I am in the process of transitioning to a GPU-powered machine. As for Arm, my current setup consists of half of the previously mentioned cores (80).
If you look at “A Survey of Deep Learning on CPUs: Opportunities and Co-optimizations” by Sparsh Mittal, Poonam Rajput and Sreenivas Subramoney, or at similar papers, CPUs look like a viable and cost-effective option for some DL training. Also, the paper “Comparing Julia to Performance Portable Parallel Programming Models for HPC” by Wei-Chen Lin and Simon McIntosh-Smith (page 10, paragraphs 4, 5 and 8) says that Julia is encountering difficulties on some Arm platforms. Thus again, for now, my suggestion would be x86.
Last but not least, would you consider providing a clear, fixed code snippet to be used for such tests, so that they are accessible and reproducible? If so, would you also consider making it CLI friendly? AFAIK, the AlphaZero.jl software for visualizing the timeline of the inference server currently requires Xorg, which makes CLI-only operation maybe not impossible but sometimes difficult.