Thanks! The package is very interesting.
[…] I am leaning toward changing the parallelization strategy at some point.
Are you planning to support GPUs? If so, only CUDA, or other vendors as well?
I had tested it on another machine, but I don’t have access to that many processors.
I ran it on a two-socket machine with a CPU-count ratio quite similar to the one in your example. I thought it might be useful.
[…] I still haven’t found a solution (or the cause), […]
I am enclosing the code that I used to prepare the results. I did some reading on BenchmarkTools.jl, TimerOutputs.jl, and @profile. I am wondering whether it would be useful to use more advanced options of those tools? Or maybe you have a suggestion on how to transfer @btime and @time outputs into a dataframe in order to prepare the charts? I will be happy to adjust the code, or, if you find it suitable and decide to provide an update with examples, to run the tests again.
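One way to avoid parsing the printed output of @btime at all (a sketch, assuming BenchmarkTools.jl and DataFrames.jl are installed; the benchmarked calls below are placeholders, not the actual test code): @belapsed returns the minimum time as a plain Float64, so it can be pushed straight into a DataFrame row.

```julia
# Sketch: collect benchmark timings into a DataFrame for plotting.
# The benchmarked functions here are placeholders -- substitute the
# calls from the enclosed test code.
using BenchmarkTools, DataFrames

results = DataFrame(case = String[], nthreads = Int[], min_time_s = Float64[])

x = rand(10^6)
for (name, f) in [("sum", () -> sum(x)), ("sort", () -> sort(x))]
    t = @belapsed $f()   # like @btime, but returns the minimum time (seconds) as a value
    push!(results, (case = name, nthreads = Threads.nthreads(), min_time_s = t))
end

results  # ready for CSV.write or for feeding into a plotting package
```

From there the DataFrame can be saved per run and concatenated across thread counts to build the charts.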
I will look carefully at your results. Let me know if you discover or suspect anything.
As for the software side, I do not have enough knowledge to analyze it in detail, even less so compilers. On a more general level, one thing that came to mind was OpenBLAS. When I was using other packages, particularly AlphaZero.jl, I noticed that results are sometimes sensitive to the particular BLAS library, the way Julia is launched with that library, or the particular load (tbc).
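In case it helps to rule that out, here is a small sketch (assuming Julia ≥ 1.7 for BLAS.get_config) of how one could record the BLAS setup and pin its thread count before the tests, so that OpenBLAS threading does not compete with Julia's own threads:

```julia
# Sketch: inspect and pin the BLAS configuration before benchmarking,
# to rule out BLAS threading as a source of run-to-run variance.
using LinearAlgebra

println(BLAS.get_config())        # which BLAS backend is actually loaded (Julia >= 1.7)
println(BLAS.get_num_threads())   # current BLAS thread count

BLAS.set_num_threads(1)           # pin BLAS to a single thread so it does not
                                  # oversubscribe the cores during the tests
```

Logging this alongside each result would at least make the BLAS state reproducible between runs.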
As for the hardware, I guess I can try to elaborate on the CPU design, the L1/L2/L3 caches, and some peculiarities of memory access and scheduling; however, I would be taking a risk … … maybe @tkf would be interested and willing to provide some insights? Seriously, I guess that would be very interesting.
Again, I am very happy to run some additional tests and prepare results, within the limits of my knowledge, should there be such a need.