I have converted a statistical bootstrapping experiment, originally written in Python, to Julia. My workflow loads multiple datasets of grid data (Z sets of M×N×T arrays) that represent numerical simulations with varying start conditions. The main methodology is: for each cell (m, n), generate a series of length T by randomly sampling from the Z sets at each time step. A few statistical tests are then performed on the randomly sampled series. This is done for every cell of the grid and repeated some number of times.
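For concreteness, the resampling step looks roughly like this (a minimal sketch; `sets` and `bootstrap_series` are my own names, assuming each simulation run is an M×N×T array):

```julia
# Sketch of the per-cell bootstrap resampling, assuming the Z runs are
# stored as a vector of M×N×T arrays. Names here are illustrative only.
function bootstrap_series(sets::Vector{<:AbstractArray{<:Real,3}}, m::Int, n::Int)
    T = size(first(sets), 3)
    Z = length(sets)
    series = Vector{Float64}(undef, T)
    for t in 1:T
        # at each time step, draw the cell's value from a randomly chosen run
        series[t] = sets[rand(1:Z)][m, n, t]
    end
    return series
end
```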
It is a simple program: the statistical tests take about 80 lines of code, mostly sampling the data, fitting some distributions, performing statistical tests, and returning data (all in one function). A second function loads the grids, iterates over every cell, and passes the required information to the statistical procedure.
Because the grid is quite large, I originally used subprocesses in Python to run multiple instances of the statistical analysis simultaneously, which significantly improved runtime. Now I am using multithreading via Julia's Threads module.
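The threaded driver is essentially a parallel loop over cells, something like this sketch (`analyze_cell` is a placeholder standing in for my real ~80-line statistical procedure):

```julia
# Placeholder for the real statistical procedure (sampling, fitting,
# tests); here it just sums the first time step across runs.
analyze_cell(sets, m, n) = sum(s[m, n, 1] for s in sets)

# Sketch of the threaded per-cell loop. The grid is flattened to a single
# index range so Threads.@threads can partition the cells across threads.
function run_grid(sets)
    M, N, _ = size(first(sets))
    results = Array{Float64}(undef, M, N)
    Threads.@threads for idx in 1:(M * N)
        m, n = fldmod1(idx, N)   # recover (row, column) from the flat index
        results[m, n] = analyze_cell(sets, m, n)
    end
    return results
end
```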
On my M1 Max, the program can process an entire grid in ~50 seconds using 10 threads (already a significant improvement over my previous program on an HPC). I also have access to Intel Xeon E5-2695v4 (Broadwell) and Intel Xeon Phi 7230 (Knights Landing) compute nodes for running this analysis. I have tried varying numbers of threads (1-36 on Broadwell and 1-256 on Knights Landing), but to no avail: the analysis takes at least 10 minutes, or multiple hours if many threads are used. Does anyone have advice on how to speed it up? It would be great if the analysis were faster on the HPC than on my laptop.
Here are some of my attempts to solve this:
- Tried various numbers of threads
- Set MKL_NUM_THREADS and BLAS_NUM_THREADS to 1
- Ran a sample statistical analysis before the multithreaded full-grid pass, so that compilation happens up front
- Removed recursive calls in my code (if a statistical analysis fails, it re-calls itself with the next random seed; I wasn't sure whether this was a problem, but it isn't)
- Profiled line by line to determine where the time goes. I'm using gevfit from Extremes.jl, which seems to be the slow operation; not surprising.
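For reference, the BLAS and warm-up steps from the list above can also be done from within Julia. This is a sketch, with `analyze_once` standing in for my real statistical function (which is what ultimately calls Extremes.jl's gevfit):

```julia
using LinearAlgebra, Statistics, Random

# Keep BLAS single-threaded so the Julia threads don't oversubscribe cores
# (in-process equivalent of the environment-variable approach above).
BLAS.set_num_threads(1)

# Warm-up: call the statistical procedure once on a small synthetic sample
# so JIT compilation happens before the timed, threaded full-grid pass.
# `analyze_once` is a placeholder; the real function fits distributions
# (e.g. via Extremes.jl's gevfit) and runs the statistical tests.
analyze_once(x) = (mean(x), std(x))
warmup = randn(MersenneTwister(1), 64)
analyze_once(warmup)
```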