Bootstrapping is fast on M1 Max, slow on HPC Knights Landing / Broadwell

Hi,

I have successfully converted a statistical bootstrapping experiment, originally written in Python, to Julia. My workflow loads multiple datasets of gridded data (Z sets of MxNxT) that represent numerical simulations with varying start conditions. For each cell (M,N), the method generates a series of length T by randomly sampling, at each index along T, from all sets in Z. A few statistical tests are then performed on the randomly sampled series. This is done for every cell of the grid and repeated some number of times.

It is a simple program and the statistical tests take about 80 lines of code, mostly sampling the data, fitting some distributions, performing statistical tests, and returning data (all in 1 function). An additional method in the code loads the grids and iterates every cell, passing required information to the statistical procedure.

Because the grid is quite large, I originally used subprocesses in Python to run multiple instances of the statistical analysis simultaneously, which significantly improved runtime. Now I am using multithreading via Julia’s Threads module.

Basically, on my M1 Max the program can do an entire grid in ~50 seconds using 10 threads (already a significant improvement over my previous program on an HPC). I also have access to Intel Xeon E5-2695v4 (Broadwell) and Intel Xeon Phi 7230 (Knights Landing) compute nodes for running this analysis. I have tried varying numbers of threads (1-36 on Broadwell and 1-256 on Knights Landing), but to no avail: the analysis takes at least 10 minutes, or multiple hours if many threads are used. Does anyone have advice on how to speed it up? It would be great if the analysis were faster in the HPC environment than on my laptop.

Here are some of my attempts to solve this:

  • Tried various # of threads
  • Set MKL_NUM_THREADS and BLAS_NUM_THREADS to 1
  • Run a sample statistical analysis before beginning the multithreaded full-grid for precompilation
  • Removed recursive calls in my code (if a statistical analysis fails, it re-calls itself with the next random seed. Wasn’t sure if this was a problem, but it’s not)
  • Isolated the code line by line to determine which lines are slow. I’m using gevfit from Extremes.jl, which seems to be the slow operation; not surprising.

BTW, I installed Julia using the directions here. This is the closest question to my problem that I have found.

How much allocation are you doing? If you run a small version of the code and @time it, how much total time and how much GC time does it take? In highly multithreaded code, threads sometimes have to halt for GC and spend a long time just waiting.
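A minimal sketch of how one might collect those numbers programmatically with `@timed` from Base, which returns the elapsed time, allocated bytes, and GC time. The function `fake_cell_analysis` is a hypothetical stand-in for one per-cell analysis, not the code from this thread; it allocates on purpose so there is something to measure.

```julia
# Hypothetical stand-in for one per-cell analysis; allocates a fresh vector per iteration.
function fake_cell_analysis(n)
    total = 0.0
    for _ in 1:n
        x = rand(20)      # new allocation every iteration
        total += sum(x)
    end
    return total
end

fake_cell_analysis(10)                        # warm-up call so compilation is not timed
stats = @timed fake_cell_analysis(100_000)

# Fraction of wall-clock time spent in garbage collection.
gc_fraction = stats.time > 0 ? stats.gctime / stats.time : 0.0
println("time (s): ", stats.time)
println("allocated bytes: ", stats.bytes)
println("GC fraction: ", gc_fraction)
```

If the GC fraction grows noticeably as the thread count goes up, that would point toward allocation pressure rather than the statistical fit itself.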

Thanks a lot for your response. I will report back with that information.

In the meantime, I continued with isolating elements of the code and comparing benchmarks between KNL and M1 Max.

  • KNL 7.429 ms (5586 allocations: 323.77 KiB)
  • BDW 815.514 μs (5586 allocations: 323.77 KiB)
  • M1 Max 418.083 μs (5586 allocations: 323.77 KiB)
using Extremes, BenchmarkTools

ams = [0.30705696, 0.26690552, 0.31271043, 0.2511067 , 0.29779163,
              0.6709108 , 0.714243  , 0.21308714, 0.23412102, 0.22688317,
              0.529936  , 0.2913758 , 0.6554807 , 0.6945615 , 0.3032903 ,
              0.35447255, 0.9780486 , 0.35815057, 0.35641265, 0.23827875]

gevfit(ams)
@btime gevfit(ams)

The M1 Max cores are much faster than the HPC ones. To benefit from the HPC nodes, I imagine you would need to use multiple clusters and distributed computing, which doesn’t seem to apply to your problem.

Thanks for your response. I will be using multiple compute nodes on the cluster in parallel to run multiple simulations of the bootstrapping (let’s say to get 500 distributions, each based on different random samples of the data). On each compute node, the program I describe above will be running.

I was hoping that the Extremes.jl gevfit was efficient on the HPC, competitive with the performance on my laptop. So I guess it is not the case :confused:

It sounds like you are reporting that the code takes longer as you include more threads. Is this true? Have you plotted speedup versus number of threads and observed reasonable scaling? If not, then your threading strategy may need revision.
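A small sketch of how speedup and parallel efficiency could be tabulated from such a study. Since the number of Julia threads is fixed at startup (`julia --threads N` or `JULIA_NUM_THREADS`), each timing would come from a separate run. The timings below are placeholders for illustration only, not measurements from this thread, and `scaling_table` is a hypothetical helper name.

```julia
# Wall-clock seconds keyed by thread count.
# PLACEHOLDER values for illustration; replace with your own measurements.
timings = Dict(1 => 100.0, 2 => 52.0, 4 => 28.0, 8 => 17.0, 16 => 16.0)

# Speedup is T(1) / T(n); efficiency is speedup / n (1.0 would be perfect scaling).
function scaling_table(timings::Dict{Int,Float64})
    t1 = timings[1]
    rows = NamedTuple[]
    for n in sort(collect(keys(timings)))
        s = t1 / timings[n]
        push!(rows, (threads = n, speedup = s, efficiency = s / n))
    end
    return rows
end

for r in scaling_table(timings)
    println(r.threads, " threads: speedup ", round(r.speedup, digits = 2),
            ", efficiency ", round(r.efficiency, digits = 2))
end
```

Efficiency that drops sharply past some thread count is the usual sign that something shared (GC, memory bandwidth, a lock) is limiting the run.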


It’s entirely plausible the M1 Max is simply at least 10x better for some things (not sure if it explains the “multiple hours”, was that for Knights Landing only?). Arm-based used to mean only low-power, not high-performance, or not Intel-level. Those days are gone (also in supercomputers).

It’s unstated, but the M1 Max is a 10-core (8× high-performance + 2× high-efficiency) 5 nm Apple Arm CPU, and at least the original M1 is known to be very good, world-class in single-threaded performance. The original was not, however, competitive with AMD (and Intel) in multi-threaded workloads.

And you’re using newer M1 Max 24 MB L2 cache (per core seemingly) plus 48 MB last level (so unified?) from a year ago vs. discontinued Intel 64-core 14 nm Q2’16 32 MB L2 Cache $1992.00 Server CPU and 14 nm 18-core (36-thread) Q1’16 “45 MB Intel® Smart Cache” $2424.00 CPU.

The M1 Max has a humongous L1 cache (which matters a lot): 192+128 KB per performance core. Intel has 32 KB. [I see AMD EPYC Embedded 3401 has “1536 KB” L1 cache, but that’s misleading since it’s 16-core, so presumably 96 KB per core?]

Why is the M1 Max so good? It has 3.5x more transistors than the original M1 (and likely at least 7x more than the Intel CPUs you compare to), so it is roughly that much better for multi-core than the M1:

M1 Max: 57 billion[5]
M1 Ultra: 114 billion (largest CPU ever made, well except for Wafer Scale Engine 2 at 2600 billion)

Largest non-Apple CPU you can run Julia on, as far as I know 2022 7 nm AMD Epyc Rome 39.54 billion (actually IBM’s Telum mainframe CPU is larger at 45 billion)

Intel’s 2018 14 nm Xeon Platinum 8180 (28-core) is Intel’s largest at 8 billion (at least in my list).

The L1 cache of all the high-performance cores combined is, though, only about 15.7 million transistors at least. The vast majority of the rest is going to be for (L2+) cache too. The clock speed of 3.2 GHz for the Arm is only really relevant for L1 cache. [Or at least you have more latency to L2, though possibly the frequency applies for throughput.] The base frequency is 2.10 GHz for the better Intel CPU (though higher with boost), and the max for the discontinued Intel Xeon Phi 7230 (Knights Landing) is 1.50 GHz.

It only has 64 cores, so going higher is unlikely to help, and the cost of having (for it, at the time) very many cores was that each was slow, i.e. it was built for server use and throughput. Maybe you need more single-threaded performance, i.e. a non-server chip.


Thank you.

Let me do some scaling studies on the BDW node, since it seems like KNL will need to be a no for this analysis. I checked my Python benchmark and I’m seeing approximately 8.64 ms on BDW for the Python equivalent. If the Julia version on BDW gives me 10x speed up, that would certainly be a success.

Thank you for the detailed comparison, it is good to put these metrics in perspective. Based on my updated benchmark for BDW, it seems to take 2x longer than the performance on the M1 Max. That might be acceptable and would constitute a major improvement from the previous ~8-9 ms my Python implementation was taking.

I was having problems with Broadwell too, but I’m going to go back in and take another look. Most of my tests were run on Knights Landing.

Allegedly each core on KNL can handle up to 4 threads which explains my attempt at 256 threads on KNL.


Oh yes, I forgot about that. Hyperthreading on Intel is usually 2 threads per core; still, people sometimes turn it off or only use as many threads as there are physical cores, since it can be slower otherwise. That may or may not apply to 4x threading too (I find it likely, i.e. not meant for HPC). If not, then yes, 256 threads could make sense; just time it.

Not only is your old Intel bad, the future doesn’t seem bright (you may want to consider AMD or Arm), I found this link at top500.org:
The Pax Chipzilla Is Over, And Intel Can’t Hold Back The Barbarians

Let’s start with how bad it is for Intel right now in the datacenter, which is all we really care about at The Next Platform. […]

In a word, it’s bad.

The Data Center and AI group that sells server CPUs, chipsets, and motherboards as well as specialized chips for AI processing, saw a 27.3 percent decline to $4.21 billion and posted an extremely anemic $17 million in operating profit, which might as well be $0 when compared up against the $2.29 billion in operating profit that Intel posted against $5.79 billion in DCAI sales in the year ago period. This was with untold how many hundreds of thousands of units of the still impending “Sapphire Rapids” Xeon SP
[…]
And even if Sapphire Rapids is the fastest Xeon in history to get to 1 million units, AMD is getting set to leap ahead on November 10 with its “Genoa” Epyc 9004 series, which is probably going to leave Intel mostly with “supply wins” instead of “design wins” at a lot of customers. Intel may be able to improve its average selling prices with Sapphire Rapids, but supply wins is the only reason why. If AMD had unconstrained supply, Intel would not just be laying off people in sales and marketing, as it has announced, but it would be cutting even deeper into the organization.

Thank you. That’s a good point, but unfortunately I don’t stock the data center :grin:. The Knights Landing are not my first choice.

I went ahead and performed a small strong scaling study for # of threads on my M1 Max and the Broadwell. My study showed both the M1 Max and Broadwell to optimize at 10 threads for this analysis (for a sample grid, 41s on M1 Max and 104s on Broadwell). I am skipping the strong scaling on Knights Landing for now since it is untenable given its poor performance, the size of the full analysis, and my short timeline / upcoming deadlines.

For reference, my Python script scales up to 36 subprocesses on Broadwell and takes 257s. So the Julia implementation is approximately 2.5x faster; not bad.

With that, now I am curious. Is there a reason that on the Broadwell, a 36-core processor, my analysis performs best at 10 threads? I set the following environment variables (which don’t really make a difference in the scaling study, BTW). Shouldn’t I be able to use at least 36 threads? Is there a different way to parallelize, e.g. using multiple subprocesses as in the Python multiprocessing package, which would allow me to exploit all resources and perhaps get even better performance? Usually I would not care that much, but since this is a massive analysis that takes weeks, any performance increase is good.

export MKL_NUM_THREADS=1
export BLAS_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
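
As a hedged aside, the BLAS thread count can also be set and checked from inside the Julia session, which makes it easy to confirm what the process is actually using. This sketch uses only the LinearAlgebra standard library; whether gevfit spends meaningful time in BLAS is an assumption that has not been checked here.

```julia
using LinearAlgebra

# Pin BLAS to a single thread so it does not compete with Julia's own threads.
BLAS.set_num_threads(1)

println("Julia threads: ", Threads.nthreads())      # fixed at startup (--threads / JULIA_NUM_THREADS)
println("BLAS threads:  ", BLAS.get_num_threads())
println("Logical CPUs:  ", Sys.CPU_THREADS)         # counts hardware threads, not physical cores
```
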

If you have memory allocations in your code, then the garbage collector might be limiting the performance: all threads use the same garbage collector which blocks all threads when active.
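One way to reduce GC pressure, sketched under assumptions: give each chunk of cells its own preallocated buffer and its own random number generator, so the resampling step allocates nothing per cell. The layout `(Z, T, ncells)`, the function name `resample_cells`, and the use of `maximum` as the per-cell statistic are all hypothetical; the real code calls gevfit, whose internal allocations (5586 per call in the benchmark above) this does not remove.

```julia
using Random

# data has size (Z, T, ncells): Z simulations, T steps, grid cells flattened (assumed layout).
function resample_cells(data::Array{Float64,3}; seed::Int = 1)
    Z, T, ncells = size(data)
    out = Vector{Float64}(undef, ncells)
    # One contiguous chunk of cells per thread.
    chunk_len = max(1, cld(ncells, Threads.nthreads()))
    chunks = collect(Iterators.partition(1:ncells, chunk_len))
    Threads.@threads for k in eachindex(chunks)
        rng = MersenneTwister(seed + k)      # per-chunk RNG, no shared state
        buf = Vector{Float64}(undef, T)      # allocated once, reused for every cell in the chunk
        for c in chunks[k]
            for t in 1:T
                buf[t] = data[rand(rng, 1:Z), t, c]   # pick a random simulation at each step
            end
            out[c] = maximum(buf)            # placeholder for the real statistical fit
        end
    end
    return out
end

data = rand(MersenneTwister(0), 5, 20, 64)
out = resample_cells(data)
println(length(out))
```

Note that with this chunking the random streams depend on the number of threads, so results are only reproducible for a fixed thread count.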

You do not have this issue when you use multiple processes instead of multiple threads, but the communication overhead is higher…
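
A minimal sketch of the multi-process route using the Distributed standard library, roughly analogous to Python’s multiprocessing. Each worker is a separate Julia process with its own garbage collector. `cell_task` is a hypothetical stand-in for the per-cell bootstrap, and the worker count of 2 is only for illustration.

```julia
using Distributed

# Start two local worker processes; on a compute node this would be closer to the core count.
addprocs(2)

# Code used by the workers must be defined on every process.
@everywhere using Random
@everywhere function cell_task(cell_id::Int)
    rng = MersenneTwister(cell_id)     # seed per cell so results do not depend on scheduling
    sample = rand(rng, 20)
    return maximum(sample)             # placeholder for the real statistical fit
end

# batch_size groups several cells per message to reduce communication overhead.
results = pmap(cell_task, 1:100; batch_size = 10)
println(length(results))

rmprocs(workers())                     # shut the workers down when finished
```

Returning only small summaries per cell, rather than the resampled series, keeps the communication cost mentioned above low.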
