Here’s an intermediate result, running with JULIA_NUM_THREADS=64 on a 64-core (128 hyperthreads) Epyc-2.
After this is through, I’ll try to run one more time with thread-pinning, to make sure Linux doesn’t assign two Julia threads to the same physical core. Shouldn’t happen, usually, under high load if other cores are free, but …
Awesome! Thanks a lot for running the benchmark. It looks a lot better than mine. It’s especially nice to see that many reduce-like benchmarks can follow the scaling “upper bound” sum_sin with a somewhat large number of threads (~15?).
By the way, I think this is from Julia 1.4 or 1.3, right? I’m having trouble determining the source of the deadlock in quicksort (QuickSort hangs · Issue #104 · tkf/ThreadsX.jl · GitHub), so I’d like to know when the deadlock does or does not happen.
Sorry, I forgot to remove some very excessive/redundant benchmarks … I should’ve optimized it a bit more.
@tkf, I’d like to re-run this with thread pinning, but since the benchmark runs for so long, is there a way to specify how many threads to test (e.g. via an array of thread counts)? Like, [1, 2, 3, 4, 6, ..., 20, 25, 30, ...]?
I think I can speed up each benchmark more, though. You can pass the list of thread counts to try, like ThreadsXBenchmarks.run_all("OUTDIR", nthreads_range=[1, 2, 3, ...]).
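For example, an invocation could look like the following (the output directory and the particular thread counts here are only placeholders for illustration, not the values used in this run):

using ThreadsXBenchmarks
# hypothetical output directory and an illustrative, coarse list of thread counts
ThreadsXBenchmarks.run_all("threadsx-bench-output", nthreads_range=[1, 2, 4, 8, 16, 24, 32, 48, 64])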
Have to re-run anyhow - turns out there were some low-priority numerical tasks running in parallel that grabbed time on the hyper-threads; after I paused them, performance went up. Here are the plots:
Ok, started with “opt”; I’ll see if I can run with 1:1:64 threads to get full detail. I pinned the threads now, which should ensure one thread on each physical core. It’s night here now, hope I have results for you tomorrow.
Yes, I should add an @avx_threaded macro that breaks up (what it chooses as) the outermost loop into Threads.nthreads() chunks and uses @spawn.
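For reference, here is a rough sketch of that chunk-and-spawn pattern (not LoopVectorization’s actual implementation; threaded_outer is a hypothetical helper name, and vectorization is ignored here):

function threaded_outer(f, range)
    # split the outer iteration range into roughly Threads.nthreads() chunks
    chunksize = cld(length(range), Threads.nthreads())
    tasks = map(Iterators.partition(range, chunksize)) do chunk
        Threads.@spawn foreach(f, chunk)   # one task per chunk
    end
    foreach(wait, tasks)                   # wait for all chunks to finish
end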
The trickiest part would be having it support reductions. It’d be nice if, for example, we could define a threaded 3-arg dot product like this:
function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx_threaded for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end
or maybe
function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx threads=true for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end
I’ll try and get around to it in the next few weeks. (If anyone has an API preference, please let me know! I’m leaning towards threads=true.)
EDIT:
If folks are curious about the hangs, it seems to be a problem with Julia master. I could not reproduce on Julia 1.4, nor could I on the master branch from November. I’m running a script by @tkf now to automate bisecting.
Hey @tkf, the benchmark ran through quite quickly (compared to before) - it took about 2 hours for 1 to 64 threads. Did you have a chance to update the plotting script yet?
I guess you are assuming that the outermost axis is much longer than Threads.nthreads() and the workload per element is more-or-less constant? I suppose that’s reasonable when the user is asking to vectorize the loop. But if you want to parallelize all but the innermost axis, it may be useful to use the halve function from SplittablesBase.jl (which can handle IndexCartesian-style arrays and zips of them). This is how I support something like ThreadsX.foreach(f, A, A'). I think this approach is flexible enough to handle “non-rectangular” iterations like the upper-triangular part of a matrix.
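Just to illustrate the idea, here is a minimal divide-and-conquer sketch built on halve (foreach_split and basesize are hypothetical names, not ThreadsX internals):

using SplittablesBase: halve, amount

function foreach_split(f, xs; basesize = 1024)
    if amount(xs) <= basesize
        foreach(f, xs)              # small enough: run sequentially
    else
        left, right = halve(xs)     # halve works on zips, CartesianIndices, etc.
        task = Threads.@spawn foreach_split(f, left; basesize = basesize)
        foreach_split(f, right; basesize = basesize)
        wait(task)
    end
end

With that, something like ThreadsX.foreach(f, A, A') roughly corresponds to foreach_split(((a, b),) -> f(a, b), zip(A, A')).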
It’s interesting, as I thought reduction would be the easiest part since there is no mutation (e.g., it sounds hard to chunk the iteration space appropriately to avoid simultaneously writing to a shared region). Though maybe there is something subtle when mixing it with vectorization? Naively, I’d imagine it’d be implemented as a transformation to mapreduce (see the sketch after these steps):
1. Separate the loop body into the mapping ((m, n) -> x[m] * A[m,n] * y[n]) and reduction (+) parts, and generate functions for them.
2. Determine the unroll factor and SIMD vector width.
3. Feed those functions and parameters to a parallelized and vectorized mapreduce.
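As a concrete (if simplified) illustration of steps 1 and 3 for the mydot example above, assuming ThreadsX.mapreduce as the parallel backend and skipping step 2 entirely (mydot_mapreduce is a hypothetical name):

using ThreadsX

function mydot_mapreduce(x, A, y)
    # step 1: the "mapping" part of the loop body, as a function of the index
    f(I) = @inbounds x[I[1]] * A[I] * y[I[2]]
    # step 3: feed the mapping and the reduction (+) to a parallel mapreduce
    init = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    return ThreadsX.mapreduce(f, +, CartesianIndices(A); init = init)
end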
System info: Julia v1.4 (official build), AMD EPYC 7702P, 512 GB RAM, 64 Julia threads pinned to 64 physical cores, no VM, running in a Singularity container (should not have a performance impact).
I wonder if the CPU is starting to clock down a bit above 40 threads on the sum_sin benchmark, or if we’re starting to see the Julia thread-scheduler overhead there.
Thanks a lot for doing the benchmarks! I wish I had access to this monster machine so that I could debug the saturation/degradation in performance.
I wonder if you can spot something with perf stat (when changing JULIA_NUM_THREADS). It’s sometimes useful for digging into single-thread performance. I’m not sure if it’s the best way to do this, but I usually run something like this:
function perf(f, args)
    pid = getpid()
    cmd = `perf $args --pid=$pid`
    # start perf in the background, attached to the current Julia process
    proc = run(pipeline(cmd, stdout=stdout, stderr=stderr); wait=false)
    try
        return f()
    finally
        flush(stdout)
        flush(stderr)
        # interrupt perf so it prints its report, then wait for it to exit
        kill(proc, Base.SIGINT)
        wait(proc)
    end
end
using BenchmarkTools, ThreadsX
b = @benchmarkable ThreadsX.sort($(rand(0:0.01:1, 1_000_000)); alg=MergeSort)
tune!(b)
perf(`stat -ddd`) do
    run(b)
end |> display
Hi @tkf, you’re very welcome; I really like your work on ThreadsX and Transducers. Also, I guess Julia’s new multithreading hasn’t often been torture-tested at this level so far, so maybe we can learn lots of interesting things here.
Only problem, looks like I don’t have perf available in my container instance … gimme a day or so, need to add that.
Please don’t worry about it. (Though it’d be nice if you remember it when you are very bored or something.) I guess debugging performance problems is very hard with an “asynchronous” dialog like this.
This looks great! I wonder… In Python, everyone does import numpy as np to load numeric Python. Then an autograd package came along. To make it easy to use, they suggest import autograd.numpy as np, so you don’t have to change the rest of the file at all.