[ANN] ThreadsX.jl: Parallelized Base functions

This benchmark really runs for a while … :slight_smile:

Here’s an intermediate result, running with JULIA_NUM_THREADS=64 on a 64-core (128 hyperthreads) EPYC 2.

[plot: sort_agg_plot]

[plot: fold_agg_plot]

After this is through, I’ll try to run one more time with thread pinning, to make sure Linux doesn’t assign two Julia threads to the same physical core. That usually shouldn’t happen under high load if other cores are free, but …

5 Likes

Awesome! Thanks a lot for running the benchmark. It looks a lot better than mine :slight_smile: It’s especially nice to see that many reduce-like benchmarks can follow the scaling “upper bound” sum_sin with a somewhat large number of threads (~15?).

By the way, I think this is from Julia 1.4 or 1.3, right? I’m having trouble determining the source of the deadlock in quicksort (QuickSort hangs · Issue #104 · tkf/ThreadsX.jl · GitHub), so I’d like to know when the deadlock does and does not happen.

Sorry, I forgot to remove some excessive/redundant benchmarks :sweat_smile:… I should’ve optimized it a bit more.

Running with Julia v1.4 (almost done).

@tkf, I’d like to re-run this with thread pinning, but with the benchmark running for so long, is there a way to specify (e.g. via an array of thread counts) how many threads to test? Like [1, 2, 3, 4, 6, ...., 20, 25, 30, ...]?

You can pass the list of thread counts to try like ThreadsXBenchmarks.run_all("OUTDIR", nthreads_range=[1, 2, 3, ...]). But I think I can speed up each benchmark more.
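
For example, a call along these lines (the specific thread counts here are just an illustration):

using ThreadsXBenchmarks

# Run the benchmark suite only for the listed thread counts.
ThreadsXBenchmarks.run_all("OUTDIR", nthreads_range = [1, 2, 4, 8, 16, 32, 64])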

I think this opt branch reduces benchmark time: GitHub - tkf/ThreadsXBenchmarks.jl at opt (I haven’t tried it myself yet)

Will run overnight with the opt branch.

I have to re-run anyhow: it turns out there were some low-priority numerical tasks running in parallel that grabbed time on the hyper-threads. After I paused them, performance went up. Here are the plots:

[plot: sort_agg_plot]

[plot: fold_agg_plot]

The jump at the end is due to pausing those other processes.

That’s great :laughing:

By the way, the opt branch is not compatible with the plotting script. I’ll fix it later today.

Ok, started with “opt”; I’ll see if I can run with 1:1:64 threads to get full detail. I’ve pinned the threads now, which should ensure one thread per physical core. It’s night here now, so I hope to have results for you tomorrow.

1 Like

Yes, I should add an @avx_threaded macro that breaks up (what it chooses as) the outermost loop into Threads.nthreads() chunks and uses @spawn.
The trickiest part would be having it support reductions. It’d be nice if, for example, we could define a threaded 3-arg dot product like this:

function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx_threaded for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end

or maybe

function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx threads=true for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end

I’ll try to get around to it in the next few weeks. (If anyone has an API preference, please let me know! I’m leaning towards threads=true.)
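
For reference, a hand-rolled version of that chunk-and-@spawn pattern for the dot-product example, without any vectorization, might look like the following. This is purely illustrative (mydot_chunked and the chunking choice are mine, not what the macro would actually generate):

# Split the outermost loop’s index range into Threads.nthreads() chunks,
# compute a partial sum per task, and combine the partials at the end.
function mydot_chunked(x, A, y)
    T = promote_type(eltype(x), eltype(A), eltype(y))
    chunks = Iterators.partition(axes(A, 1), cld(size(A, 1), Threads.nthreads()))
    tasks = map(chunks) do rows
        Threads.@spawn begin
            s = zero(T)
            for m in rows, n in axes(A, 2)
                s += x[m] * A[m, n] * y[n]
            end
            s
        end
    end
    return sum(fetch, tasks)
end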

EDIT:
If folks are curious about the hangs, it seems to be a problem with Julia master. I could not reproduce on Julia 1.4, nor could I on the master branch from November. I’m running a script by @tkf now to automate bisecting.

1 Like

Hey, @tkf, the benchmark ran through quite quickly in comparison to before; it took about 2 hours for 1 to 64 threads. Did you have a chance to update the plotting script yet?

1 Like

Yeah, I just pushed to the opt branch. It worked with a small dataset on my laptop.

I guess you are assuming that the outermost axis is much longer than Threads.nthreads() and the workload per element is more-or-less constant? I suppose that’s reasonable when the user is asking to vectorize the loop. But if you want to parallelize all but the innermost axis, it may be useful to use the halve function from SplittablesBase.jl (which can handle IndexCartesian-style arrays and zips of them). This is how I support something like ThreadsX.foreach(f, A, A'). I think this approach is flexible enough to handle “non-rectangular” iterations like the upper-triangular part of a matrix.
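
For a rough idea of the recursive divide-and-conquer pattern that halve enables (a sketch only: psum and the basesize cutoff are illustrative names, not ThreadsX internals):

using SplittablesBase: halve, amount

# Recursively split the collection with `halve`, spawning one half as a
# task, until chunks fall below `basesize`; then reduce sequentially.
function psum(xs; basesize = 4096)
    amount(xs) <= basesize && return sum(xs)
    left, right = halve(xs)
    task = Threads.@spawn psum(right; basesize = basesize)
    return psum(left; basesize = basesize) + fetch(task)
end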

It’s interesting, as I thought reduction would be the easiest part since there is no mutation (e.g., with mutation it sounds hard to chunk the iteration space appropriately to avoid simultaneous writes to a shared region). Though maybe there is something subtle when mixing it with vectorization? Naively, I’d imagine it’d be implemented as a transformation to mapreduce:

  1. Separate the loop body into the mapping ((m, n) -> x[m] * A[m,n] * y[n]) and reduction (+) parts and generate functions for them.
  2. Determine the unroll factor and SIMD vector width.
  3. Feed those functions and parameters to a parallelized and vectorized mapreduce (a rough sketch follows below).
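
Hand-written for the mydot example above, and ignoring step 2, that transformation might amount to something like this, with Base’s sequential mapreduce standing in for the parallelized and vectorized one (mydot_mapreduce is just an illustrative name):

# Step 1: the loop body split into a mapping over (m, n) index pairs
# plus a `+` reduction; step 3 feeds both to mapreduce.
function mydot_mapreduce(x, A, y)
    idx = Iterators.product(axes(A, 1), axes(A, 2))
    return mapreduce(((m, n),) -> x[m] * A[m, n] * y[n], +, idx;
                     init = zero(promote_type(eltype(x), eltype(A), eltype(y))))
end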
1 Like

Thanks - here we go :slight_smile:

[plot: sort_agg_plot]

[plot: fold_agg_plot]

System info: Julia v1.4 (official build), AMD EPYC 7702P, 512 GB RAM, 64 Julia threads pinned to 64 physical cores, no VM, running in a Singularity container (should not have a performance impact).

6 Likes

I wonder if the CPU is starting to clock down a bit above 40 threads on the sum_sin benchmark, or if we’re starting to see the Julia thread-scheduler overhead there.

The merge sort scaling is also interesting - threads fighting over cache?

Thanks a lot for doing the benchmarks! I wish I had access to this monster machine so that I could debug the saturation/degradation in the performance.

I wonder if you can spot something with perf stat (while changing JULIA_NUM_THREADS). It’s sometimes useful for digging into single-thread performance. I’m not sure it’s the best way to do this, but I usually run something like this:

# Run `f()` while `perf` is attached to the current Julia process;
# interrupting `perf` at the end makes it print its report.
function perf(f, args)
    pid = getpid()
    cmd = `perf $args --pid=$pid`
    proc = run(pipeline(cmd, stdout=stdout, stderr=stderr); wait=false)
    try
        return f()
    finally
        flush(stdout)
        flush(stderr)
        kill(proc, Base.SIGINT)  # SIGINT lets perf flush its summary
        wait(proc)
    end
end

using BenchmarkTools, ThreadsX

# Benchmark a parallel merge sort of 10^6 values while perf is attached.
b = @benchmarkable ThreadsX.sort($(rand(0:0.01:1, 1_000_000)); alg=MergeSort)
tune!(b)

perf(`stat -ddd`) do
    run(b)
end |> display
2 Likes

Hi @tkf, you’re very welcome. I really like your work on ThreadsX and Transducers. Also, I guess Julia’s new multi-threading hasn’t often been torture-tested at this level so far, so maybe we can learn lots of interesting things here.

Only problem: it looks like I don’t have perf available in my container instance … gimme a day or so, I need to add that.

3 Likes

It’s great to hear that you like it!

Please don’t worry about it. (Though it’d be nice if you remember it when you are very bored or something.) I guess debugging a performance problem is very hard with an “asynchronous” dialog like this.

This looks great! I wonder… In Python, everyone does import numpy as np to load numeric Python. Then an autograd package came along. To make it easy to use, they suggest import autograd.numpy as np, so you don’t have to change the rest of the file at all.

Could an approach like this work for ThreadsX?
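
Since ThreadsX deliberately mirrors Base names, the analogous pattern in Julia could be as simple as a const alias (TX is just an illustrative name, not an official convention):

import ThreadsX

const TX = ThreadsX  # analogous to Python’s `import autograd.numpy as np`

TX.sum(sin, 1:1_000_000)  # parallel counterpart of sum(sin, 1:1_000_000)
TX.sort(rand(1_000))      # parallel counterpart of sort(rand(1_000))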