Here’s an intermediate result, running with JULIA_NUM_THREADS=64 on a 64-core (128 hyperthreads) Epyc-2.
After this is through, I’ll try to run one more time with thread-pinning, to make sure Linux doesn’t assign two Julia threads to the same physical core. Shouldn’t happen, usually, under high load if other cores are free, but …
Awesome! Thanks a lot for running the benchmark. It looks a lot better than mine. It’s especially nice to see that many reduce-like benchmarks can follow the scaling “upper bound” sum_sin with a somewhat large number of threads (~15?).
By the way, I think this is from Julia 1.4 or 1.3, right? I’m having trouble determining the source of the deadlock in quicksort (QuickSort hangs · Issue #104 · tkf/ThreadsX.jl · GitHub), so I’d like to know when the deadlock does or does not happen.
Sorry, I forgot to remove some very excessive/redundant benchmarks … I should’ve optimized it a bit more.
@tkf, I’d like to re-run this with thread pinning, but since the benchmark runs for so long, is there a way to specify how many threads to test (e.g. via an array of thread counts)? Like, [1, 2, 3, 4, 6, ..., 20, 25, 30, ...]?
I think I can speed up each benchmark more, though. You can pass the list of thread counts to try, like ThreadsXBenchmarks.run_all("OUTDIR", nthreads_range=[1, 2, 3, ...]).
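For example, an invocation could look like the following (the output directory and the particular thread counts here are only placeholders for illustration, not the values used in this run):

using ThreadsXBenchmarks
# hypothetical output directory and an illustrative, coarse list of thread counts
ThreadsXBenchmarks.run_all("threadsx-bench-output", nthreads_range=[1, 2, 4, 8, 16, 24, 32, 48, 64])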
Have to re-run anyhow - turns out there were some low-priority numerical tasks running in parallel that grabbed time on the hyper-threads; after I paused them, performance went up. Here are the plots:
Ok, started with “opt”; I’ll see if I can run with 1:1:64 threads to get full detail. I pinned the threads now, which should ensure one thread on each physical core. It’s night here now, hope I have results for you tomorrow.
Yes, I should add an @avx_threaded macro that breaks up (what it chooses as) the outermost loop into Threads.nthreads() chunks and uses @spawn.
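For reference, here is a rough sketch of that chunk-and-spawn pattern (not LoopVectorization’s actual implementation; threaded_outer is a hypothetical helper name, and vectorization is ignored here):

function threaded_outer(f, range)
    # split the outer iteration range into roughly Threads.nthreads() chunks
    chunksize = cld(length(range), Threads.nthreads())
    tasks = map(Iterators.partition(range, chunksize)) do chunk
        Threads.@spawn foreach(f, chunk)   # one task per chunk
    end
    foreach(wait, tasks)                   # wait for all chunks to finish
end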
The trickiest part would be having it support reductions. It’d be nice if, for example, we could define a threaded 3-arg dot product like this:
function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx_threaded for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end
or maybe
function mydot(x, A, y)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @avx threads=true for m in 1:size(A,1), n in 1:size(A,2)
        s += x[m] * A[m,n] * y[n]
    end
    s
end
I’ll try and get around to it in the next few weeks. (If anyone has an API preference, please let me know! I’m leaning towards threads=true.)
EDIT:
If folks are curious about the hangs, it seems to be a problem with Julia master. I could not reproduce on Julia 1.4, nor could I on the master branch from November. I’m running a script by @tkf now to automate bisecting.
Hey @tkf, the benchmark ran through quite quickly (compared to before) - it took about 2 hours for 1 to 64 threads. Did you have a chance to update the plotting script yet?
I guess you are assuming that the outermost axis is much longer than Threads.nthreads() and the workload per element is more-or-less constant? I suppose that’s reasonable when the user is asking to vectorize the loop. But if you want to parallelize all but the innermost axis, it may be useful to use the halve function from SplittablesBase.jl (which can handle IndexCartesian-style arrays and zips of them). This is how I support something like ThreadsX.foreach(f, A, A'). I think this approach is flexible enough to handle “non-rectangular” iterations like the upper-triangular part of a matrix.
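Just to illustrate the idea, here is a minimal divide-and-conquer sketch built on halve (foreach_split and basesize are hypothetical names, not ThreadsX internals):

using SplittablesBase: halve, amount

function foreach_split(f, xs; basesize = 1024)
    if amount(xs) <= basesize
        foreach(f, xs)              # small enough: run sequentially
    else
        left, right = halve(xs)     # halve works on zips, CartesianIndices, etc.
        task = Threads.@spawn foreach_split(f, left; basesize = basesize)
        foreach_split(f, right; basesize = basesize)
        wait(task)
    end
end

With that, something like ThreadsX.foreach(f, A, A') roughly corresponds to foreach_split(((a, b),) -> f(a, b), zip(A, A')).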
It’s interesting, as I thought reduction would be the easiest part since there is no mutation (e.g., it sounds hard to chunk the iteration space appropriately to avoid simultaneously writing to a shared region). Though maybe there is something subtle when mixing it with vectorization? Naively, I’d imagine it’d be implemented as a transformation to mapreduce (see the sketch after these steps):
1. Separate the loop body into the mapping ((m, n) -> x[m] * A[m,n] * y[n]) and reduction (+) parts, and generate functions for them.
2. Determine the unroll factor and SIMD vector width.
3. Feed those functions and parameters to a parallelized and vectorized mapreduce.
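As a concrete (if simplified) illustration of steps 1 and 3 for the mydot example above, assuming ThreadsX.mapreduce as the parallel backend and skipping step 2 entirely (mydot_mapreduce is a hypothetical name):

using ThreadsX

function mydot_mapreduce(x, A, y)
    # step 1: the "mapping" part of the loop body, as a function of the index
    f(I) = @inbounds x[I[1]] * A[I] * y[I[2]]
    # step 3: feed the mapping and the reduction (+) to a parallel mapreduce
    init = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    return ThreadsX.mapreduce(f, +, CartesianIndices(A); init = init)
end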
System info: Julia v1.4 (official build), AMD EPYC 7702P, 512 GB RAM, 64 Julia threads pinned to 64 physical cores, no VM, running in a Singularity container (should not have a performance impact).
I wonder if the CPU is starting to clock down a bit above 40 threads on the sum_sin benchmark, or if we’re starting to see the Julia thread-scheduler overhead there.
Thanks a lot for doing the benchmarks! I wish I had access to this monster machine so that I could debug the saturation/degradation in performance.
I wonder if you can spot something with perf stat (when changing JULIA_NUM_THREADS). It’s sometimes useful for digging into single-thread performance. I’m not sure if it’s the best way to do this, but I usually run something like this:
function perf(f, args)
    pid = getpid()
    cmd = `perf $args --pid=$pid`
    # start perf in the background, attached to the current Julia process
    proc = run(pipeline(cmd, stdout=stdout, stderr=stderr); wait=false)
    try
        return f()
    finally
        flush(stdout)
        flush(stderr)
        # interrupt perf so it prints its report, then wait for it to exit
        kill(proc, Base.SIGINT)
        wait(proc)
    end
end
using BenchmarkTools, ThreadsX
b = @benchmarkable ThreadsX.sort($(rand(0:0.01:1, 1_000_000)); alg=MergeSort)
tune!(b)
perf(`stat -ddd`) do
    run(b)
end |> display
Hi @tkf, you’re very welcome; I really like your work on ThreadsX and Transducers. Also, I guess Julia’s new multithreading hasn’t often been torture-tested at this level so far, so maybe we can learn lots of interesting things here.
Only problem, looks like I don’t have perf available in my container instance … gimme a day or so, need to add that.
Please don’t worry about it. (Though it’d be nice if you remember it when you are very bored or something.) I guess debugging performance problems is very hard with an “asynchronous” dialog like this.
This looks great! I wonder… In Python, everyone does import numpy as np to load numeric Python. Then an autograd package came along. To make it easy to use, they suggest import autograd.numpy as np, so you don’t have to change the rest of the file at all.