[ANN] ThreadsX.jl: Parallelized Base functions

Yeah, it’s a neat property that you can easily swap the implementation in Python by creating a NumPy-compatible package. Though one thing that bothers me with this approach is that you still need to edit the code. Of course, we can pass it around as an argument:

ismaxgreater(impl, f, g, xs) = impl.maximum(f, xs) > impl.maximum(g, xs)
ismaxgreater(f, g, xs) = ismaxgreater(ThreadsX, f, g, xs)

However, passing a module as an argument in Julia causes a type instability. I think the “impl” object has to be something with a type-stable getproperty (e.g., a named tuple).
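
For what it’s worth, here is a minimal sketch of such a “type-stable module”: just a named tuple of functions (the names ThreadsXImpl and BaseImpl are invented here for illustration):

using ThreadsX

# A named tuple of functions acts as a "type-stable module":
const ThreadsXImpl = (maximum = ThreadsX.maximum, mapreduce = ThreadsX.mapreduce)
const BaseImpl = (maximum = Base.maximum, mapreduce = Base.mapreduce)

ismaxgreater(impl, f, g, xs) = impl.maximum(f, xs) > impl.maximum(g, xs)

# getproperty on a named tuple is type-stable, so these calls infer:
ismaxgreater(ThreadsXImpl, abs, abs2, rand(100))
ismaxgreater(BaseImpl, abs, abs2, rand(100))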

In JuliaLang/julia#34185 vchuravy suggested using something like C++'s execution policies, which has a similar outcome. I want to do something similar with Transducers.jl too: Composable API for asserting algebraic properties and specifying execution strategy · Issue #143 · tkf/Transducers.jl. I think an approach like this also helps us get something like ThreadsX.jl for Distributed.jl easily (freely?).

In terms of syntax, I think impl.sort!(xs) makes (somewhat) more sense than sort!(xs, impl), as it’s clearer that the first form does not mutate the impl object. It also makes sense since impl is just a “type-stable module”, and it’s nice that this can be done completely outside Base. OTOH, this syntax creates a hard boundary between the pre-defined API (e.g., sort! and maximum) and the user-defined API; i.e., ismaxgreater has to be called with impl as an argument and not as impl.ismaxgreater. (Though I think it is possible to create a clever impl object that overlays ismaxgreater on top of an existing impl; see the sketch below.)
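
The “overlay” idea could look roughly like this (a sketch; the Overlay name is invented here, and getproperty stays type-stable only as long as the property name constant-propagates):

struct Overlay{E,I}
    extras::E  # e.g. a NamedTuple of user-defined functions
    inner::I   # e.g. ThreadsX, or a named-tuple "type-stable module"
end

function Base.getproperty(o::Overlay, name::Symbol)
    extras = getfield(o, :extras)
    return haskey(extras, name) ? extras[name] : getproperty(getfield(o, :inner), name)
end

# impl = Overlay((ismaxgreater = (f, g, xs) -> ismaxgreater(ThreadsX, f, g, xs),), ThreadsX)
# impl.ismaxgreater(abs, abs2, xs)  # user-defined, called with the same syntax
# impl.maximum(abs, xs)             # everything else is forwarded to ThreadsX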

ThreadsX is ignoring all these considerations and exposing the API in a dead-simple way, as designing something like this takes time. I also think you don’t need this kind of API until there are multiple implementations that you can swap. It is also important to remind yourself, when reading the code, that things passed to ThreadsX functions have to be “thread-friendly” (in the sense described in GitHub - tkf/ThreadsX.jl: Parallelized Base functions). I think an explicit prefix helps with that.

Great explanation, thanks!

Would this package be an appropriate place to include a multi-threaded version of broadcast? I’m imagining a macro @threaded a .+ log.(b .^ 2) which converts that expression into a fused, multi-threaded broadcast kernel.

Strided.jl contains a macro @strided that works even with Julia 1.0. I guess I should at least link it from my README.
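
For reference, usage looks roughly like this (a sketch assuming a and b are ordinary dense arrays; @strided wraps them in StridedViews and the fused broadcast is then executed on multiple threads when Julia is started with more than one thread):

using Strided

a = rand(10^6); b = rand(10^6)

y = @strided a .+ log.(b .^ 2)  # fused, multi-threaded broadcast over StridedViews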

Yeah, I’m aware of Strided.jl, but it has some limitations that your Transducers.jl library doesn’t have, like only working on dense, strided arrays. I suspect there are other advantages to your transducer-based approach, such as load balancing, composable scheduling, and so forth. Maybe I can try to mock up a package for this.

Ah, I see. I guess load-balancing in broadcasting is not super important (as it’s all about “vectorizing” the same operation). Also, it’s not so hard to put it in Strided.jl. For nested parallelism, I think making @threads for loops nestable with your PR (or the other one) is enough.

Having said that, I think a transducer-based implementation is going to be appealing if we have some kind of filtering built into the broadcasting syntax. I keep thinking about this, but it’s not obvious to me how to do it (in the sense that it’s hard to come up with something better than the iterator comprehension syntax).
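
To spell out the comparison, the “iterator comprehension syntax” for a filtered reduction is just this (plain Base sum here, only to illustrate the syntax a hypothetical filtered broadcast would have to beat):

xs = randn(1000)
sum(abs2(x) for x in xs if x > 0)  # map + filter + reduce in one comprehension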

I also agree that an implementation strategy based on foldl rather than explicit indexing may be beneficial for arrays that are not dense/strided. I’ve been wanting to implement something like Home · NDReducibles.jl for sparse arrays using SIMD.jl’s gather instruction.

Does that mean that the current benchmarks are seeing that improvement without any benefit from SIMD? If so, that’s really impressive, and it would mean they’d get much faster when combined. It is my understanding that if you call most Base functions with a dot broadcast, they will automatically use AVX instructions.

I meant to say that ThreadsX.jl uses SIMD via auto-vectorization just as much as the Base functions do. Like Base (or a straightforward manual implementation), I’m not using SIMD explicitly (e.g., via SIMD.jl) and only rely on the compiler to magically do this for me.

So yeah, it’d be interesting to see how much more performance we can squeeze out by optimizing the single-thread performance. I think it should be straightforward to add (say) a simdvec::Int = n option to use SIMD.Vec{n}, and also an unroll::Int option to manually tweak the unrolling factor. Of course, we have LoopVectorization.jl (and the compiler) to automate this process, so I’m not sure if this is a good feature to add.
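
To illustrate what a simdvec-style option would roughly do under the hood, here is a sketch of an explicit SIMD.jl reduction kernel (the name simd_sum and the hard-coded width of 4 are made up for illustration; this is not a ThreadsX option):

using SIMD

# Sum a Float64 vector with explicit 4-wide SIMD vectors plus a scalar tail loop.
function simd_sum(xs::Vector{Float64}, ::Val{N} = Val(4)) where {N}
    lane = VecRange{N}(0)
    acc = Vec{N,Float64}(0.0)
    i = 1
    @inbounds while i + N - 1 <= length(xs)
        acc += xs[lane + i]  # load N contiguous elements as a Vec and accumulate
        i += N
    end
    s = sum(acc)  # horizontal reduction over the SIMD lanes
    @inbounds while i <= length(xs)
        s += xs[i]  # scalar remainder
        i += 1
    end
    return s
end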

What about

julia> using ThreadsX, BenchmarkTools

julia> const impl = ThreadsX # Base
ThreadsX

julia> selfdot(x) = impl.mapreduce(abs2 , +, x)
selfdot (generic function with 1 method)

julia> x = rand(10^4);

julia> @code_warntype selfdot(x)
Variables
  #self#::Core.Compiler.Const(selfdot, false)
  x::Array{Float64,1}

Body::Any
1 ─ %1 = ThreadsX.mapreduce::Core.Compiler.Const(ThreadsX.mapreduce, false)
│   %2 = (%1)(Main.abs2, Main.:+, x)::Any
└──      return %2

julia> @btime $x' * $x
  1.204 μs (0 allocations: 0 bytes)
3313.78577447527

julia> @btime selfdot($x)
  94.007 μs (12187 allocations: 630.94 KiB)
3313.78577447527

The lookup of ThreadsX.mapreduce inferred correctly, but the return type of the call itself did not (Body::Any). Hopefully fixing that will make it competitive with the Base dot product, although I may have to specify

function mydot(x, y)
    init = zero(promote_type(eltype(x), eltype(y)))
    ThreadsX.mapreduce(Base.FastMath.mul_fast, Base.FastMath.add_fast, x, y, init = init)
end

If that’s required for SIMD reductions.

fetch(::Task) cannot be inferred, so parallel mapreduce cannot be inferred (without using compiler internals). I probably shouldn’t have brought up inferrability as it’s not very relevant for threaded code. I think the cost of dynamic dispatch has to be negligible compared to the per-thread computation time anyway, at least for now.

Spawning a task and fetching the result from it has some overhead:

julia> @btime fetch(Threads.@spawn nothing)
  1.211 μs (6 allocations: 736 bytes)

So we need the single-thread computation to take much longer than a few microseconds.
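
One practical knob for this is the basesize keyword that ThreadsX functions accept: it sets how many elements each base case (and hence each spawned task) processes. For example (the value here is purely illustrative):

using ThreadsX

xs = rand(10^6)

# Each base case handles 10^5 elements, so per-task work dwarfs the ~1 μs
# spawn/fetch overhead measured above.
ThreadsX.sum(abs2, xs; basesize = 100_000)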

In terms of the single-thread performance (which can be measured by setting basesize to “infinity”), it is somewhat comparable to * if you specify simd=true:

julia> @btime $x' * $x
  1.424 μs (0 allocations: 0 bytes)
3296.5514299233864

julia> @btime ThreadsX.mapreduce(*, +, $x, $x; simd=true, basesize=typemax(Int))
  1.676 μs (11 allocations: 496 bytes)
3296.551429923387

…which should probably be the default, now that I think about it :slight_smile: For example, Base’s mapreduce uses it.

I’d be all for having simd = true as default.

Hey @tkf,

I updated my container with perf, so now I can run your

function perf(f, args)
    pid = getpid()
    cmd = `perf $args --pid=$pid`
    proc = run(pipeline(cmd, stdout=stdout, stderr=stderr); wait=false)
    try
        return f()
    finally
        flush(stdout)
        flush(stderr)
        kill(proc, Base.SIGINT)
        wait(proc)
    end
end

using BenchmarkTools, ThreadsX
b = @benchmarkable ThreadsX.sort($(rand(0:0.01:1, 1_000_000)); alg=MergeSort)
tune!(b)

perf(`stat -ddd`) do
    run(b)
end |> display

Here’s the result (using the Manifest.toml in the ThreadsXBenchmarks opt branch):

 Performance counter stats for process id 'xxxxxx':

        320,806.28 msec task-clock:u              #   61.441 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
           780,457      page-faults:u             #    0.002 M/sec
   998,915,051,084      cycles:u                  #    3.114 GHz                      (33.17%)
   105,151,583,800      stalled-cycles-frontend:u #   10.53% frontend cycles idle     (33.24%)
   759,847,656,100      stalled-cycles-backend:u  #   76.07% backend cycles idle      (33.32%)
   469,608,517,210      instructions:u            #    0.47  insn per cycle
                                                  #    1.62  stalled cycles per insn  (33.39%)
    82,639,767,893      branches:u                #  257.600 M/sec                    (33.47%)
     1,999,816,791      branch-misses:u           #    2.42% of all branches          (33.49%)
   123,771,767,763      L1-dcache-loads:u         #  385.815 M/sec                    (33.49%)
     3,366,246,284      L1-dcache-load-misses:u   #    2.72% of all L1-dcache hits    (33.50%)
   <not supported>      LLC-loads:u
   <not supported>      LLC-load-misses:u
     3,417,363,890      L1-icache-loads:u         #   10.652 M/sec                    (33.48%)
        27,569,790      L1-icache-load-misses:u   #    0.81% of all L1-icache hits    (33.42%)
       179,455,568      dTLB-loads:u              #    0.559 M/sec                    (33.36%)
        56,062,290      dTLB-load-misses:u        #   31.24% of all dTLB cache hits   (33.28%)
         1,656,153      iTLB-loads:u              #    0.005 M/sec                    (33.20%)
           580,824      iTLB-load-misses:u        #   35.07% of all iTLB cache hits   (33.14%)
     1,401,532,572      L1-dcache-prefetches:u    #    4.369 M/sec                    (33.12%)
   <not supported>      L1-dcache-prefetch-misses:u

       5.221355168 seconds time elapsed

BenchmarkTools.Trial:
  memory estimate:  22.30 MiB
  allocs estimate:  41828
  --------------
  minimum time:     7.516 ms (0.00% GC)
  median time:      9.815 ms (0.00% GC)
  mean time:        10.559 ms (10.14% GC)
  maximum time:     16.400 ms (42.81% GC)
  --------------
  samples:          474
  evals/sample:     1

I wish I had access to this monster machine so that I could debug the saturation/degradation in performance.

That may be possible, I sent you an email.

I like that perf function of yours; I wonder if it could go in a benchmarking package?

What JULIA_NUM_THREADS did you use to get that perf output?

It’d be nice if you could start julia with (say) JULIA_NUM_THREADS=30 (the peak) and JULIA_NUM_THREADS=64 (max cores) and compare the perf output of those runs. It may reveal something interesting (it’s kind of a random suggestion, though).

This was one of “be careful what you wish for” things :laughing: (I replied to the email. Thanks!)

Do you mean BenchmarkTools.jl? That perf function is well decoupled from BenchmarkTools.jl, so I think it’s better to put it in a separate package. Maybe I’ll create a small package for it.

I like that perf function of yours, I wonder if that could go in a benchmarking package?
Do you mean BenchmarkTools.jl

No, I guess it should be in a separate package, since it has a binary dependency; also, perf is Linux-only, obviously. Maybe we could make a Perf.jl, with a perf() function and a @perf-benchmark macro that runs perf in combination with BenchmarkTools, like you do in your example?
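
A purely hypothetical sketch of what such a helper could look like (neither Perf.jl nor perfbenchmark exists yet; this just reuses the perf() function defined earlier in the thread):

using BenchmarkTools

# Run a pre-built BenchmarkTools benchmark while `perf` records the current process.
function perfbenchmark(b::BenchmarkTools.Benchmark; args = `stat -ddd`)
    tune!(b)
    perf(args) do
        run(b)
    end
end

# b = @benchmarkable ThreadsX.sort($(rand(0:0.01:1, 1_000_000)); alg=MergeSort)
# perfbenchmark(b) |> display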

Ideally, there would be a way to access CPU performance counters in a cross-platform way in Julia or some package, but until there is, a wrapper for perf could come in very handy. Or is there already an effort underway, regarding performance counter access?

Yeah, I think it’d be nice to support BenchmarkTools.jl from Perf.jl.

I actually think trying to make julia work seamlessly with external tools is a good direction (like PProf.jl and MCAnalyzer.jl). IIUC Intel’s VTune is also a similar tool that uses the same performance counters as perf. It’d be nice to use their GUI.

Yes, definitely! I love PProf.jl (haven’t tried MCAnalyzer.jl yet).

Thanks to @tomhaber’s efforts, there’s PAPI.jl now (yay!).
