[ANN] ThreadPools.jl - Improved thread management for background and nonuniform tasks

This looks cool. It might help to have a simple problem where this creates economically meaningful performance gains over the standard @threads. Are we talking 0.01% faster or 10% faster?

If you are going to introduce a breaking change, I wonder if it makes sense to use something like @pthreads pool for ... where pool can be any instance of the pool (pthreads = pooled threads). You might want to use the same code with different pool.

5 Likes

Of course @Sukera and @JeffreySarnoff are correct - there is really no established API to break at this point (except my own stuff - easily solved). I like the “q-first” suggestion as well.

@tbeason, the usage example above (while contrived) is an example. If you have a number of jobs that increase in length over the range, the the standard @threads macro will schedule all of the longest jobs onto the last thread. The others will finish early. (In the above example, jobs 7 and 8 would have been on the last thread, completing in 1.5s.vs 1.2s.) In a less contrived example, if 1% of your jobs are 100x as long as the rest, then your completion time will almost solely on how they happen to get scheduled, and can vary wildly from run to run. If they all end up on one thread, there will be a lot of idle threads waiting for that last to complete. Can’t put a number on it - it totally depends on the distribution of your task durations. If they are all equal, the queuing will be less efficient than a pre-division.

@tkf - Definitely something to think about. Maybe an advanced user layer between the simple API and the lower-level ThreadPool?

Love the feedback folks - thanks.

3 Likes

Love pools of parallelism :heart:

Looking forward to DistributedPools.jl, GPUPools.jl:slight_smile:

1 Like

I’ve already run into the need for Distributed Pools. :slight_smile: I am running on a 12-CPU, quad core system, and I find that when I go from 4 to 5 threads, the individual task times double. (They have to access the DB back on the first CPU). So I’ve moved to 12 Distributed ThreadPools of 4 threads each. :slight_smile:

1 Like

I like the suggestion very much. You could have a macro @pthreads pool for ... and then set a default pool type so that end users could simply do @pthreads for ...? Alias macros like @qbgthreads, @fgthreads may also be useful, but I’d say @pthreads bg for ... is more elegant :slight_smile:

2 Likes

I think I agree - makes it easier to slip in LoggingThreadPools or LoopDeLoopPools or whatever else folks dream up. :slight_smile: Then the simple API just becomes a series of aliases for that layer.

Interesting update on the DistributedPools.jl concept above - I thought I was running on 12 CPUs, 4 cores per CPU based on behavior, but this was incorrect. Using versioninfo(), it says my my single Linux CPU (I develop on Windows, release to the team on Linux) is an Intel Xeon Gold 6146: single 12-core, 24-thread CPU. This implies 2 threads per core, which is the same as other Intel CPUs I see.

But the 4-thread threshold I quote above is not ambiguous. Moving from 4 to 5 threads slows all tasks by 50% (including the Base.@threads). I am betting it is a hardware issue, but I will benchmark on both Windows (i7-8550U - 4 cores/8 threads, zero sensitivity to thread count) and Linux just to see if there is a difference.

Could that be related to different NUMA nodes? I’m not sure if Intel cpus have multiple numa nodes though

Could be. I’m working on the upgraded version of this package which includes a nice visualization trick from @tkf. Once the package stabilizes, I’m going to turn that visualizer toward a more fine-grained investigation on the thread-count-vs-job-speed behavior.

Can also be due to how CPU cores share cache? Maybe sharing the output of lscpu and lscpu --extended would help?

Also either way, maybe profiling the program using perf stat can be informative (e.g., using 5 threads results in more cache misses?).

1 Like

My experience is, that I get not always all available threads from the @threads macro. I’m currently investigating this.

When doing performance measurements you must look at the threads you got actually for your tasks, not only at nthreads() or JULIA_NUM_THREADS.

Maybe this is the issue you are looking for.

Ironic - just got the lscpu from IT. NUMA looks like it is ruled out - there are only 2 NUMA nodes. I will work on the perf stat front soon.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
Stepping:              4
CPU MHz:               3192.488
BogoMIPS:              6381.67
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

@pbayer - thanks. I double-checked my logs (I track each task by thread), and can see all threads are active. They are all (ie not just 5+, and that is news to me as of this post) slower on a per-task basis at 5 threads versus 4.

@tkf Attached below are the visualization hookups we were talking about, in this case for the LoggingQueuedThreadPool (Channel{Task} in/out, handlers take new tasks as previous ones finish) running the same loading function from your tmapreduce example (just the work function).

Here’s the interesting bit. The first plot is the job times as seen by the jobs themselves. The second plot is the job times as seen by the logger, which includes the schedule and waiting. As far as I can tell, there is always a delay in between the scheduling and start time of the first Task on a given thread with this setup. I’ve confirmed the handlers are up and running <1ms in, and the return from schedule is very fast. Pretty fascinating - I’ll debug more and see if a pin method sugested by @vtjnash gets around this overhead.
Figure_2
Figure_1

1 Like

Of course, with a heavier work function (~10ms/job here), that setup overhead doesn’t seem so bad. This, by the way is with @tkf & @juliohm’s API suggestion:

pool = LoggingQueuedThreadPool("log.txt")
@pthreads pool for i in 1:64
    work(io, i)
end
close(pool)
showprofile("log.txt")

Figure_3

1 Like

Nice to see ThreadPools.jl API evolving :+1:

For the API, I’d suggest something like this

pool = LoggingQueuedThreadPool("log.txt") do pool
    @pthreads pool for i in 1:64
        work(io, i)
    end
end  # close(pool) automatically called
plot(pool)  # plotted via `@recipe`

BTW, using x axis for time makes total sense :slight_smile:

2 Likes

As an aside, while testing on a different (8-thread) machine, I ran across an example of the above statement that using the primary for both computation and thread management can cause resulting times to vary. Here, you can see that 3/4 of the way through the work, the job on the primary got delayed by something. (No idea what, but in a non-uniform computation case, this could have been just a long task.) As a result, all the threads were delayed in getting their next task, because the primary thread houses the in/out queues. Particularly on an 8-thread machine like this, I’ll likely always default to background threads.

Figure_1

2 Likes

How GC time is reflected in this chart? Mark and sweep has to stop the world.

The chart just shows start and stop times. If a task is parked due to a GC sweep, it will simply show up as being longer. It’s an interesting question - I assume GC takes place on the main thread?

1 Like

I cannot find any detailed description, so maybe someone could answer or point to the docs:

  1. Is GC stopping only a single task or all the tasks running in all threads? (I think stopping all tasks is needed to not crash process)
  2. As any task could trigger GC during allocation, is GC run in its thread, or control is passed to the main thread? (running in the triggering task’s thread makes more sense to me)

I worked a lot with languages that have moving garbage collectors, so my mental model might be wrong as Julia’s GC is non-moving.

2 Likes