This looks cool. It might help to have a simple problem where this creates economically meaningful performance gains over the standard `@threads`. Are we talking 0.01% faster or 10% faster?
If you are going to introduce a breaking change, I wonder if it makes sense to use something like `@pthreads pool for ...`, where `pool` can be any instance of a pool (`pthreads` = pooled threads). You might want to use the same code with different pools.
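Something like this, for instance - purely a hypothetical sketch, since neither the `@pthreads` macro nor these pool constructors exist yet; the point is only that the same loop body could be handed to different pool instances:

```julia
# Hypothetical sketch only - `@pthreads` and the pool types are proposed
# names, not existing API.
qpool = QueuedThreadPool()          # queue-based scheduling
lpool = LoggingThreadPool("log")    # same loop, logged scheduling

for pool in (qpool, lpool)
    @pthreads pool for i in 1:100
        do_work(i)                  # `do_work` is a placeholder
    end
end
```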
Of course @Sukera and @JeffreySarnoff are correct - there is really no established API to break at this point (except my own stuff - easily solved). I like the "q-first" suggestion as well.
@tbeason, the usage example above (while contrived) is an example. If you have a number of jobs that increase in length over the range, the standard `@threads` macro will schedule all of the longest jobs onto the last thread. The others will finish early. (In the above example, jobs 7 and 8 would have been on the last thread, completing in 1.5 s vs. 1.2 s.) In a less contrived example, if 1% of your jobs are 100x as long as the rest, then your completion time will depend almost solely on how they happen to get scheduled, and can vary wildly from run to run. If they all end up on one thread, there will be a lot of idle threads waiting for that last one to complete. Can't put a number on it - it totally depends on the distribution of your task durations. If they are all equal, the queuing will be less efficient than a pre-division.
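For anyone who wants to reproduce the effect with nothing but stock `Base.Threads`, here is a minimal sketch of that contrived case - job `i` sleeps for `0.1*i` seconds, and the default static chunking hands the contiguous block of longest jobs to the last thread:

```julia
using Base.Threads

# Job i takes 0.1*i seconds; static chunking gives the last thread the
# contiguous block of longest jobs (7 and 8 on a 4-thread session).
function demo()
    t0 = time()
    @threads for i in 1:2*nthreads()
        sleep(0.1 * i)                                   # stand-in for real work
        println("job $i on thread $(threadid()) finished at ",
                round(time() - t0, digits = 2), " s")
    end
    println("total: ", round(time() - t0, digits = 2), " s")
end

demo()
```

On a 4-thread session this completes in about 1.5 s (jobs 7 and 8 back-to-back on the last thread), versus roughly 1.2 s with greedy queueing, where each thread just takes the next job as it frees up.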
@tkf - Definitely something to think about. Maybe an advanced user layer between the simple API and the lower-level ThreadPool?
Love the feedback folks - thanks.
Love pools of parallelism
Looking forward to `DistributedPools.jl`, `GPUPools.jl`, …
I've already run into the need for Distributed Pools. I am running on a 12-CPU, quad-core system, and I find that when I go from 4 to 5 threads, the individual task times double. (They have to access the DB back on the first CPU.) So I've moved to 12 Distributed ThreadPools of 4 threads each.
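For reference, the shape of that setup with stock `Distributed` looks roughly like this (a sketch only - the `--threads` flag needs a Julia version that supports it, and the work inside `pmap` is a placeholder):

```julia
using Distributed

# 12 worker processes, each started with 4 threads available to @threads.
# If `--threads` is unavailable, set JULIA_NUM_THREADS for the workers instead.
addprocs(12; exeflags = "--threads=4")

# Each worker runs its own threaded loop over its chunk of the jobs.
results = pmap(1:12) do chunk
    partial = zeros(Threads.nthreads())
    Threads.@threads for i in 1:100
        partial[Threads.threadid()] += chunk * i         # placeholder work
    end
    sum(partial)
end
```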
I like the suggestion very much. You could have a macro `@pthreads pool for ...` and then set a default pool type so that end users could simply do `@pthreads for ...`? Alias macros like `@qbgthreads`, `@fgthreads` may also be useful, but I'd say `@pthreads bg for ...` is more elegant.
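To illustrate the alias idea (all names and keywords here are hypothetical - nothing is settled API), the aliases could be thin macros that forward to a single generic entry point with a default pool:

```julia
# Hypothetical sketch: `@pthreads`, `QueuedThreadPool`, and the
# `allow_primary` keyword are assumed names, not existing API.
macro bgthreads(expr)
    # background threads only - keep the primary free for queue management
    esc(:(@pthreads QueuedThreadPool(allow_primary = false) $expr))
end

macro fgthreads(expr)
    # allow the primary thread to take jobs as well
    esc(:(@pthreads QueuedThreadPool(allow_primary = true) $expr))
end
```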
I think I agree - makes it easier to slip in `LoggingThreadPool`s or `LoopDeLoopPool`s or whatever else folks dream up. Then the simple API just becomes a series of aliases for that layer.
Interesting update on the `DistributedPools.jl` concept above - I thought I was running on 12 CPUs, 4 cores per CPU based on behavior, but this was incorrect. Using `versioninfo()`, it says my single Linux CPU (I develop on Windows, release to the team on Linux) is an Intel Xeon Gold 6146: a single 12-core, 24-thread CPU. This implies 2 threads per core, which is the same as other Intel CPUs I see.
But the 4-thread threshold I quote above is not ambiguous. Moving from 4 to 5 threads slows all tasks by 50% (including with `Base.@threads`). I am betting it is a hardware issue, but I will benchmark on both Windows (i7-8550U - 4 cores/8 threads, zero sensitivity to thread count) and Linux just to see if there is a difference.
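For the record, the per-task numbers come from something like this minimal sketch (plain `Base.Threads`; the work loop is a stand-in), run once with `JULIA_NUM_THREADS=4` and once with `5`:

```julia
using Base.Threads, Statistics

# Time each task individually so per-task cost can be compared across
# sessions started with different JULIA_NUM_THREADS settings.
function per_task_times(njobs)
    times = zeros(njobs)
    work  = zeros(njobs)
    @threads for i in 1:njobs
        t = time()
        work[i]  = sum(sin, 1:10^6)          # placeholder CPU-bound work
        times[i] = time() - t
    end
    times
end

ts = per_task_times(64)
println("nthreads = ", nthreads(), ", mean task time = ",
        round(mean(ts), digits = 4), " s")
```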
Could that be related to different NUMA nodes? I'm not sure if Intel CPUs have multiple NUMA nodes, though.
Could be. I'm working on the upgraded version of this package, which includes a nice visualization trick from @tkf. Once the package stabilizes, I'm going to turn that visualizer toward a more fine-grained investigation of the thread-count-vs-job-speed behavior.
Could it also be due to how CPU cores share cache? Maybe sharing the output of `lscpu` and `lscpu --extended` would help?
Also, either way, profiling the program using `perf stat` may be informative (e.g., does using 5 threads result in more cache misses?).
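Something along these lines, for instance (a sketch - `my_benchmark.jl` is a placeholder, and `perf` needs to be installed and permitted on the box):

```julia
# Capture topology, then run the benchmark under perf with 5 threads and
# count cache misses. Repeat with "4" to compare.
run(`lscpu --extended`)
withenv("JULIA_NUM_THREADS" => "5") do
    run(`perf stat -e cache-misses,cache-references julia my_benchmark.jl`)
end
```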
My experience is that I do not always get all available threads from the `@threads` macro. I'm currently investigating this.
When doing performance measurements, you must look at the threads your tasks actually ran on, not only at `nthreads()` or `JULIA_NUM_THREADS`.
Maybe this is the issue you are looking for.
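A tiny sketch of what that check can look like - count how many iterations actually land on each thread rather than trusting `nthreads()` alone:

```julia
using Base.Threads

# Each thread increments only its own slot, so no locking is needed.
counts = zeros(Int, nthreads())
@threads for i in 1:1000
    counts[threadid()] += 1
end
println("iterations per thread: ", counts)   # a zero means that thread never ran a task
```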
Ironic - just got the `lscpu` output from IT. NUMA looks like it is ruled out - there are only 2 NUMA nodes. I will work on the `perf stat` front soon.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3192.488
BogoMIPS: 6381.67
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
@pbayer - thanks. I double-checked my logs (I track each task by thread), and can see all threads are active. They are all (i.e. not just threads 5+, and that is news to me as of this post) slower on a per-task basis at 5 threads versus 4.
@tkf Attached below are the visualization hookups we were talking about, in this case for the LoggingQueuedThreadPool (`Channel{Task}` in/out, handlers take new tasks as previous ones finish) running the same loading function from your `tmapreduce` example (just the work function).
Here's the interesting bit. The first plot is the job times as seen by the jobs themselves. The second plot is the job times as seen by the logger, which includes the `schedule` and `wait`ing. As far as I can tell, there is always a delay between the scheduling and the start time of the first Task on a given thread with this setup. I've confirmed the handlers are up and running <1 ms in, and the return from `schedule` is very fast. Pretty fascinating - I'll debug more and see if a `pin` method suggested by @vtjnash gets around this overhead.
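For reference, the gap itself is easy to bound with plain `Threads.@spawn`, independent of the pool's `Channel{Task}` plumbing - this sketch only measures the raw spawn-to-start latency, not the pool's own queueing:

```julia
# Measure the delay between scheduling a task and the moment its body
# starts executing (milliseconds).
gaps = Float64[]
for _ in 1:8
    started = Ref(0.0)
    t_sched = time()
    t = Threads.@spawn (started[] = time())   # body records its start time
    wait(t)
    push!(gaps, (started[] - t_sched) * 1000)
end
println("spawn-to-start gaps (ms): ", round.(gaps, digits = 3))
```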
Of course, with a heavier work function (~10 ms/job here), that setup overhead doesn't seem so bad. This, by the way, is with @tkf's and @juliohm's API suggestion:
pool = LoggingQueuedThreadPool("log.txt")
@pthreads pool for i in 1:64
    work(io, i)
end
close(pool)
showprofile("log.txt")
Nice to see ThreadPools.jl API evolving
For the API, I'd suggest something like this:
pool = LoggingQueuedThreadPool("log.txt") do pool
    @pthreads pool for i in 1:64
        work(io, i)
    end
end # close(pool) automatically called
plot(pool) # plotted via `@recipe`
BTW, using x axis for time makes total sense
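In case it helps, here is a minimal sketch of what that `@recipe` hook could look like, assuming the log boils down to per-job (thread, start, stop) records - the record type and field names are made up for illustration:

```julia
using RecipesBase

# Hypothetical per-job record; the real log format may differ.
struct JobRecord
    thread::Int
    start::Float64   # seconds from t0
    stop::Float64
end

# One horizontal segment per job: x = time, y = thread id.
@recipe function f(records::Vector{JobRecord})
    xguide --> "time (s)"
    yguide --> "thread"
    legend --> false
    for r in records
        @series begin
            seriestype := :path
            linewidth --> 6
            [r.start, r.stop], [r.thread, r.thread]
        end
    end
end
```

With that in place, `plot(records)` (or `plot(pool)` once the pool exposes its records) would draw the same start/stop chart.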
As an aside, while testing on a different (8-thread) machine, I ran across an example of the above statement that using the primary for both computation and thread management can cause resulting times to vary. Here, you can see that 3/4 of the way through the work, the job on the primary got delayed by something. (No idea what, but in a non-uniform computation case, this could have been just a long task.) As a result, all the threads were delayed in getting their next task, because the primary thread houses the in/out queues. Particularly on an 8-thread machine like this, I'll likely always default to background threads.
How is GC time reflected in this chart? Mark-and-sweep has to stop the world.
The chart just shows start and stop times. If a task is parked due to a GC sweep, it will simply show up as being longer. It's an interesting question - I assume GC takes place on the main thread?
I cannot find any detailed description, so maybe someone could answer or point to the docs:
- Is GC stopping only a single task, or all the tasks running in all threads? (I think stopping all tasks is needed to not crash the process.)
- As any task could trigger GC during allocation, is GC run in its own thread, or is control passed to the main thread? (Running in the triggering task's thread makes more sense to me.)
I worked a lot with languages that have moving garbage collectors, so my mental model might be wrong, as Julia's GC is non-moving.