That is problematic (EDIT: I just checked. Yes, only @threads sets jl_enter_threaded_region.)
I’ll have to test, but I do not expect nesting these threads to work well at all.
What will probably happen if one of the threads it wants to spawn on is busy is that it will either wait or deadlock, depending on whether that busy thread is itself one of the threads it wants to run on.
I’ll have to add some strongly worded warnings.
The problem is that these pools run on regular Julia threads. If one is busy with a different task, that will cause problems – potentially deadlocks.
If I could run them on isolated, protected, threads, I would.
I’ll look more at the existing threading code to see how to make this work better when possibly nested. For example, is there a way to ask what another thread is doing? I don’t want to tell the ThreadingUtilities-managed task on thread 3 to do work if thread 3 is currently busy with a different task.
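For illustration, this is the kind of query I have in mind – a generic busy-flag pattern, not an API that ThreadingUtilities actually exposes (all names here are hypothetical):

# Hypothetical sketch: each worker publishes a busy flag, and callers
# check/claim it before handing over work.
const BUSY = [Threads.Atomic{Bool}(false) for _ in 1:Threads.nthreads()]

is_busy(tid) = BUSY[tid][]
try_reserve!(tid) = !Threads.atomic_cas!(BUSY[tid], false, true)  # true if we claimed it
release!(tid) = (BUSY[tid][] = false)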
FWIW, @spawn-ing from within these threads should be fine, but running them within threaded code would need figuring out.
I don’t like the design/implementation combination, which is why I’m working on an alternative.
Spawning a single do-nothing task takes 70 microseconds on that computer. (results for 3 other computers: 40, 7, and 3.5 microseconds)
On recent x64 CPUs, the L3 cache is shared*, while the L1 and L2 caches are local.
*Sometimes, e.g. on multi-socket systems or many Ryzen CPUs, the L3 is only shared between groups of cores, meaning we have clusters of cores that do not share cache at all. Depth-first scheduling should not transfer work between them.
Is Julia’s scheduler NUMA-aware?
Also, on many CPUs, e.g. Skylake-X, the L3 cache is non-inclusive of the L1 or L2 cache, meaning it doesn’t readily contain copies of the same data to send to other cores. Importantly, it is also not that large: 1.375 MiB per core, vs the 1 MiB of private L2 cache per core.
They have a lot of cores, so the overall shared L3 size is still large, but being non-inclusive adds to the latency of accessing data from another core. I guess that doesn’t matter if you need to synchronize anyway (i.e., if the data was mutated).
These Skylake-X chips can fit 135_168 Float64 values in their private L1 and L2 caches.
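That number is just the per-core private cache divided by the size of a Float64 (assuming the usual 32 KiB L1d on Skylake-X):

julia> (1 * 2^20 + 32 * 2^10) ÷ 8  # 1 MiB L2 + 32 KiB L1d per core, in Float64s
135168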
My Tiger Lake laptop’s L1 and L2 caches are even larger (Hwloc.jl provides the actual information):
julia> (VectorizationBase.cache_size(Val(1)) + VectorizationBase.cache_size(Val(2))) ÷ 8
169984
The local L1 and L2 caches can hold almost 170_000 Float64 values. To make their way to another core, the data would have to migrate through the L3 cache.
These are fairly large problem sizes that cores can work on by themselves, without needing to dip into the L3.
This means that a granular, breadth-first, threading approach will often maximize cache friendliness:
Let the inner, potentially-threaded, code ask: are other cores busy?
If they aren’t, get some speedup by using them.
If they are, minimize overhead and maximize cache locality by running single threaded.
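In code, that decision might look something like this sketch (other_cores_busy is hypothetical – Julia doesn’t currently expose such a query, so the placeholder below always answers false):

other_cores_busy() = false  # placeholder – no such query exists in Julia today

function mymap!(f, out, x)
    if Threads.nthreads() == 1 || other_cores_busy()
        map!(f, out, x)            # single threaded: no spawn overhead, best locality
    else
        Threads.@threads for i in eachindex(out, x)
            out[i] = f(x[i])       # other cores are idle, so spread the work
        end
    end
    return out
end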
I’ve generally found that such granular/breadth-first threading works better.
If you’re fitting a model with Markov Chain Monte Carlo (MCMC), focus on fitting chains in parallel first and foremost. Although this would be the ideal case for composable threading if the overhead were low enough, I’d err on the side of preferring breadth-first over depth-first: people often only want 4 or so chains, so divide the available threads among those chains to give each an allotment of 4 or so threads to run on.
If running a Monte Carlo simulation, where you’re generating datasets and fitting them with MCMC, then I prefer to have the MCMC entirely single threaded and to parallelize at the more granular level of datasets. This gets great locality – each dataset gets to sit in a core’s private cache for as long as it is needed.
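As a sketch of what I mean (fit_one_dataset and the random datasets below are stand-ins, not real MCMC code):

using ThreadsX
datasets = [randn(1_000) for _ in 1:100]            # stand-in for simulated datasets
fit_one_dataset(y) = (mean = sum(y) / length(y),)   # stand-in for a single-threaded MCMC fit
# parallelism only across datasets; each fit stays within one core's private cache
results = ThreadsX.map(fit_one_dataset, datasets)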
Re sharing the L3 cache: Octavian.jl will use the shared L3 cache (or the L2 on ARM chips with only 2 cache levels, where the L2 is shared and the L1 private – but this needs testing), but this involves a lot of synchronization, meaning the synchronization overhead has to be minimal.
Yes, I should have said LoopVectorization.vmap instead of map.
Referencing the earlier example, ThreadsX.map is a convenient way to run MCMC chains in parallel, or to do the Monte Carlo simulations of MCMC fits; it’d be dangerous to assume that none of those thread, and of course they can potentially take a long time.
LoopVectorization.vmap is dumb, and doesn’t actually use any of the LoopVectorization infrastructure.
I should probably make a breaking release where I move it elsewhere.
It heuristically assumes a base case of 32W, where W is the SIMD vector width.
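To make that heuristic concrete, on an AVX512 machine W = 8 for Float64, so the 32W base case works out to:

julia> W = 8;   # SIMD vector width for Float64 with AVX512

julia> 32W      # the heuristic base case, in elements
256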
But yes, I do plan on adding support for these threads to LoopVectorization, where its cost modeling would decide the base case.
But I’m planning on doing that at the same time as a long-promised rewrite of much of the internals. I’ve had a few starts on that, but more demanding things keep coming up. But I hope to return fairly soon.
Or maybe set aside a 4-hour or so block of time each day to work on it and disallow outside distractions.
Yes, Octavian does even better than MKL for these sizes.
But that does make me wonder about the API for, and implementation of, breadth-first threading across the Julia ecosystem.
These, and the vmapt! example from earlier, are the sorts of things it’d be nice to parallelize on your local allotment of 4 cores while running 4 chains of MCMC on a 16+ core system.
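For reference, a minimal vmapt! call (the threaded variant of LoopVectorization’s vmap!) looks like:

using LoopVectorization
x = rand(10_000); y = similar(x);
vmapt!(exp, y, x)   # SIMD-vectorized, threaded elementwise exp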
For now, I’ll implement an AvailableTasks type that can be explicitly passed around to indicate which threads are actually available to spawn on.
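Something like the following hypothetical sketch (the names and fields are assumptions, not the final design):

# Hypothetical sketch – not the actual implementation.
struct AvailableTasks
    thread_ids::Vector{Int}   # threads the caller is allowed to spawn work onto
end

navailable(at::AvailableTasks) = length(at.thread_ids)

# Hand part of the allotment to a nested parallel region, keeping the rest.
function split_allotment(at::AvailableTasks, n::Int)
    n = min(n, length(at.thread_ids))
    AvailableTasks(at.thread_ids[1:n]), AvailableTasks(at.thread_ids[n+1:end])
end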
Sample benchmarks at 48x48:
julia> using BLASBenchmarksCPU, Octavian
[ Info: Precompiling BLASBenchmarksCPU [5fdc822c-4560-4d20-af7e-e5ee461714d5]
julia> M = K = N = 48;
julia> A = rand(M,K); B = rand(K,N); C2 = @time(A*B); C1 = similar(C2);
0.000059 seconds (2 allocations: 18.078 KiB)
julia> @benchmark matmul!($C1,$A,$B) # Octavian.jl, Julia implementation
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.226 μs (0.00% GC)
median time: 1.293 μs (0.00% GC)
mean time: 1.300 μs (0.00% GC)
maximum time: 6.446 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N / 1.226e-6 # calculate GFLOPS
180.41109298531813
julia> @benchmark gemmmkl!($C1,$A,$B) # MKL
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.115 μs (0.00% GC)
median time: 2.134 μs (0.00% GC)
mean time: 2.139 μs (0.00% GC)
maximum time: 8.383 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> @benchmark gemmopenblas!($C1,$A,$B) # OpenBLAS
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.099 μs (0.00% GC)
median time: 3.113 μs (0.00% GC)
mean time: 3.120 μs (0.00% GC)
maximum time: 8.860 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark gemmblis!($C1,$A,$B) # BLIS
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 39.319 μs (0.00% GC)
median time: 44.409 μs (0.00% GC)
mean time: 44.520 μs (0.00% GC)
maximum time: 103.083 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark wait(Threads.@spawn nothing) # just spawning a thread and waiting on it
BenchmarkTools.Trial:
memory estimate: 442 bytes
allocs estimate: 4
--------------
minimum time: 1.829 μs (0.00% GC)
median time: 7.932 μs (0.00% GC)
mean time: 9.982 μs (0.00% GC)
maximum time: 56.803 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 6
180 GFLOPS of Float64 is pretty good for such small sizes. On that computer (AVX512), MKL didn’t seem to start threading yet. It will with AVX2, but Octavian achieves better FLOPS at those sizes.
I’m curious about your ideas.