With the caveat that I haven’t used or contributed to KernelAbstractions.jl myself, it looks like the implementation for the CPU backend is here: KernelAbstractions.jl/src/cpu.jl at main · JuliaGPU/KernelAbstractions.jl · GitHub. As you can see, it spawns Threads.nthreads() tasks and partitions the iteration space evenly between them. If the static_threads parameter is true, the tasks are spawned using Threads.@threads :static so that they are prevented from migrating between threads; otherwise, @sync and Threads.@spawn are used.
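If it helps to see the idea in code, here is a minimal sketch of that scheme, written from my reading of the description above rather than copied from the actual KernelAbstractions.jl source (run_chunked! and run_chunked_static! are made-up names for illustration):

```julia
using Base.Threads

# Hypothetical sketch: split the iteration space 1:n into nthreads() contiguous
# chunks and run the kernel body f on each chunk in its own task.
function run_chunked!(f, n)
    ntasks = nthreads()
    chunks = Iterators.partition(1:n, cld(n, ntasks))  # roughly equal chunks
    @sync for chunk in chunks
        Threads.@spawn begin
            for i in chunk
                f(i)   # apply the kernel body to each index in this chunk
            end
        end
    end
end

# Static variant: one chunk per thread, pinned with Threads.@threads :static
# so tasks cannot migrate between threads.
function run_chunked_static!(f, n)
    ntasks = nthreads()
    chunks = collect(Iterators.partition(1:n, cld(n, ntasks)))
    Threads.@threads :static for chunk in chunks
        for i in chunk
            f(i)
        end
    end
end
```

Calling run_chunked!(i -> nothing, 10_000) in a session started with several threads would run the (here trivial) body across all of them.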
This is almost exactly how Threads.@threads itself works, with or without the :static parameter. It’s just reimplemented here in a way that fits the internal abstractions and interfaces in KernelAbstractions.jl. In particular, it only uses the threads that Julia was started with; there’s no dark magic to create its own thread pool or anything like that.
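For comparison, here is roughly the same loop written directly with Threads.@threads, which does the chunking and task spawning internally (again just an illustrative sketch; run_threads! is a made-up name):

```julia
using Base.Threads

# User-level equivalent: Threads.@threads partitions 1:n and spawns the tasks
# for you. Passing :static pins each chunk to a thread; omit it for the
# default dynamic scheduling.
function run_threads!(f, n)
    Threads.@threads :static for i in 1:n
        f(i)
    end
end
```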
As for overhead, it looks like KernelAbstractions.jl will have the same overhead as Threads.@threads when static_threads is true, and the same as @sync + Threads.@spawn otherwise. In the past, I’ve measured Threads.@threads to have slightly less overhead than the equivalent @sync + Threads.@spawn, which I think is because the @sync mechanism allocates a Channel to hold all the tasks. However, the difference shouldn’t be significant assuming your computation is large enough to warrant multithreading in the first place.
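If you want to check this on your own machine, a quick benchmark along these lines should show mostly the scheduling cost, since the loop bodies are deliberately trivial (this uses the BenchmarkTools.jl package; the function names are made up for the example):

```julia
using Base.Threads, BenchmarkTools

# Spawning via Threads.@threads.
function with_threads(n)
    Threads.@threads for i in 1:n
        nothing  # trivial body: we only want the scheduling overhead
    end
end

# Spawning via @sync + Threads.@spawn over the same chunked iteration space.
function with_spawn(n)
    ntasks = nthreads()
    @sync for chunk in Iterators.partition(1:n, cld(n, ntasks))
        Threads.@spawn begin
            for i in chunk
                nothing  # trivial body
            end
        end
    end
end

@btime with_threads(1_000)
@btime with_spawn(1_000)
```

With a realistic amount of work per iteration, the two timings should converge, which is the point made above.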