ThreadingUtilities.jl provides only a low-level API, which packages like Polyester.jl, LoopVectorization.jl, and Octavian.jl use to provide more convenient APIs.
If you want one of the conveniently provided APIs, use them. If they don’t work for your use case, you can use ThreadingUtilities.jl directly (or perhaps in conjunction with PolyesterWeave.jl, which Polyester.jl, LoopVectorization.jl, and Octavian.jl also use).
If I pin the Julia threads, will it transfer to ThreadingUtilities.jl threads?
I can imagine running julia -t nthreads under numactl or taskset.
There is also an approach used in ThreadPinning.jl, which is based on querying the cpu id with sched_getcpu and pinning the thread using uv_thread_setaffinity.
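For reference, a minimal sketch of that approach using ThreadPinning.jl's public API (pinthreads and getcpuids are real functions from that package; this assumes Linux, since the package is built on sched_getcpu/uv_thread_setaffinity):

```julia
using ThreadPinning  # Linux-only

# Pin Julia threads to cores in order: thread 1 -> core 0, thread 2 -> core 1, ...
pinthreads(:cores)

# Query which CPU each Julia thread is currently running on.
@show getcpuids()
```

Since Polyester.jl and friends reuse the same Julia threads, this pinning carries over to them as well.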
Let’s say I repeatedly do @batch per=thread for i=1:4000 with 4 threads. Will the range 1:1000 always be processed on thread 1, 1001:2000 on thread 2, and so on?
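For concreteness, the pattern in question looks something like this (the loop body is a placeholder; per-thread accumulation via threadid is just one way to observe which thread ran what):

```julia
using Polyester

# One slot per thread, so each thread only writes to its own slot.
counts = zeros(Int, Threads.nthreads())

@batch per=thread for i in 1:4000
    counts[Threads.threadid()] += 1
end
```

With static scheduling and 4 threads, each slot should end up holding 1000 iterations.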
The mentioned packages just manage the available pool of Julia threads (specified via julia -t) and don’t create any new threads. So, if you pin the Julia threads by whatever method you’ve also pinned the “ThreadingUtilities.jl threads” because they are the same threads.
That’s a matter of the scheduling logic and what you describe is what e.g. @threads :static gives you. I believe @batch works in the same way but @Elrod will know better.
Yes, it does static scheduling.
If you have 4 threads total, the first thread tells the other three to do work, and then begins on the last chunk itself.
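The static split itself is plain index arithmetic. A sketch (not the actual Polyester.jl internals) in which the last chunk absorbs the possibly-smaller remainder, matching the description above where the calling thread takes the final chunk:

```julia
# Split 1:n into nchunks contiguous ranges; the last range may be shorter.
function chunk_ranges(n, nchunks)
    len = cld(n, nchunks)  # ceiling division: length of the full chunks
    return [(1 + (c - 1) * len):min(c * len, n) for c in 1:nchunks]
end

chunk_ranges(4000, 4)  # [1:1000, 1001:2000, 2001:3000, 3001:4000]
chunk_ranges(4001, 4)  # last chunk is shortest: [1:1001, 1002:2002, 2003:3003, 3004:4001]
```

This also illustrates the remark below about the first thread compensating for its late start: when n doesn't divide evenly, the final chunk (the caller's own) is the smallest.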
It goes through PolyesterWeave.jl, so if you have many nested threaded programs that each use only a few threads (e.g. maybe you have 16 threads total, but each place in your code only uses 4), then of course it’s harder to predict where any particular set will land. In general, though, it’ll be patterns like this: all the leading threads first, and then the thread that actually ran the @batch code handling the rest.
So if you had a 5950X, which has an L3 cache for cores 0-7, and a second for cores 8-15, the first and last 7 groups would run on the last L3, while the second through ninth would run on the first L3.
If this is too inconvenient, I’d accept a PR changing this.
But it’d ideally be accompanied by benchmarks.
Possible issues that could come up from changing this:
The first thread might get started last. By handling the remainder, it then does the least work to compensate. On the other hand, maybe it takes more time for other threads to properly get started because of latency in communication.
Maybe we should pay more attention to alignment of split chunks when iterating over arrays. Or at least have the option to preserve it.
It seems that another option is to combine distributed and threaded computing. We can create a Julia process per L3 cache and pin that process’s threads to the corresponding cores. The threads in different L3 caches would then be isolated from each other by the fact that they belong to different processes.
The downside is, of course, that it is harder to share data between the processes. However, if there is not a lot of communication between different thread groups (corresponding to different L3 caches), then it should be okay.
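A hedged sketch of that process-per-L3 idea, assuming a 5950X-like layout (cores 0-7 on one L3, 8-15 on the other) and that ThreadPinning.jl is installed in the workers' environment; the core lists are an assumption about the topology, not something queried from the machine:

```julia
using Distributed

# One worker process per L3 cache domain, 8 threads each.
addprocs(2; exeflags = "-t 8")

@everywhere workers() using ThreadPinning

# Assumed core ids for the two L3 domains of a 5950X.
l3_domains = [0:7, 8:15]

# Pin each worker's 8 threads to the cores of "its" L3 cache.
for (w, cores) in zip(workers(), l3_domains)
    remotecall_wait(pinthreads, w, collect(cores))
end
```

Cross-process communication then goes through Distributed (remotecall, RemoteChannel, etc.), which is where the data-sharing overhead mentioned above comes in.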
@carstenbauer Do you think a function to add processes with pinned threads could be a useful addition to your ThreadPinning.jl?
I want to keep ThreadPinning.jl as slim as possible and focused on threads, so adding processes (via Distributed) isn’t part of its scope. But maybe there is (or should be) a package for managing processes (and threads). Feel free to start it!