ThreadingUtilities.jl provides only a low-level API, which packages like Polyester.jl, LoopVectorization.jl, and Octavian.jl use to provide more convenient APIs.
If you want one of the conveniently provided APIs, use them. If they don’t work for your use case, you can use ThreadingUtilities.jl directly (or perhaps in conjunction with PolyesterWeave.jl, which Polyester.jl, LoopVectorization.jl, and Octavian.jl also use).
If I pin the Julia threads, will it transfer to ThreadingUtilities.jl threads?
I can imagine running julia -t nthreads under numactl or taskset.
There is also an approach used in ThreadPinning.jl, which is based on querying the cpu id with sched_getcpu and pinning the thread using uv_thread_setaffinity.
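For reference, a minimal sketch of that approach using ThreadPinning.jl's public API (pinthreads and getcpuids are real functions from that package; this assumes Linux, since the package is built on sched_getcpu/uv_thread_setaffinity):

```julia
using ThreadPinning  # Linux-only

# Pin Julia threads to cores in order: thread 1 -> core 0, thread 2 -> core 1, ...
pinthreads(:cores)

# Query which CPU each Julia thread is currently running on.
@show getcpuids()
```

Since Polyester.jl and friends reuse the same Julia threads, this pinning carries over to them as well.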
Let’s say I repeatedly do @batch per=thread for i=1:4000 with 4 threads. Will the range 1:1000 always be processed on thread 1, 1001:2000 on thread 2, and so on?
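For concreteness, the pattern in question looks something like this (the loop body is a placeholder; per-thread accumulation via threadid is just one way to observe which thread ran what):

```julia
using Polyester

# One slot per thread, so each thread only writes to its own slot.
counts = zeros(Int, Threads.nthreads())

@batch per=thread for i in 1:4000
    counts[Threads.threadid()] += 1
end
```

With static scheduling and 4 threads, each slot should end up holding 1000 iterations.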
The mentioned packages just manage the available pool of Julia threads (specified via julia -t) and don’t create any new threads. So, if you pin the Julia threads by whatever method you’ve also pinned the “ThreadingUtilities.jl threads” because they are the same threads.
That’s a matter of the scheduling logic and what you describe is what e.g. @threads :static gives you. I believe @batch works in the same way but @Elrod will know better.
Yes, it does static scheduling.
If you have 4 threads total, the first thread tells the other three to do work, and then begins on the last chunk itself.
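The static split itself is plain index arithmetic. A sketch (not the actual Polyester.jl internals) in which the last chunk absorbs the possibly-smaller remainder, matching the description above where the calling thread takes the final chunk:

```julia
# Split 1:n into nchunks contiguous ranges; the last range may be shorter.
function chunk_ranges(n, nchunks)
    len = cld(n, nchunks)  # ceiling division: length of the full chunks
    return [(1 + (c - 1) * len):min(c * len, n) for c in 1:nchunks]
end

chunk_ranges(4000, 4)  # [1:1000, 1001:2000, 2001:3000, 3001:4000]
chunk_ranges(4001, 4)  # last chunk is shortest: [1:1001, 1002:2002, 2003:3003, 3004:4001]
```

This also illustrates the remark below about the first thread compensating for its late start: when n doesn't divide evenly, the final chunk (the caller's own) is the smallest.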
It goes through PolyesterWeave.jl, so if you have many nested threaded programs that each use only a few threads (e.g. maybe you have 16 threads total, but each place in your code only uses 4), then of course it’s harder to predict where any particular set will land. In general, though, it’ll be patterns like this: all the leading threads first, and then the thread that actually ran the @batch code handling the rest.
So if you had a 5950X, which has an L3 cache for cores 0-7, and a second for cores 8-15, the first and last 7 groups would run on the last L3, while the second through ninth would run on the first L3.
If this is too inconvenient, I’d accept a PR changing this.
But it’d ideally be accompanied by benchmarks.
Possible issues that could come up from changing this:
The first thread might get started last. By handling the remainder, it then does the least work to compensate. On the other hand, maybe it takes more time for other threads to properly get started because of latency in communication.
Maybe we should pay more attention to alignment of split chunks when iterating over arrays. Or at least have the option to preserve it.
It seems that another option is to combine distributed and threaded computing. We can create a Julia process per L3 cache and pin that process’s threads to the corresponding cores. The threads in different L3 caches would then be isolated from each other by the fact that they belong to different processes.
The downside is, of course, that it is harder to share data between the processes. However, if there is not a lot of communication between different thread groups (corresponding to different L3 caches), then it should be okay.
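A hedged sketch of that process-per-L3 idea, assuming a 5950X-like layout (cores 0-7 on one L3, 8-15 on the other) and that ThreadPinning.jl is installed in the workers' environment; the core lists are an assumption about the topology, not something queried from the machine:

```julia
using Distributed

# One worker process per L3 cache domain, 8 threads each.
addprocs(2; exeflags = "-t 8")

@everywhere workers() using ThreadPinning

# Assumed core ids for the two L3 domains of a 5950X.
l3_domains = [0:7, 8:15]

# Pin each worker's 8 threads to the cores of "its" L3 cache.
for (w, cores) in zip(workers(), l3_domains)
    remotecall_wait(pinthreads, w, collect(cores))
end
```

Cross-process communication then goes through Distributed (remotecall, RemoteChannel, etc.), which is where the data-sharing overhead mentioned above comes in.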
@carstenbauer Do you think a function to add processes with pinned threads could be a useful addition to your ThreadPinning.jl?
I want to keep ThreadPinning.jl as slim as possible and focused on threads, so adding processes (via Distributed) isn’t part of its scope. But maybe there is (or should be) a package for managing processes (and threads). Feel free to start it!