What is the equivalent of this OpenMP Directive in Julia?

What is the equivalent to the below command in Julia, in which I can run a specific code (or function) on a specific CPU core?

!$OMP PARALLEL
    # Code
!$OMP END PARALLEL

That OpenMP directive doesn’t run anything on a “specific CPU core”. It tells OpenMP to execute the given code in parallel using a team of threads (and wait until all threads have completed), but how these are scheduled onto CPU cores is up to the runtime and operating system.

If you just have a block of code in Julia that you want to run N times on all N threads, I guess the closest equivalent would be something like:

tasks = [Threads.@spawn begin
    # code
end for _ = 1:Threads.nthreads()]
foreach(wait, tasks)

For example, running with julia -t 4 (4 threads):

julia> tasks = [Threads.@spawn begin
           sleep(10); println("hello from thread ", Threads.threadid())
       end for _ = 1:Threads.nthreads()]
4-element Vector{Task}:
 Task (runnable) @0x000000010a0aa400
 Task (runnable) @0x000000010a0aa570
 Task (runnable) @0x000000010a0aa6e0
 Task (runnable) @0x000000010a0aa850

julia> foreach(wait, tasks)
hello from thread 2
hello from thread 3
hello from thread 4
hello from thread 1
3 Likes

Or much more simply:

Threads.@threads for i in 1:N
    # Do work
end

The Threads.@threads macro will distribute the loop into N / Threads.nthreads() chunks and assign this work to your active threads. The threads need to be created first, e.g. by starting julia with --threads [N|auto]

As @stevengj says, though, this won’t give you the ability to assign to a specific core, although I believe you can set thread affinity by launching julia with the environment variable JULIA_EXCLUSIVE=1 which might get you some way there (not sure exactly what you’re trying to do there though).

1 Like

If you want to pin Julia threads to specific cores, you might want to check out GitHub - carstenbauer/ThreadPinning.jl: Pinning Julia threads to cores (disclaimer: I’m the author).

If you then want to run code on a specific thread (and core) you can use its @tspawnat macro.

3 Likes

Thank you very much! Your explanations are really helpful.

1- I know that sleep() is to block the current task for a specified number of seconds. But I didnt know/follow the exact usage in this case (i.e., before println), even I have deleted it and got the below. Does it make a barrier? Maybe some comments on the flow of execution of the code would be very helpful.

julia> tasks = [Threads.@spawn begin
                  println("hello from thread ", Threads.threadid())
                         end for _ = 1:Threads.nthreads()]
hello from thread 2
hello from thread 1
hello from thread 3
hello from thread 4
6-element Vector{Task}:hello from thread 6

 hello from thread 5
Task (done) @0x000000009cef7840
 Task (done) @0x000000009cef7a30
 Task (done) @0x000000009cef7c20
 Task (done) @0x000000009cef7e10
 Task (done) @0x000000009c9cc010
 Task (done) @0x000000009c9cc200

2- Since tasks = [...] can give the same results, what is the meaning of using foreach(wait, tasks)?

Thank you very much!

Can I put it at the beginning of my julia code as below before running it the REPL?

ENV["JULIA_EXCLUSIVE"] = 1

Thanks for your note! Will it be available for Windows soon?

I only put it there to make each task run for a little while, so that the foreach(wait, tasks) had a chance to wait for them to complete. (Otherwise they run so quickly that the tasks are already done by the time I run wait.)

Julia uses a dynamic scheduler, not a static schedule. You spawn as many tasks as you want (often many more than you have threads) in order to expose the potential parallelism of your program, and then the runtime scheduler will assign these tasks to threads.

In this particular case, it seems your tasks are running so quickly that the runtime scheduler may have time to migrate them to all the threads.

(One big advantage of a dynamic scheduler is that it can load-balance irregular tasks, whereas a static schedule only works well if you know ahead of time how to equally divide the computational effort. Another big advantage is that parallelism is composable — your code can spawn multiple tasks, and a library routine can also spawn multiple tasks that you don’t know about, and the runtime scheduler will load-balance all of them together.)

It looks like you never ran anything on thread 1, so S[1] is left at whatever you initialized S to?

1 Like

Oops, I forgot to mention the initialization. For this reason, I putting it again (on 6 threads):

S = zeros(nthreads());
N = zeros(nthreads());
@threads for i in 1:nthreads()
    println("hello from thread ", threadid())
    S[i] = threadid();
    N[threadid()] = threadid();
end
hello from thread 2
hello from thread 5
hello from thread 6
hello from thread 4
hello from thread 5
hello from thread 3
julia> S
6-element Vector{Float64}:
 3.0
 5.0
 2.0
 6.0
 3.0
 4.0

julia> N
6-element Vector{Float64}:
 0.0
 2.0
 3.0
 4.0
 5.0
  • From the printed hello..., I see that threadid=5 has addressed two iterations, so number 5 should appear twice in S not umber 3, right?
  • Here it is clear that N[1]=0 because threadid=1 has not been assigned any iteration.

I don’t see how you can tell which iteration occurred for which value of i, since you didn’t print that (and the println statements will not necessarily execute in order of i, since they are parallel) … maybe change it to println("hello $i from thread ", threadid()).

(Technically, it may be possible for the loop body to migrate from one thread to another while it executes, so you have a bit of a race condition here, but I guess that’s not too likely for such a small/fast loop body.)

1 Like

Why do you care about the affinity to particular CPUs? My feeling is that this is mostly useful for benchmarking (since it makes the performance more consistent).

1 Like

No that won’t work, you need to set the environment variable before starting the Julia process. E.g. JULIA_EXCLUSIVE=1 julia -t4

But I agree with Steven here: very likely, you shouldn’t care about the thread-core mapping.

1 Like

Maybe my understanding for the execution is not correct. Therefore, I made the change based on your comment as below. Please correct me if my explanations for the below results are not true.

  • From the println statements, thread 2 has executed two iterations when i equals 2&3. So, number 2 has to be saved in S[2]&S[3], which is true for S[3] but not for S[2] (=3).
  • Following the same logic, S[4] (should =5 not 6), S[5] (should =6 not 5), and S[6] (should =3 not 4).
S = zeros(nthreads());
@threads for i in 1:nthreads()
    println("hello from $i thread ", threadid())
    S[i] = threadid();
end
hello from 1 thread 4
hello from 3 thread 2
hello from 2 thread 2
hello from 5 thread 6
hello from 4 thread 5
hello from 6 thread 3

julia> S
6-element Vector{Float64}:
 4.0
 3.0
 2.0
 6.0
 5.0
 4.0

I think this is what happened above that there is a migration between the println and the assignment statements, right?

I am thinking to spawn my main code over the 1:nthreads()-1 thread and reserve the last thread to handle the rest code. So, I can reduce the overlap?

Any feedbacks here please?