How to set the number of threads appropriately based on hardware?


Hello,
Based on my computer’s performance, what is the maximum number I can set in Julia for:
1- Multiprocessing: is it addprocs(12)?
2- Multithreading: is it JULIA_NUM_THREADS=12?

Thanks,

Your hardware will run at most 12 threads at a time, whether in one program or several. It will also be running other programs besides Julia. My thinking would be that fewer than 12 would be optimal, but you should benchmark your code.


You can check Sys.CPU_THREADS.
Note that it isn’t type stable:

julia> systhreads() = Sys.CPU_THREADS
systhreads (generic function with 1 method)

julia> @code_warntype systhreads()
Variables
  #self#::Core.Const(systhreads)

Body::Any
1 ─ %1 = Base.Sys.CPU_THREADS::Any
└──      return %1

and also that whether it’s better to use the number of physical cores vs logical threads varies by application.
It’s also more complicated on CPUs with a mix of big and little cores. On the M1, which has 4 big and 4 little cores, I find much better performance when using 4 threads than with 8.


Thanks for your reply!

  • So, basically the number of threads (JULIA_NUM_THREADS) depends on the number of logical processors, right?
  • What about the number of processes? Does it also depend on the number of logical processors (i.e., addprocs())?

Thanks for your reply!
I get the same output as you. Does this mean I have only one thread?


No, Chris’s point was just that the function isn’t type stable, which is what @code_warntype shows you. To get the number of hardware threads, you just want to check the variable itself:

julia> Sys.CPU_THREADS
4
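
Note that Sys.CPU_THREADS describes the hardware, while the number of threads a Julia session actually uses is fixed at startup. A minimal sketch of checking both, assuming Julia was started with `julia --threads N` or with the JULIA_NUM_THREADS environment variable set (the variable names `hw` and `jl` are only for illustration):

```julia
# Logical CPUs (hardware threads) reported for this machine.
hw = Sys.CPU_THREADS

# Threads this Julia session was started with; this cannot be changed
# from inside a running session.
jl = Threads.nthreads()

println("hardware threads: ", hw, ", Julia threads: ", jl)
```

If the second number is 1, Julia was simply started without requesting more threads; it says nothing about the hardware.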

Incidentally, why is that? I would have expected that it is always an Int.


I think it is because it is a non-const global, since it has to be initialized in the module’s __init__ function.
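
If the instability matters in a hot path, one common workaround is to assert the type where the global is read. A minimal sketch, with hypothetical function names; `Base.return_types` is an internal reflection helper used here only to inspect inference, and the result for the unasserted version may differ between Julia versions:

```julia
# Reads the global directly; on the Julia version shown in this thread
# the return type is inferred as Any.
systhreads_unstable() = Sys.CPU_THREADS

# Asserting the type at the use site lets the compiler infer Int.
systhreads_stable() = Sys.CPU_THREADS::Int

inferred_unstable = only(Base.return_types(systhreads_unstable, Tuple{}))
inferred_stable = only(Base.return_types(systhreads_stable, Tuple{}))

println("without assertion: ", inferred_unstable)
println("with assertion:    ", inferred_stable)
```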

On the subject of getting more detailed information, Hwloc.jl provides some:

julia> Sys.CPU_THREADS
36

julia> Hwloc.num_virtual_cores()
36

julia> Hwloc.num_physical_cores()
18

julia> Hwloc.topology()
Machine (125.48 GB)
    Package L#0 P#0 (125.48 GB)
        NUMANode (125.48 GB)
        L3 (24.75 MB)
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#0 P#0
                PU L#0 P#0
                PU L#1 P#18
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#1 P#1
                PU L#2 P#1
                PU L#3 P#19
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#2 P#2
                PU L#4 P#2
                PU L#5 P#20
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#3 P#3
                PU L#6 P#3
                PU L#7 P#21
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#4 P#4
                PU L#8 P#4
                PU L#9 P#22
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#5 P#8
                PU L#10 P#5
                PU L#11 P#23
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#6 P#9
                PU L#12 P#6
                PU L#13 P#24
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#7 P#10
                PU L#14 P#7
                PU L#15 P#25
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#8 P#11
                PU L#16 P#8
                PU L#17 P#26
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#9 P#16
                PU L#18 P#9
                PU L#19 P#27
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#10 P#17
                PU L#20 P#10
                PU L#21 P#28
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#11 P#18
                PU L#22 P#11
                PU L#23 P#29
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#12 P#19
                PU L#24 P#12
                PU L#25 P#30
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#13 P#20
                PU L#26 P#13
                PU L#27 P#31
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#14 P#24
                PU L#28 P#14
                PU L#29 P#32
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#15 P#25
                PU L#30 P#15
                PU L#31 P#33
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#16 P#26
                PU L#32 P#16
                PU L#33 P#34
            L2 (1.0 MB) + L1 (32.0 kB) + Core L#17 P#27
                PU L#34 P#17
                PU L#35 P#35

But it’s no help telling big vs small cores (AFAIK):

julia> versioninfo()
Julia Version 1.8.0-DEV.92
Commit d1145d4569* (2021-06-29 01:41 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.5.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

julia> Sys.CPU_THREADS
8

julia> Hwloc.num_virtual_cores()
8

julia> Hwloc.num_physical_cores()
8

julia> Hwloc.topology()
Machine (3.41 GB)
    Package L#0 P#0 (3.41 GB)
        NUMANode (3.41 GB)
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#0 P#0
            PU L#0 P#0
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#1 P#1
            PU L#1 P#1
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#2 P#2
            PU L#2 P#2
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#3 P#3
            PU L#3 P#3
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#4 P#4
            PU L#4 P#4
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#5 P#5
            PU L#5 P#5
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#6 P#6
            PU L#6 P#6
        L2 (4.0 MB) + L1 (64.0 kB) + Core L#7 P#7
            PU L#7 P#7

Thank you very much, all! Now I know the maximum number of threads available on my computer.

Can I consider the number returned by Sys.CPU_THREADS as the maximum number of processes that I can define in Julia as well, i.e. addprocs(11) + master node = 12 processes in total?
In other words, is the concept of threads similar to that of worker processes in Julia?

addprocs() with no argument defaults to Sys.CPU_THREADS workers.
Distributed normally runs code on the worker processes, so you’d want addprocs(12) for 12 workers.
You can add however many workers you want (memory allowing), but you’ll probably get the best performance with 6 or 12.
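
The difference between workers and total processes can be checked directly. A minimal sketch, assuming 2 workers just to keep the example light (on the machine discussed here you would try 6 or 12); the variable names are illustrative:

```julia
using Distributed

# Assumption: 2 workers for illustration only.
added = addprocs(2)

n_workers = nworkers()   # worker processes only
n_total = nprocs()       # workers plus the master process

println("workers: ", n_workers, ", total processes: ", n_total)

# Ask each worker for its process id to confirm it responds.
ids = [remotecall_fetch(myid, w) for w in workers()]
println("worker ids: ", ids)

rmprocs(added)           # shut the workers down again
```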


The processor you have will only ever run 6 things at once. The extra threads come from “hyperthreading”, which lets the CPU switch rapidly between running one thing on a core and running another on the same core. This can reduce context-switching time and allow the CPU to be utilized somewhat more efficiently, but it is usually only a benefit when you have a lot of cache misses or other stalls.

For efficient numerical code, hyperthreading rarely helps much and can even hurt. So you should try both 6 and 12 and see which is faster for your workload.
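
A rough way to run that comparison is to time the same threaded workload in sessions started with different thread counts, e.g. `julia -t 6 bench.jl` and `julia -t 12 bench.jl`. A minimal sketch; the file name, the kernel `work`, and the problem size are all placeholders for your own code:

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical CPU-bound kernel standing in for the real workload.
work(r) = sum(x -> sqrt(x) * sin(x), r)

# Split 1:n into one chunk per task and add up the partial results.
function parallel_work(n; ntasks = nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = [@spawn work(c) for c in chunks]
    return sum(fetch, tasks)
end

parallel_work(1_000)   # warm-up run so compilation is not timed
elapsed = @elapsed parallel_work(10_000_000)
println(nthreads(), " threads: ", round(elapsed; digits = 3), " s")
```

A single @elapsed measurement is noisy; repeating the run a few times, or using a benchmarking package, gives a more reliable comparison.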

  • So, setting a proper number of threads in Julia should be based on the number of logical processors in the computer, right? In my case that is 12, as I have
    julia> Sys.CPU_THREADS
    12

  • The above is also true for the number of worker processes, right?

Is this because each process needs its own memory space, so a higher number of processes leads to higher memory usage, which should not exceed the machine’s capacity?

No, it should be based on what you want to accomplish. For example, a friend was running some MCMC procedures while editing his manuscript. He had 6 cores and 12 threads. I advised him to run 4 chains on 4 threads so that he had two real cores still available for interaction while editing the manuscript.

If he had run 12 threads, his machine would have been unusable for editing. Even if he’d run 6 threads, it would not have been interactive, because of the 12 hyperthreads only 6 can run at any one time.
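
That rule of thumb can be written down as a tiny helper. A minimal sketch; `suggest_threads` and its `reserve` keyword are hypothetical names, and the physical core count has to be supplied by you (for example from Hwloc.num_physical_cores()):

```julia
# Hypothetical helper: pick a thread count that leaves `reserve`
# physical cores free for interactive use, but never go below 1.
function suggest_threads(physical_cores::Integer; reserve::Integer = 2)
    return max(1, physical_cores - reserve)
end

n = suggest_threads(6)   # the 6-core machine from the example above
println("suggested threads: ", n)
```

You would then start Julia with that many threads, e.g. `julia --threads 4`.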


Does this mean that the 6 threads are running on 6 cores (one thread on each core), so there is no core left available for editing (in your example)?

Yes, more or less, and if you make 12 threads, only 6 of them are still running at any one time.

Since __init__ runs only once, after the module is loaded, eval might be OK?

module Test
export test, test2

function __init__()
    val = rand(1:10)
    global TEST = val                 # non-const global: test() is inferred as Any
    @eval const global TEST2 = $val   # const defined at load time via eval
end

test() = TEST
test2() = TEST2
end

What do you get when calling hwloc directly from the command line? The big/small core information might be available in the object’s properties and we’re just not printing it. It may be worth trying to extract a core from the topology and looking at its fields, i.e. something like collectobjects(:Core, gettopology())[1].attr. (For me the output is empty, though.)

Is there a heuristic for which kind of application I have, or is it unpredictable?

Apparently, since hwloc version 2.4, lstopo has a --cpukinds option. And in 2.6 they specifically worked on distinguishing high- and low-performance cores for M1 Macs, see hwloc/NEWS at master · open-mpi/hwloc · GitHub. I’ll take a look and will try to update Hwloc.jl accordingly tomorrow.

(Update: Try to support --cpukinds information · Issue #57 · JuliaParallel/Hwloc.jl · GitHub)
