Hey, just found this thread now. Is there a reason why BLAS defaults to 8 threads unless you call BLAS.set_num_threads(n), and why it ignores Julia’s --threads option? I found the issue while investigating this matrix multiplication benchmark: https://github.com/kostya/benchmarks/issues/312. The benchmark does not set the number of threads, so BLAS defaults to 8 on the 12-thread test machine, while numpy uses all threads. Same on my machine: it defaults to 8 threads as well, although my CPU has 16 threads, and the load does not exceed 60%. I first tried to set it with the --threads option and then with the regular thread environment variable. It would probably be less tricky to honor those options when they are set and to use BLAS.set_num_threads(n) only to override those settings(?)
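For anyone who wants to check this on their own machine, here is a minimal sketch that prints the two thread counts side by side. It assumes Julia 1.6 or newer (where BLAS.get_num_threads() is available) and the default OpenBLAS backend; the variable names are only illustrative.

```julia
using LinearAlgebra

# Threads available to Julia tasks (set via --threads or JULIA_NUM_THREADS).
julia_threads = Threads.nthreads()

# Threads used by the BLAS backend (configured separately from --threads).
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
println("CPU threads:   ", Sys.CPU_THREADS)
```

Starting Julia with different --threads values and rerunning this should show whether the BLAS count follows the option on a given setup.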
BLAS’ threading is completely independent from Julia’s threading, there is no reason to tie them together.
I also ran into the BLAS auto-threading recently and was quite surprised that it isn’t limited by the -t option. The way I have viewed Julia’s -t option so far is that it limits the number of threads used by a Julia program. This can be useful in situations where you want to run multiple Julia processes on a multi-core system, without interference due to too many threads being used in total.
One issue with tying them together is that you might want to limit the number of Julia threads in order to get the most out of threading with the BLAS library, or vice-versa (limit BLAS threads since you plan to use them more effectively on the Julia side). So often if one is small you might want the other to be large. So I don’t think it makes sense to tie them to the same flag.
So maybe it’s more of a documentation issue than a “what should -t do” issue. E.g. the Multi-Threading page of the Julia manual doesn’t mention BLAS at all.
Clearly, the solution to this is having our own super fast Julia-BLAS (see Octavian.jl etc.).
I think, like I wrote before, completely tying them together wouldn’t be necessary: one could use the BLAS setting as an override to the --threads option. I.e.:
--threads not set and BLAS.set_num_threads(n) not set: Use either all available threads for BLAS or maybe just 1 thread/Julia’s default number of threads to keep it consistent. Do not use 8 threads regardless of the machine architecture by default.
--threads set and BLAS.set_num_threads(n) not set: Use the number of threads passed to --threads for everything, including BLAS.
--threads not set and BLAS.set_num_threads(n) set: Use the number of threads set for BLAS for BLAS and the defaults for the rest of Julia (= behaviour like it is now).
--threads set and BLAS.set_num_threads(n) set: Use --threads for Julia and the BLAS option for BLAS (= behaviour like it is now).
This way you have one or two pitfalls less and are still able to control both thread counts individually if you feel like doing so.
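The four cases above can be sketched as a small function. To be clear, resolve_thread_counts is a hypothetical helper, not something Julia provides, and two choices in it are assumptions: with nothing set, BLAS gets all available threads (one of the two options mentioned in the first case), and Julia's own default is taken to be 1 thread.

```julia
# Hypothetical helper, not part of Julia: resolves both thread counts under
# the proposed rules. `nothing` means "the user did not set this option".
function resolve_thread_counts(julia_opt::Union{Nothing,Int},
                               blas_opt::Union{Nothing,Int};
                               cpu_threads::Int = Sys.CPU_THREADS,
                               julia_default::Int = 1)  # assumed default
    julia_n = something(julia_opt, julia_default)
    blas_n = if blas_opt !== nothing
        blas_opt       # an explicit BLAS setting always wins for BLAS
    elseif julia_opt !== nothing
        julia_opt      # --threads alone applies to BLAS as well
    else
        cpu_threads    # nothing set: all available threads, no cap at 8
    end
    return (julia = julia_n, blas = blas_n)
end

resolve_thread_counts(4, nothing; cpu_threads = 16)  # (julia = 4, blas = 4)
resolve_thread_counts(4, 2; cpu_threads = 16)        # (julia = 4, blas = 2)
```

Both counts stay individually controllable; the only change compared to today is what happens when the BLAS count is left unset.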
I just think the current behaviour of Julia is counterintuitive and BLAS defaulting to use 8 threads regardless of the machine architecture also seems like an odd choice and a pitfall. I read somewhere here that this was done because most users had 4 core machines with 8 threads so 8 threads was chosen, but why not simply fetch the thread count of the current machine and use that instead like numpy seems to be doing?
If those are all faster than Julia’s internal implementation and circumvent those issues, is there a reason not to make one of them the default implementation in Julia? (As you might have guessed, I am new to Julia and Julia’s community.)
Architecture specific optimizations and larger matrices are (as far as I know) still places where regular BLAS shines. It’d also be a considerable amount of effort to integrate one of those packages into Base and make sure it’s all compatible and doesn’t introduce regressions.
That said, I think it’s reasonably likely that 2 years from now we will have this integrated. The biggest challenge is that doing this requires the whole stack for it to be very stable.
I don’t think that’s the case? According to the issue “BLAS threads should default to physical not logical core count?” (Issue #33409, JuliaLang/julia on GitHub), it chooses it based on Sys.CPU_THREADS (and that issue is about how it should be based on the number of cores, not the number of threads).
It does max out at 8 though.
julia> BLAS.get_num_threads()
8
julia> Hwloc.num_physical_cores()
20
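If the default turns out lower than what the machine offers, the count can be raised by hand. A minimal sketch, assuming Julia 1.6 or newer: raise_blas_threads! is a hypothetical helper name, and using Sys.CPU_THREADS as the target is just one possible choice (the physical core count, e.g. from Hwloc, may well be the better one).

```julia
using LinearAlgebra

# Hypothetical helper: raise the BLAS thread count to `target` if it is
# currently lower, and return the count that is actually in effect.
function raise_blas_threads!(target::Integer = Sys.CPU_THREADS)
    if BLAS.get_num_threads() < target
        BLAS.set_num_threads(target)
    end
    # Read the value back, since the backend may clamp the request.
    return BLAS.get_num_threads()
end

println("BLAS threads now: ", raise_blas_threads!())
```

Whether more BLAS threads actually help depends on the matrix sizes and the backend, so it is worth benchmarking before and after.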
Also, welcome @Manuel_Bergmann! I think it’s worth splitting this topic into a dedicated discussion since it’s valuable in its own right.
I’m not the most qualified person to comment on this, but perhaps part of the reason that they are separate is because BLAS’s scheduler (?) and Julia’s scheduler (?) don’t currently talk to each other in the way that two threaded pieces of Julia code somehow compose efficiently magically and for free (from the perspective of an end-user like me). Another word that I don’t understand but that seems to come up is partr.
I’ve been bitten several times by, for example, wanting to thread-map a bunch of cholesky! calls over a Vector{Symmetric{...}}. It was meaningfully slower than the single-threaded map call, and after some help from @tro3 and the excellent logging abilities of ThreadPools.jl I came to understand that the BLAS threading stuff and the Julia threading stuff don’t really play nicely with each other. So at the very least until that’s resolved, I absolutely love being able to call BLAS.set_num_threads(1) when I want to run ThreadPools.tmap(cholesky!, large_vec_of_small_matrices), and then crank the BLAS threads back up when I’m working with big matrices. But it’s definitely a gotcha until you realize what’s going on.
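A rough sketch of that pattern, for reference. It is not the exact code described above: it uses Threads.@threads from Base instead of ThreadPools.tmap so that it runs without extra packages, and the non-mutating cholesky instead of cholesky!. The names with_blas_threads and random_spd are hypothetical helpers introduced only for this example.

```julia
using LinearAlgebra

# Hypothetical helper: run `f` with BLAS limited to `n` threads, then restore
# the previous setting even if `f` throws.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

# Hypothetical helper: a random symmetric positive definite n-by-n matrix.
random_spd(n) = (A = randn(n, n); A * A' + n * I)

# Many small matrices: parallelize over the matrices, not inside BLAS.
mats = [random_spd(50) for _ in 1:200]

factors = with_blas_threads(1) do
    out = Vector{Cholesky{Float64, Matrix{Float64}}}(undef, length(mats))
    Threads.@threads for i in eachindex(mats)
        out[i] = cholesky(Symmetric(mats[i]))
    end
    out
end
```

Restoring the old count in a finally block means the later big-matrix work gets its BLAS threads back without a second manual call.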
I like the principle of least surprise and BLAS indeed violates it a bit. Also, the BLAS multi-threading is fundamental to many packages using operations on (large) matrices, yet those are not all uses of Julia I would expect. Can’t hurt to make a mention of it in the docs.
Related to the thread counts and configuration of BLAS: I have noticed that there is a note on BLAS in https://github.com/JuliaLang/julia/blob/master/NEWS.md#linearalgebra and in particular https://github.com/staticfloat/libblastrampoline/
Does this mean that in 1.7 it will be easy to swap out BLAS implementations instead of relying on MKL.jl/etc.?
That’s the idea. With a little luck Octavian (BLAS in Julia) will also be easy to plug in.
Well, loading the package with using MKL will be the way to dynamically switch to MKL. But yes, it’s entirely different from what MKL.jl did before.
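To see which backend is currently active, something like the following should work. This is a sketch that assumes Julia 1.7 or newer, where BLAS.get_config() is available; the using MKL line is left commented out because it needs the MKL.jl package to be installed.

```julia
using LinearAlgebra

# Uncomment to switch the backend (assumes the MKL.jl package is installed):
# using MKL

# List the BLAS/LAPACK libraries currently loaded (Julia 1.7+).
config = BLAS.get_config()
for lib in config.loaded_libs
    println(lib.libname)
end
```

Running it once without and once with the using MKL line makes the switch visible in the printed library names.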
Another option to note here: Strided.jl (Jutho/Strided.jl on GitHub, a Julia package for strided array views and efficient manipulations thereof) has an option to replace the BLAS threads with Julia’s threads.
I believe this conversation underlines my point pretty well. And that behavior (if I got this right: it automatically chooses the correct number of threads but only up to 8 and then it just chooses 8) honestly sounds even worse than expected, because it won’t be an issue until you try your code on a machine with more than 8 cores and it can be really tricky to identify. It is yet another thing you simply need to know and it’s really not intuitive. I assume, most people will just run straight into that, wasting time searching for why their code does not fully utilize their CPU. Especially beginners who want to play with the language and find out if they want to stick with it. But from the conversation above it seems that this is not something even more experienced people necessarily seem to be aware of (the upper limit).
Side note: Maybe explaining this issue should go into the performance tips? I think it would be a good fit there.
I think this discussion shouldn’t be about how easy it is to circumvent the problem, but about how many beginners, students etc. (like me) will run into this issue. If the mechanism for loading different BLAS packages changes in Julia 1.7, this might be a good opportunity to change this point as well(?)
I think the reason that it maxes out at 8 is due to OpenBLAS, not Julia. It probably does not happen if you use MKL instead. I don’t know the reason for this, maybe the OpenBLAS threading becomes inefficient above 8 threads.