Why is -t auto not the default for auto-threading? And what to do about it? Would users want it by default?

I realize I can start my program like this:

$ julia -t auto test.jl

but I would also like to be able to just click on the file to run it (e.g. on Windows, where the file extension association should run it with julia for me), or run it like this, and still get threading:

$ ./test.jl

I can do that on Linux with a shebang, i.e. the first line of the main .jl program file being something like this (or something shorter); such a line is ignored on Windows:

#!/home/pharaldsson/.julia/juliaup/bin/julia -t auto

But I can’t, nor do I want to have to, run it like this (here the options would go to the script, not to julia):

$ ./test.jl -t auto

If people are worried about changing the default to auto, then it’s in the hands of users to run programs like:

$ julia -t 1 test.jl

or control with the ENV var, or at least on Linux/non-Windows with this first line:

#!/home/pharaldsson/.julia/juliaup/bin/julia -t 1

if you want to have threading control in the programmer’s hands (where it seems like it should be).

I think I know the pros and cons. At one point the threading API was experimental, so I guess it was considered safer to have it opt-in. Given the potential cons, would people think such a change needs to wait for 2.0? I think that’s overkill; it should just be documented in NEWS. There was, however, a notification recently that seems not to be relevant here, since it applies with or without threads (how might it be fixed; in 2.0 only?):

This is not actually a problem with multithreading specifically, but really a concurrency problem, and it can be demonstrated even with a single thread. For example:

$ julia --threads=1

By now it seems like a misfeature to start with just 1 thread. Threads are cheap to make, auto gives me 16 on my computer (same as virtual cores), and a potential 16x speedup, and it will grow larger with time as computers get more cores.

For those who don’t know, threads open the door to race conditions, i.e. bugs, but only if you actually use them; and by “you” I mean your whole program: if neither your code nor any of your dependencies uses threads, then you’re safe. On the flip side, if you don’t use threads in your own “program” (e.g. because you are unaware of them) but one or more of your dependencies do, you are missing out on the speed of those (hopefully) correctly coded libraries.
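To make the race-condition point concrete, here is a minimal sketch of a counter that stays correct no matter how many threads Julia was started with. The function name `count_atomic` is made up for this post; `Atomic` and `atomic_add!` are from `Base.Threads`. A plain `counter += 1` on a shared variable inside the same loop could lose updates once more than one thread is running.

```julia
using Base.Threads: @threads, Atomic, atomic_add!

# Hypothetical example function: count to `n` from however many threads exist.
# The atomic add makes each increment indivisible, so no update is lost.
function count_atomic(n::Integer)
    counter = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(counter, 1)
    end
    return counter[]
end

println(count_atomic(10_000))  # prints 10000 with -t 1 and with -t auto
```

With `-t 1` the loop simply runs serially, so code written this way behaves the same under either default.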

If none of the code uses threads, which is actually opt-in, in the code, then you are just making 15 extra threads, very quickly, that just sit around not wasting any resources to speak of.

$ hyperfine 'julia -e ""'
Benchmark 1: julia -e ""
  Time (mean ± σ):     212.9 ms ±  13.7 ms    [User: 146.9 ms, System: 86.1 ms]
  Range (min … max):   186.2 ms … 225.9 ms    14 runs

$ hyperfine 'julia -t auto -e ""'
Benchmark 1: julia -t auto -e ""
  Time (mean ± σ):     212.5 ms ±  12.6 ms    [User: 148.7 ms, System: 97.2 ms]
  Range (min … max):   189.9 ms … 229.1 ms    13 runs

Now, one other issue is that on one computer you might get 4 threads, and on a larger one 16, or 256. In the best case that means faster; in the worst case it means slightly different, non-bit-identical results, which just comes with the territory of parallelism. Also, if you do I/O in a threaded loop, the order is non-deterministic by design, which is why you shouldn’t be doing that…
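A rough sketch of why results can differ with the thread count: floating-point addition is not associative, so splitting a sum into a different number of chunks changes the order of the additions. `chunked_sum` is a name invented for this illustration, and it assumes a non-empty vector.

```julia
using Base.Threads: @spawn

# Hypothetical helper: sum `xs` in `nchunks` contiguous pieces, each on its own task.
# Assumes `xs` is non-empty.
function chunked_sum(xs::AbstractVector{Float64}, nchunks::Integer)
    chunks = Iterators.partition(xs, cld(length(xs), nchunks))
    tasks = map(chunks) do c
        @spawn sum(c)
    end
    return sum(fetch.(tasks))
end

xs = [1.0 / i for i in 1:100_000]
s4  = chunked_sum(xs, 4)    # as if run on a 4-thread machine
s16 = chunked_sum(xs, 16)   # as if run on a 16-thread machine

println(s4 == s16)          # may be false: the last bits can differ
println(isapprox(s4, s16))  # the results agree to within rounding
```

Code that compares such results should use `isapprox` (or `≈`) rather than `==`.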

I believe the whole package ecosystem should be tested with threads, i.e. where it applies. Why opt into using them in the code if you do not take responsibility for the code and its correctness? If people are really worried, I could see something like (compiled) packages using auto and regular packages being limited to one thread. Or vice versa, or possibly individual packages opting in or out. Most naturally, packages opt out by default by not using @threads in their own code, though their dependencies or their callers (e.g. regular user code) could still use it.

For some reason threads can slow startup, but I believe it’s only when you oversubscribe, which auto could never do:

$ time julia -t 32 -e ""

real	0m0,258s
user	0m0,435s
sys	0m0,235s

$ time julia -t 320 -e ""

real	0m0,734s
user	0m7,111s
sys	0m0,367s

I would like it to be the default. In fact, in VS Code it is the default.

I prefer the current default. First, hyperthreading can be slower than setting the number of threads to the number of physical cores. -t auto will use the number of virtual cores. More importantly, I often start a Julia program on a machine where I will continue to do interactive work. I don’t want the program to use all the cores.

A new default for “auto” could be the number of physical cores (meaning N/2 of the hyperthreaded virtual cores).

You could still do that; it just wouldn’t be the default. And if the number of physical cores is too much for you, would you agree with N/2 of the physical cores (N/4 of the virtual ones)? Actually, using all the cores shouldn’t be a problem; ideally the OS should take care of giving other programs access, so which OS do you use? Instead of a fraction, do you think using all cores except 1 or 2 might be a good idea, or some hybrid of that and a fraction?

Note, as commented, VS Code does use the (current) “auto”, so is that already problematic for you?

I think if we are going to introduce a new default for Julia (in 1.10) we could be conservative, and could tune this further later.

I use Linux. Using other programs interactively is not prevented, but there are lags. For example, when Julia is busy compiling a big package, for which it uses the whole machine, it interferes with other work.

I don’t use VS Code.

I know I can “still do that”; I just prefer the default the way it is now.

Did you try setting the environment variable JULIA_NUM_THREADS to the default that you like: Environment Variables · The Julia Language ?
I’m not sure whether that is taken into account when you double-click on a .jl file in Windows; worth a try…
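One way to check whether the variable was picked up, whichever way the script was launched, is to print it next to the actual thread count at the top of the script. This is only a small diagnostic sketch; nothing in it is specific to Windows.

```julia
# Print what was requested via the environment and what Julia actually started with.
requested = get(ENV, "JULIA_NUM_THREADS", "(not set)")
println("JULIA_NUM_THREADS  = ", requested)
println("Threads.nthreads() = ", Threads.nthreads())
```

If the two disagree, a `-t`/`--threads` option on the command line is the likely reason, since it takes precedence over the environment variable.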

See Slack

One reason -t auto can’t just be made the default (that is, without thinking it through carefully) is that we currently also start plenty of OpenBLAS threads by default. That’s fine if we only have a single Julia thread but generally a terrible idea when Threads.nthreads() > 1. Otherwise put: What’s more important (as a default), multithreaded Julia or multithreaded BLAS?

Well, the issue is how to know whether SMT (“hyperthreading”) is activated/available on the system in the first place. Julia currently can’t figure these things out. We’d need to add a dependency for this like hwloc.
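To illustrate the limitation: without something like hwloc, the best a script can do is guess. The sketch below assumes 2-way SMT, which is exactly the assumption Julia cannot verify on its own; `conservative_threads` and its `smt_ways` keyword are hypothetical names, while `Sys.CPU_THREADS` is the logical core count that Julia does know.

```julia
# Hypothetical helper, not part of Julia: guess a "physical core" count by
# ASSUMING `smt_ways` hardware threads per core (2 is only a guess).
conservative_threads(logical::Integer = Sys.CPU_THREADS; smt_ways::Integer = 2) =
    max(1, logical ÷ smt_ways)

println("logical cores:        ", Sys.CPU_THREADS)
println("guessed thread count: ", conservative_threads())
```

On a machine with SMT disabled (or with no SMT at all) this guess would leave half the cores idle, which is why a real default would need proper topology detection.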

I can not say for certain in your case, but it is very probable that the lag comes because of RAM/SWAP overhead during compiling, not because of multithreading.

On Linux, most shells would allow you to create an alias.

$ julia -E "Threads.nthreads()"
1
$ alias julia='julia -t auto'
$ julia -E "Threads.nthreads()"
8

You can put the line alias julia='julia -t auto' in your ~/.bashrc.

I think these things should be handled by the package developer, and not by the user. Everything should be started by default with potential multi-threading, and a package that uses Julia threading and calls BLAS routines from within the threaded code must deal with this internally to obtain the performance it wants.

Sometimes people ask here why some code gets slower with threading, and learn that it is because it competes with BLAS threads. Such a user is already trying to play with parallelism and will unavoidably have to deal with these things.

  • The question here is about the Julia default, not what packages do. (Think e.g. LinearAlgebra)
  • Note that this can’t really be handled on the package level. For one, the package can’t know if the user is itself calling library functions from multiple threads (BLAS multithreading just isn’t composable and must be fine tuned for the specific application). Also setting the number of BLAS threads isn’t package local but a global setting.
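For reference, this is roughly what the manual tuning looks like today. `tune_blas_threads!` is a made-up helper for this post, and whether dropping BLAS to a single thread is the right choice depends entirely on the application; `BLAS.set_num_threads` and `BLAS.get_num_threads` are from the LinearAlgebra standard library (the getter assumes Julia 1.6 or later).

```julia
using LinearAlgebra: BLAS

# Hypothetical helper: if Julia itself runs multithreaded, drop BLAS to one
# thread to avoid oversubscription. This is a GLOBAL setting: it affects every
# package in the session, which is why a package cannot safely decide it.
function tune_blas_threads!()
    if Threads.nthreads() > 1
        BLAS.set_num_threads(1)
    end
    return BLAS.get_num_threads()
end

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", tune_blas_threads!())
```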

The issue is the lack of composability of BLAS threads, which requires tuning in any case. I don’t see how limiting one or both in the default settings helps the user. The user will in any case reach situations in which manual intervention is necessary.

The “naive” user in any case already starts Julia multi-threaded when searching for performance.

Showing both the number of Julia threads and the number of BLAS threads in the startup message is the only helpful thing I can see.

I really feel your pain, but you’re describing a different issue that doesn’t apply to -t auto, I believe. So I want to explain that one and why I think it doesn’t apply to threading.

Julia takes a lot of memory, and when you have many processes (not threads), you multiply RAM use by the process count. When you [pre]compile even one package, you most often precompile other packages too, i.e. its dependencies (and, what is worse, I’ve seen unrelated packages updated and thus also precompiled needlessly), and that runs in parallel in multiple Julia processes (in theory I could see threads being used for the parallel precompilation instead). You can limit the parallel precompilation, or disable it entirely, with an environment variable. I don’t do that; most often I just kill it interactively by pressing CTRL-C, since yes, it can be a problem: it can hang my machine for a long time or run out of memory (OOM).

Threads, on the other hand, serve a similar purpose (parallelism) but work within the same process. E.g. if you have a loop that doesn’t allocate, then adding the @threads macro to it doesn’t allocate any more memory to speak of. You will not get any more swapping/paging, nor put more pressure on the shared caches, though you will use more cores and thus the L1 cache of more than one core. I think it doesn’t have bad effects, at least not nearly as bad as in the other scenario, and it can’t OOM.
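A small sketch of the kind of loop meant here: an in-place update whose body allocates nothing, so running it on more threads touches no extra memory beyond a small, fixed amount of task bookkeeping from `@threads` itself. The function name `axpy_threaded!` is chosen for this example only.

```julia
using Base.Threads: @threads

# Hypothetical example: y .= a .* x .+ y, split across the available threads.
# Each iteration writes only to its own y[i], so there is no data race, and
# the loop body itself does not allocate.
function axpy_threaded!(y::AbstractVector, a::Number, x::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    @threads for i in eachindex(x, y)
        @inbounds y[i] += a * x[i]
    end
    return y
end

y = ones(10)
axpy_threaded!(y, 2.0, fill(3.0, 10))
println(y[1])  # prints 7.0
```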

If, however, the for loop allocates memory (or calls a function that does), then you have a known problem with the GC. Ideally the GC will keep up and deallocate as fast as you allocate, i.e. no slower than single-threaded code would. There’s a recent discussion about it here on Discourse. I believe a PR with a fix will land any day now. But such code should ideally not be put into production; if you know threading produces this problem, you wouldn’t put it in your code/package. I want to stress again that starting Julia with (many) threads, or with auto, doesn’t really have any effect to speak of if your code doesn’t actively use threads. It seems the fear may be overblown, since auto is the default in VS Code and lots of people use it for their code, often with lots of different packages from the ecosystem, many of which must use threads.

Do you still think -t 1 is the better default? Others may disagree and think they’re missing out on a speed-up; as soon as they learn of this they may enable -t auto. And people who are unaware of it will keep using the old, slower default.

No new default number of threads is safe (nor even the current default of 1, in some very unusual situations). But that also means everyone currently opting into a non-default, on e.g. 1.9, needs to know it has potential issues, and that the programmer or user currently takes the responsibility.

But I think here’s a way to have your cake and eat it too. I already made a PR to make -t auto the default, well actually -t auto,1 since I decided last minute that having that interactive thread might be a better default. That was before people brought some issues up, and I agree threads are unsafe, actually never will be fully safe until Julia radically changes.

So this PR was made a draft. I think the solution is to enable -t auto,1 in a restricted sense, i.e. you would STILL be limited to 1 thread by default and Julia would just lie to you, and maybe forever return this:

julia> using .Threads; nthreads()
1

But if the programmer opts into the threads, and takes the responsibility, with enable_threads(), then the program will see all the threads, and you may get a different result (though nthreads() is seemingly a bad API that we want to drop for 2.0, so I’m not sure). This is the simplest change I can come up with that might be accepted right away, maybe for 1.10, since it doesn’t really change anything… I explain the issue, and a proposal for a SafeThreads module, here:

That’s an alternative and/or possibly combined with the above simple change.