The documentation says I can specify the number of threads with the JULIA_NUM_THREADS variable. Often, during testing, I need to experiment with different values. Having to restart the REPL every time makes this quite inefficient. Is there any way to set the number of threads interactively?
No, there is not.
The threading support is (as documented) experimental,
and will be fully replaced in the 1.x timeframe.
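Since the thread count is fixed when a session starts, one possible workaround for experimenting is to launch short-lived child Julia processes from the REPL, each with its own JULIA_NUM_THREADS. The following is only a sketch under assumptions: the helper name nthreads_in_child is made up for illustration, and it relies on the child reading the variable at startup as the documentation describes.

```julia
# Hypothetical helper: start a fresh Julia process with `n` threads and
# report how many threads that process actually sees.
function nthreads_in_child(n::Integer)
    # Path to the julia executable that is running this session.
    exe = joinpath(Sys.BINDIR, Base.julia_exename())
    cmd = `$exe --startup-file=no -e 'print(Threads.nthreads())'`
    # The variable is set only for the duration of the do-block;
    # the child process inherits it, the current session is unchanged.
    out = withenv("JULIA_NUM_THREADS" => string(n)) do
        read(cmd, String)
    end
    return parse(Int, strip(out))
end
```

Replacing the -e expression with a benchmark script would let you loop over several thread counts without restarting the REPL yourself, at the cost of paying startup and compilation time in every child.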
I know the manual says experimental, but I’ve gotten the impression somewhere that it’s quite usable except for I/O, which I know would crash Julia. I can avoid I/O in my workload, so that’s not really an issue. I do hope regular things like arrays and data frames are already thread-safe.
Thanks for the link. It looks promising. The question is what’s the
Observations from my tests of multi-threading:
- inconsistent performance
- occasional crashes with 36 or more threads
I guess I will stay away from that for now…
Yeah, 100% it is quite usable.
The experimental part is about the fact that it will change in 1.x (rather than being a breaking change to the API requiring 2.0)
And that it is missing some niceness you might really expect: like the ability to create new threads.
Because of the closure bug, and because writing fast threaded code is hard,
it can be difficult to get it to go fast.
I wrote a blog post where I ultimately failed to get faster than serial.
I managed to get scaling on 64+ threads with some work. Two things to consider. First, get the allocations inside the threaded loop down to x kB, ideally to 0; I found this increased scaling by a lot. Second, which may affect your success with the first, take a look at the @threads macro and write that code out yourself, working out which variables lead to the closure bug and putting them inside a let ... end block, with type annotations where needed. I found that on 0.6.2 it’s not possible to get rid of allocations completely, precisely because of the bug, but I managed to get them down to a couple of kB while performing an AD energy calculation/derivative on an FE grid of 3 million elements using around 100 threads, so it really is possible.
Allocations are what really hampers performance and consistency, and the closure bug is one of the things that leads to them. The gc basically makes your program go back to serial, and it is especially impactful when you have a lot of threads. Doing the above should get you to almost linear scaling, depending on your particular problem and how difficult it is to reach 0 allocations.
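To make the let-block advice concrete, here is a minimal sketch. It uses the stock Threads.@threads macro rather than a hand-written version of it, and the function name scaled_squares! and the workload are invented for illustration. The idea is that rebinding the captured variables in a let block gives the closure created by @threads fresh bindings that are never reassigned, and writing into a preallocated array keeps the loop body free of allocations.

```julia
# Hypothetical example: out[i] = scale * xs[i]^2, computed across threads.
function scaled_squares!(out::Vector{Float64}, xs::Vector{Float64})
    scale = 1.0
    scale = 2.0 * scale          # reassigned, so capturing it directly may box it
    # Rebind everything the loop body uses; the closure built by @threads
    # then captures these fresh bindings instead of the reassigned one.
    let scale = scale, out = out, xs = xs
        Threads.@threads for i in eachindex(out, xs)
            @inbounds out[i] = scale * xs[i]^2   # writes in place, no temporaries
        end
    end
    return out
end

xs  = collect(range(0.0, 1.0; length = 1_000))
out = similar(xs)
scaled_squares!(out, xs)         # first call includes compilation
```

After the warm-up call, something like @allocated scaled_squares!(out, xs) shows what is left; expect a small fixed overhead from the threading machinery itself, not necessarily exactly zero.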
If that doesn’t help much and you are doing some heavy linear algebra, one other thing I ran into that affected things a lot is that BLAS threads and Julia threads do not like each other too much. This means that if you have a 24-core machine and use 24 Julia threads, you should set your BLAS threads to 1 (BLAS.set_num_threads(1)) so that the total matches the number of cores you are working with. Some experimentation with the best combination can lead to some improvement (tens of percent, I noticed), but overthreading definitely destroys performance.
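A small sketch of that setup, assuming the work consists of many independent matrix products (the function batched_mul! and the matrix sizes are made up for illustration). BLAS.set_num_threads is in the LinearAlgebra standard library; BLAS.get_num_threads, used here only to check the setting, needs a reasonably recent Julia version.

```julia
using LinearAlgebra

# One BLAS thread, so the Julia threads alone decide how many cores are busy.
BLAS.set_num_threads(1)

# Hypothetical workload: Cs[k] = As[k] * Bs[k] for many independent pairs.
function batched_mul!(Cs, As, Bs)
    Threads.@threads for k in eachindex(Cs, As, Bs)
        mul!(Cs[k], As[k], Bs[k])   # in-place product, result goes into Cs[k]
    end
    return Cs
end

As = [rand(8, 8) for _ in 1:16]
Bs = [rand(8, 8) for _ in 1:16]
Cs = [zeros(8, 8) for _ in 1:16]
batched_mul!(Cs, As, Bs)
```

Whether 1 BLAS thread is the best split depends on the problem, so it is worth timing a few combinations of Julia and BLAS thread counts that multiply out to the core count.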