I would like to ask whether Julia has an automatic parallelization tool similar to the auto-parallelization capabilities of the Intel/gcc compilers for Fortran/C++. It would be really awesome if the Julia compiler could transform standard serial code into parallel code!
Yeah, you can do all kinds of things. It depends on what level you call automated, though. There are things like CuArrays and DistributedArrays that recompile your code for GPUs and distributed CPUs respectively, KernelAbstractions.jl that recompiles quite a big set of Julia code for GPUs, and more recently things like ModelingToolkit that will take Julia ODE code and recompile it in a multithreaded way.
Thank you for your reply!
I meant “highest-level”/implicit parallelism. As far as I understand the Intel Fortran/C++ compiler, it basically takes a whole program and (if the auto-par option is enabled) automatically searches the whole code for parallelizable parts (loops, etc.), so the programmer doesn’t have to explicitly declare which parts of the code should be parallelized. Especially for less experienced programmers (like me), I would guess that good compiler optimization could provide better results than explicit parallelization.
As far as I understand, the general philosophy that’s been taken so far by the Julia developers is that an optimization should only be automatically applied if they know for sure that:
1. It’s correct / safe to apply the optimization.
2. The optimization won’t accidentally hurt performance.
Unfortunately, implicit multi-threading makes both of the above criteria very difficult to satisfy. Even if the safety / correctness concern were satisfied (which is not trivial to do), multi-threading has a lot of overhead. The general heuristic is that it takes about 1 microsecond to spawn a multi-threaded task in Julia, which is on the order of 1000 CPU cycles. This means that if I write
    for i in 1:N
        f(i)
    end
and it takes less than ~10 microseconds to run that loop, it was probably a mistake to try to multi-thread it. However, the amount of time the loop takes to run depends not only on N, but also on the details of f. Knowing how to make that call correctly is not something our compiler currently does, or is likely to do anytime soon.
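To put a number on that overhead on your own machine, here is a rough sketch that times spawning and waiting on a do-nothing task. The helper name spawn_overhead and the default of 10_000 repetitions are just assumptions for illustration; the result will vary with hardware, Julia version, and thread count, so treat it as a ballpark rather than a benchmark.

```julia
# Rough sketch: estimate the average cost of spawning one task and waiting for it.
# `spawn_overhead` is a hypothetical helper name, not something from Base.
function spawn_overhead(n::Integer = 10_000)
    noop() = nothing
    wait(Threads.@spawn noop())      # warm-up, so compilation is not timed
    t = @elapsed for _ in 1:n
        wait(Threads.@spawn noop())  # spawn a task, then block until it finishes
    end
    return t / n                     # average seconds per spawn + wait
end

println("approx. overhead per task: ", spawn_overhead() * 1e6, " microseconds")
```

If the body of your loop runs in less time than a handful of these spawns, threading it is unlikely to pay off.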
Instead, we generally insist that the programmer opt in to optimizations like multi-threading explicitly, because they know more about their program than the compiler does. However, we try to make it very easy to opt in to these sorts of things, which is where the performance annotations in Base (Threads.@threads, @simd, @fastmath, etc.) and various packages like KernelAbstractions.jl, LoopVectorization.jl and ThreadsX.jl come in.
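As a minimal sketch of what opting in looks like with just the Base annotations: the function names (fill_serial!, fill_threaded!, sum_simd) and the workload f below are made up for illustration, and this assumes each iteration is independent of the others, which is exactly the thing the programmer is vouching for.

```julia
using Base.Threads: @threads, nthreads

# Serial reference version: out[i] = f(i) for every index.
function fill_serial!(out, f)
    for i in eachindex(out)
        out[i] = f(i)
    end
    return out
end

# Explicit opt-in to multi-threading. Each iteration writes only to its
# own slot out[i], so iterations do not race with each other.
function fill_threaded!(out, f)
    @threads for i in eachindex(out)
        out[i] = f(i)
    end
    return out
end

# Explicit opt-in to SIMD: @simd lets the compiler reorder the
# floating-point additions in this reduction.
function sum_simd(xs)
    s = zero(eltype(xs))
    @simd for i in eachindex(xs)
        @inbounds s += xs[i]
    end
    return s
end

f(i) = sqrt(i) * sin(i)              # stand-in workload, purely illustrative
a = fill_serial!(zeros(1_000), f)
b = fill_threaded!(zeros(1_000), f)
println("threads available: ", nthreads())
println("serial and threaded results match: ", a == b)
```

Note that Julia has to be started with more than one thread (for example julia -t 4, or by setting JULIA_NUM_THREADS) for @threads to actually run anything in parallel; with a single thread the loop still runs correctly, just serially.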