In my current project, the rate-limiting step of a simulation model is evaluating the element-wise exponential of a vector with about N = 30,000 elements:
# Julia MWE
using Random
trials = 10^4  # number of exponentials
N = 3*10^4     # size of vector
U = 2e4/50 * rand(N)  # example values whose exponentials I need
t0 = time()
for n = 1:trials
    V = Base.Math.exp_fast.(-U);
end
t1 = time()
println(t1 - t0)
This takes about 3.5s on my machine, which is not bad. However, the same thing takes only 0.7s in Matlab:
% MATLAB MWE
trials = 10^4;
N = 3*10^4;
U = 2e4/50 * rand(N,1);
tic
for n = 1:trials
    V = exp(-U);
end
toc
I’m reluctant to translate my code from Julia to Matlab just for a faster element-wise exponential. However, I’m out of ideas to speed up the Julia code, and given the number of simulations I’m running, that 5x speed-up is pretty compelling.
Any suggestions?
ADM
P.S. versioninfo() output for Julia is below, and please let me know if any additional information would help.

julia> versioninfo()
Julia Version 1.10.0-DEV.250
Commit 516b097bd7 (2022-12-31 17:57 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 Ă— 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)

julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(-U);
       end
  1.313019 seconds (88.98 k allocations: 4.472 GiB, 5.01% gc time)

julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(.-U);
       end
  1.186579 seconds (178.58 k allocations: 2.243 GiB, 2.89% gc time, 2.46% compilation time)

julia> @time for n = 1:trials
           V = @turbo exp.(.-U);  # @turbo requires the LoopVectorization.jl package
       end
  0.256981 seconds (88.98 k allocations: 2.237 GiB, 7.41% gc time)

Around a 5x speedup, though note that I have AVX512.
Matlab uses multiple threads by default for vector operations. Julia does not.
(You can tell Julia to use multiple threads, e.g. with LoopVectorization’s @tturbo macro after starting Julia with threads, but it is usually going to be more efficient to parallelize at a coarser level: parallelize loops that do more work in the loop body than a single elementary “vectorized” operation, as sketched below. For that matter, the traditional “vectorized” style of Matlab leaves a lot of performance on the table, because it involves allocating lots of temporary vectors and doing too little work per iteration.)
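For example (an editorial sketch, not from the original thread; simulate!, run_simulations, and nsims are illustrative names), coarse-grained parallelism over independent simulations might look like:

using Base.Threads

# Hypothetical per-simulation kernel: all of one simulation's
# per-element work, including the exponential, in a single loop.
function simulate!(V, U)
    for i in eachindex(V, U)
        V[i] = exp(-U[i])   # ...plus whatever else the model does per element
    end
    return V
end

# One task per simulation, each writing into its own preallocated
# buffer, rather than threading inside each elementary vectorized call.
function run_simulations(U, nsims)
    results = [similar(U) for _ in 1:nsims]
    @threads for s in 1:nsims
        simulate!(results[s], U)
    end
    return results
end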
In particular, if I start Julia with julia -t 6 (to tell it to use all 6 of my cores), I get:
julia> using LoopVectorization, BenchmarkTools
julia> U = rand(3*10^4);
julia> @btime @. exp(-$U);
97.625 ÎĽs (2 allocations: 234.42 KiB)
julia> @btime @turbo @. exp(-$U); # single-threaded, but SIMD accelerated
32.583 ÎĽs (2 allocations: 234.42 KiB)
julia> Threads.nthreads() # check that we have 6 threads
6
julia> @btime @tturbo @. exp(-$U); # multi-threaded
7.208 ÎĽs (2 allocations: 234.42 KiB)
More than 10x speedup by using multithreading. But again, focusing on individual “elementary” vectorized functions like this is the wrong thing to do if you want to take full advantage of Julia.
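To make that concrete (an editorial sketch added here; exp(-u)*u just stands in for whatever combination of operations a real model needs):

using LoopVectorization

U = rand(3*10^4)
V = similar(U)

# Matlab-style composition: each step is a separate pass over memory
# with its own temporary array.
t1 = -U
t2 = exp.(t1)
out = t2 .* U

# Julia broadcast fusion: the whole expression compiles to one loop
# with a single output allocation.
out = @. exp(-U) * U

# Writing into a preallocated V eliminates even that allocation, and
# @turbo SIMD-vectorizes the exp as well.
@turbo @. V = exp(-U) * U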
Note that I’m using the wonderful BenchmarkTools.jl package’s @btime macro, which automatically runs the function many times in a loop to get reliable timing statistics. The $U is used by @btime to eliminate the performance overhead of global variables; in actual code you would not have the $, but would also hopefully be writing code in functions rather than using globals.
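For instance (a sketch added here; neg_exp! is just an illustrative name), the function-based style looks like:

using BenchmarkTools

# The kernel takes its arguments explicitly and writes into a
# preallocated output, so nothing in the hot loop is a global.
neg_exp!(V, U) = (@. V = exp(-U); V)

U = rand(3*10^4)
V = similar(U)
@btime neg_exp!($V, $U);   # $ only needed here because U and V are REPL globals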
Amazing that changing -U to .-U makes such a big difference (the non-dotted -U allocates a separate temporary vector before the broadcast, while .-U fuses the negation into the same loop, which is why the allocation totals above drop by half). I wonder where else that’s lurking in my code…
Regarding @turbo and @stevengj’s points, in practice this exponential is just one (particularly expensive) part of the model, and I’m parallelizing with Threads.@threads across different simulations of the model. Is it a bad idea to put @turbo inside of such a loop? Based on the examples below, it seems like they work fine together (although @tturbo does not).
julia> @time for n = 1:trials
           exp.(.-U);
       end
  2.707791 seconds (88.98 k allocations: 2.237 GiB, 5.96% gc time)

julia> @time for n = 1:trials
           @turbo exp.(.-U);
       end
  1.432839 seconds (88.98 k allocations: 2.237 GiB, 11.88% gc time)

julia> @time Threads.@threads for n = 1:trials
           @turbo exp.(.-U);
       end
  0.685653 seconds (76.77 k allocations: 2.237 GiB, 27.38% gc time, 2.34% compilation time)
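One way to attack that 27% GC time (an editorial sketch, not from the thread; run_trials! is an illustrative name): give each thread its own preallocated scratch vector and broadcast in place, so the threaded loop stops allocating. @threads :static pins iterations to threads, which makes indexing by threadid() safe here:

using LoopVectorization, Base.Threads

function run_trials!(buffers, U, trials)
    @threads :static for n in 1:trials
        V = buffers[threadid()]   # per-thread scratch vector
        @turbo @. V = exp(-U)     # in place: no allocation per trial
    end
end

buffers = [similar(U) for _ in 1:nthreads()]
run_trials!(buffers, U, 10^4)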
And a follow-up question—does this also mean that the apparent advantage of Matlab might disappear in a less minimal setting?
Really, there’s no magic here that allows Matlab to compute exp faster than anyone else. It’s mainly a question of whether you are using SIMD (enabled by @turbo) and threads (enabled by Threads.@threads or @tturbo). Matlab pours a lot of engineering effort into making elementary functions like exp(vector) fast, but you can almost always beat them when doing nontrivial combinations of operations. As I wrote in my blog post:
There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.
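A concrete instance of that trade-off (a sketch added here, using sum(exp(-u)) over the vector as a stand-in for a nontrivial combination):

using LoopVectorization

# Composing general-purpose vectorized functions: correct, but it
# allocates a temporary vector and makes two passes over the data.
general(U) = sum(exp.(.-U))

# A kernel specialized to this exact combination: one pass, zero
# allocations, with the exp SIMD-vectorized by @turbo.
function specialized(U)
    s = zero(eltype(U))
    @turbo for i in eachindex(U)
        s += exp(-U[i])
    end
    return s
end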
Note that there is the @. macro to help you with that when fusing many broadcasted operations:
julia> @time for n = 1:trials   # missing a dot
           V = Base.Math.exp_fast.(-U);
       end
  1.559431 seconds (88.98 k allocations: 4.472 GiB, 1.76% gc time)

julia> @time for n = 1:trials   # all dots
           V = Base.Math.exp_fast.(.-U);
       end
  1.353012 seconds (88.98 k allocations: 2.237 GiB, 0.72% gc time)

julia> @time for n = 1:trials   # with the @. macro, and no dot
           V = @. Base.Math.exp_fast(-U);
       end
  1.359240 seconds (88.98 k allocations: 2.237 GiB, 0.99% gc time)
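One last editorial sketch, beyond the transcript above: all three variants still allocate a fresh V each iteration (hence the ~2.2 GiB totals). Preallocating once and broadcasting in place brings the loop itself to zero allocations:

V = similar(U)                      # allocate the output once, outside the loop
for n = 1:trials
    @. V = Base.Math.exp_fast(-U)   # @. with an assignment writes into V in place
end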