In my current project, the rate-limiting step of a simulation model is evaluating the element-wise exponential of a vector with about N = 30,000 elements:

# Julia MWE
using Random
trials = 10^4 # number of exponentials
N = 3*10^4 # size of vector
U = 2e4/50 * rand(N) # example values whose exponentials I need
t0 = time()
for n = 1:trials
V = Base.Math.exp_fast.(-U);
end
t1 = time()
println(t1 - t0)

This takes about 3.5s on my machine, which is not bad. However, the same thing takes only 0.7s in Matlab:

% MATLAB MWE
trials = 10^4;
N = 3*10^4;
U = 2e4/50 * rand(N,1);
tic
for n = 1:trials
V = exp(-U);
end
toc

I'm reluctant to translate my code from Julia to Matlab just for a faster element-wise exponential. However, I'm out of ideas for speeding up the Julia code, and given the number of simulations I'm running, that 5x speed-up is pretty compelling.

Any suggestions?

ADM

P.S. versioninfo() output for Julia is below, and please let me know if any additional information would help.

julia> @time for n = 1:trials
V = Base.Math.exp_fast.(-U);
end
1.313019 seconds (88.98 k allocations: 4.472 GiB, 5.01% gc time)
julia> @time for n = 1:trials
V = Base.Math.exp_fast.(.-U);
end
1.186579 seconds (178.58 k allocations: 2.243 GiB, 2.89% gc time, 2.46% compilation time)
julia> @time for n = 1:trials
V = @turbo exp.(.-U);
end
0.256981 seconds (88.98 k allocations: 2.237 GiB, 7.41% gc time)

Around a 5x speedup, but note that I have AVX512.

julia> versioninfo()
Julia Version 1.10.0-DEV.250
Commit 516b097bd7 (2022-12-31 17:57 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)

Matlab uses multiple threads by default for vector operations. Julia does not.

(You can tell Julia to use multiple threads, e.g. with LoopVectorization's @tturbo macro and starting Julia with threads, but it's usually going to be more efficient to parallelize at a coarser level: parallelize loops that do more work in the loop body than a single elementary "vectorized" operation. For that matter, the traditional "vectorized" style of Matlab leaves a lot of performance on the table, because it involves allocating lots of temporary vectors and doing too little work per iteration.)
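For concreteness, here is a minimal sketch of what "parallelize at a coarser level" might look like; `run_simulation` and `run_all` are hypothetical names, standing in for your actual model body:

```julia
# Hypothetical stand-in for one full model run; in practice this would be
# the whole simulation, not just a single elementary vectorized operation.
run_simulation(U) = sum(exp.(.-U))

# One thread task per entire simulation, rather than threading inside exp.
function run_all(trials, N)
    results = Vector{Float64}(undef, trials)
    Threads.@threads for n in 1:trials
        U = 400 .* rand(N)            # per-trial inputs, as in the MWE above
        results[n] = run_simulation(U)
    end
    return results
end

results = run_all(100, 3 * 10^4)
```

Each iteration does enough work to amortize the threading overhead, which is what makes this coarser-grained approach pay off.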

In particular, if I start Julia with julia -t 6 (to tell it to use all 6 of my cores), I get:

julia> using LoopVectorization, BenchmarkTools
julia> U = rand(3*10^4);
julia> @btime @. exp(-$U);
97.625 μs (2 allocations: 234.42 KiB)
julia> @btime @turbo @. exp(-$U); # single-threaded, but SIMD accelerated
32.583 μs (2 allocations: 234.42 KiB)
julia> Threads.nthreads() # check that we have 6 threads
6
julia> @btime @tturbo @. exp(-$U); # multi-threaded
7.208 μs (2 allocations: 234.42 KiB)

More than 10x speedup by using multithreading. But again, focusing on individual "elementary" vectorized functions like this is the wrong thing to do if you want to take full advantage of Julia.

Note that I'm using the wonderful BenchmarkTools.jl package's @btime macro, which automatically runs the function many times in a loop to get reliable timing statistics. The $U is used by @btime to eliminate the performance overhead of global variables; in actual code you would not have the $, but would also hopefully be writing code in functions rather than using globals.
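To illustrate that last point, here is a minimal sketch (the function name `expneg!` is my own) of putting the hot loop in a function and reusing a preallocated output buffer, so the fused broadcast runs in place with no per-iteration allocation:

```julia
# In-place version: V is preallocated once and overwritten each iteration,
# so the loop body allocates no temporary vectors at all.
function expneg!(V, U, trials)
    for _ in 1:trials
        @. V = exp(-U)   # fused broadcast writing directly into V
    end
    return V
end

U = rand(3 * 10^4)
V = similar(U)       # allocate the output buffer once, outside the loop
expneg!(V, U, 10)
```

Because everything lives inside a function, Julia can specialize and optimize the loop, which it cannot do for untyped globals.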

Amazing that changing -U to .-U makes such a big difference. I wonder where else that's lurking in my code…

Regarding @turbo and @stevengj's points, in practice this exponential is just one (particularly expensive) part of the model, and I'm parallelizing with Threads.@threads across different simulations of the model. Is it a bad idea to put @turbo inside such a loop? Based on the examples below, it seems like they work fine together (although @tturbo does not).

julia> @time for n = 1:trials
exp.(.-U);
end
2.707791 seconds (88.98 k allocations: 2.237 GiB, 5.96% gc time)
julia> @time for n = 1:trials
@turbo exp.(.-U);
end
1.432839 seconds (88.98 k allocations: 2.237 GiB, 11.88% gc time)
julia> @time Threads.@threads for n = 1:trials
@turbo exp.(.-U);
end
0.685653 seconds (76.77 k allocations: 2.237 GiB, 27.38% gc time, 2.34% compilation time)

And a follow-up question: does this also mean that the apparent advantage of Matlab might disappear in a less minimal setting?

Really, there's no magic here that allows Matlab to compute exp faster than anyone else. It's mainly a question of whether you are using SIMD (enabled by @turbo) and threads (enabled by Threads.@threads or @tturbo). Matlab pours a lot of engineering effort into making elementary functions like exp(vector) fast, but you can almost always beat them when doing nontrivial combinations of operations. As I wrote in my blog post:

There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.
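As an illustration of that tension (my own example, not from the thread): summing exp.(-U) by first materializing the vector reuses the optimized broadcast machinery but pays for a temporary array, while a loop hand-fused for this particular reduction avoids the allocation entirely:

```julia
# Staged style: reuse the general-purpose broadcast, at the cost of
# allocating the intermediate vector V.
function staged(U)
    V = exp.(.-U)
    return sum(V)
end

# Fused style: specialize to this reduction; one pass, zero allocations.
function fused(U)
    s = 0.0
    @inbounds @simd for u in U
        s += exp(-u)
    end
    return s
end

U = rand(10^4)
staged(U) ≈ fused(U)   # same result (up to floating-point reassociation)
```

The two agree numerically, but the fused version does a single pass over the data with no temporary, which is exactly the kind of specialization a general-purpose vectorized library cannot do for you.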

Note that there is the @. macro to help you with that, fusing many broadcasted operations:

julia> @time for n = 1:trials # missing a dot
V = Base.Math.exp_fast.(-U);
end
1.559431 seconds (88.98 k allocations: 4.472 GiB, 1.76% gc time)
julia> @time for n = 1:trials # all dots
V = Base.Math.exp_fast.(.-U);
end
1.353012 seconds (88.98 k allocations: 2.237 GiB, 0.72% gc time)
julia> @time for n = 1:trials # with the @. macro, and no dot
V = @. Base.Math.exp_fast(-U);
end
1.359240 seconds (88.98 k allocations: 2.237 GiB, 0.99% gc time)