# Exponential of a "large" vector is faster in Matlab

Hi Julia folks,

In my current project, the rate-limiting step of a simulation model is evaluating the element-wise exponential of a vector with about N = 30,000 elements:

```julia
# Julia MWE
using Random
trials = 10^4  # number of exponentials
N = 3*10^4     # size of vector

U = 2e4/50 * rand(N)  # example values whose exponentials I need
t0 = time()
for n = 1:trials
    V = Base.Math.exp_fast.(-U);
end
t1 = time()
println(t1 - t0)
```

This takes about 3.5s on my machine, which is not bad. However, the same thing takes only 0.7s in Matlab:

```matlab
% MATLAB MWE
trials = 10^4;
N = 3*10^4;

U = 2e4/50 * rand(N,1);
tic
for n = 1:trials
    V = exp(-U);
end
toc
```

I'm reluctant to translate my code from Julia to Matlab just for a faster element-wise exponential. However, I'm out of ideas for speeding up the Julia code, and given the number of simulations I'm running, that 5x speed-up is pretty compelling.

Any suggestions?

P.S. `versioninfo()` output for Julia is below, and please let me know if any additional information would help.

```
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR =
```
```julia
using LoopVectorization

@turbo @. exp(-U)
```

It is important to also add a `.` to the `-U`.
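A minimal sketch of why the extra dot matters: without it, `-U` materializes a full temporary vector before the `exp` broadcast, while `.-U` fuses the negation into the same loop:

```julia
# Sketch: broadcast fusion with the extra dot.
U = rand(10)
V1 = exp.(-U)   # allocates a temporary for -U, then the result
V2 = exp.(.-U)  # one fused pass: negation and exp in a single loop
```

The results are identical; only the number of intermediate allocations differs.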

```julia
julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(-U);
       end
  1.313019 seconds (88.98 k allocations: 4.472 GiB, 5.01% gc time)

julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(.-U);
       end
  1.186579 seconds (178.58 k allocations: 2.243 GiB, 2.89% gc time, 2.46% compilation time)

julia> @time for n = 1:trials
           V = @turbo exp.(.-U);
       end
  0.256981 seconds (88.98 k allocations: 2.237 GiB, 7.41% gc time)
```

Around a 5x speedup, but note that I have AVX512:

```julia
julia> versioninfo()
Julia Version 1.10.0-DEV.250
Commit 516b097bd7 (2022-12-31 17:57 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
```

and the expected speedup is smaller without it.


Matlab uses multiple threads by default for vector operations. Julia does not.

(You can tell Julia to use multiple threads, e.g. with LoopVectorization's `@tturbo` macro and starting Julia with threads, but it is usually more efficient to parallelize at a coarser level: parallelize loops that do more work in the loop body than a single elementary "vectorized" operation. For that matter, the traditional "vectorized" style of Matlab leaves a lot of performance on the table, because it involves allocating lots of temporary vectors and doing too little work per iteration.)
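As a rough sketch of that coarser style (the kernel and its body here are hypothetical, not from this thread), a single hand-written loop can combine several elementwise operations that, in the vectorized style, would each allocate a temporary:

```julia
# Hypothetical fused kernel: the vectorized equivalent
# exp.(-U) .* (1 .+ U) allocates intermediate arrays, while
# this loop does all the per-element work in one pass.
function fused_kernel!(V, U)
    @inbounds for i in eachindex(U, V)
        V[i] = exp(-U[i]) * (1 + U[i])
    end
    return V
end

U = rand(100)
V = fused_kernel!(similar(U), U)
```

A loop like this is also a natural unit to hand to a thread: each thread gets substantial work per iteration instead of one elementary operation.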

In particular, if I start Julia with `julia -t 6` (to tell it to use all 6 of my cores), I get:

```julia
julia> using LoopVectorization, BenchmarkTools

julia> U = rand(3*10^4);

julia> @btime @. exp(-$U);
  97.625 μs (2 allocations: 234.42 KiB)

julia> @btime @turbo @. exp(-$U);   # single-threaded, but SIMD accelerated
  32.583 μs (2 allocations: 234.42 KiB)

julia> @btime @tturbo @. exp(-$U);  # multi-threaded
  7.208 μs (2 allocations: 234.42 KiB)
```

More than 10x speedup by using multithreading. But again, focusing on individual "elementary" vectorized functions like this is the wrong thing to do if you want to take full advantage of Julia.

Note that I'm using the wonderful BenchmarkTools.jl package's `@btime` macro, which automatically runs the function many times in a loop to get reliable timing statistics. The `$U` is used by `@btime` to eliminate the performance overhead of global variables; in actual code you would not have the `$`, but would also hopefully be writing code in functions rather than using globals.
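A small sketch of that function-based style (the name `f` is illustrative): define the kernel as a function and benchmark the function, rather than timing expressions on global variables:

```julia
# Sketch: the kernel lives in a function, so `U` is a fast local
# argument; with BenchmarkTools loaded you would then time it as
#     @btime f($U);
f(U) = exp.(.-U)

U = rand(3*10^4)
V = f(U)  # first call compiles; later calls run at full speed
```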


This is amazingly helpful, thanks everyone!

Amazing that changing `-U` to `.-U` makes such a big difference. I wonder where else that's lurking in my code…

Regarding `@turbo` and @stevengj's points, in practice this exponential is just one (particularly expensive) part of the model, and I'm parallelizing with `Threads.@threads` across different simulations of the model. Is it a bad idea to put `@turbo` inside of such a loop? Based on the examples below, it seems like they work fine together (although `@tturbo` does not).

```julia
julia> @time for n = 1:trials
           exp.(.-U);
       end
  2.707791 seconds (88.98 k allocations: 2.237 GiB, 5.96% gc time)

julia> @time for n = 1:trials
           @turbo exp.(.-U);
       end
  1.432839 seconds (88.98 k allocations: 2.237 GiB, 11.88% gc time)

julia> @time Threads.@threads for n = 1:trials
           @turbo exp.(.-U);
       end
  0.685653 seconds (76.77 k allocations: 2.237 GiB, 27.38% gc time, 2.34% compilation time)
```
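For what it's worth, a sketch of that combination (the per-simulation work shown is a stand-in): an outer `Threads.@threads` loop over independent simulations, each running the single-threaded `@turbo` kernel internally. This assumes Julia was started with threads, e.g. `julia -t 4`:

```julia
using LoopVectorization

# Hypothetical setup: one input vector per independent simulation,
# so the threaded iterations share no mutable state.
Us = [rand(10^4) for _ in 1:8]
results = zeros(length(Us))

Threads.@threads for k in eachindex(Us)
    # single-threaded SIMD within each thread's iteration
    results[k] = sum(@turbo exp.(.-Us[k]))
end
```

Keeping `@turbo` (not `@tturbo`) inside the threaded loop avoids nesting two thread pools.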

And a follow-up question: does this also mean that the apparent advantage of Matlab might disappear in a less minimal setting?


Yes.

Really, there's no magic here that allows Matlab to compute `exp` faster than anyone else. It's mainly a question of whether you are using SIMD (enabled by `@turbo`) and threads (enabled by `Base.@threads` or `@tturbo`). Matlab pours a lot of engineering effort into making elementary functions like `exp(vector)` fast, but you can almost always beat them when doing nontrivial combinations of operations. As I wrote in my blog post:

> There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.


Note that there is the `@.` macro to help you with that when fusing many broadcasted operations:

```julia
julia> @time for n = 1:trials  # missing a dot
           V = Base.Math.exp_fast.(-U);
       end
  1.559431 seconds (88.98 k allocations: 4.472 GiB, 1.76% gc time)

julia> @time for n = 1:trials  # all dots
           V = Base.Math.exp_fast.(.-U);
       end
  1.353012 seconds (88.98 k allocations: 2.237 GiB, 0.72% gc time)

julia> @time for n = 1:trials  # with the @. macro, and no dot
           V = @. Base.Math.exp_fast(-U);
       end
  1.359240 seconds (88.98 k allocations: 2.237 GiB, 0.99% gc time)
```
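One more small sketch: `@.` also fuses assignment into a preallocated output, so repeated evaluations inside a loop allocate nothing at all:

```julia
U = rand(3*10^4)
V = similar(U)
@. V = exp(-U)  # expands to V .= exp.(.-U): fused, in-place, allocation-free
```

Inside the timing loops above, this pattern would eliminate the ~2.2 GiB of per-run allocations entirely.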