Exponential of a "large" vector is faster in Matlab

Hi Julia folks,

In my current project, the rate-limiting step of a simulation model is evaluating the element-wise exponential of a vector with about N = 30,000 elements:

# Julia MWE 
using Random 
trials = 10^4 # number of exponentials
N = 3*10^4 # size of vector

U = 2e4/50 * rand(N) # example values whose exponentials I need
t0 = time()
for n = 1:trials
    V = Base.Math.exp_fast.(-U);
end
t1 = time()
println(t1 - t0)

This takes about 3.5s on my machine, which is not bad. However, the same thing takes only 0.7s in Matlab:

% MATLAB MWE
trials = 10^4;
N = 3*10^4;

U = 2e4/50 * rand(N,1);
tic  
for n = 1:trials
    V = exp(-U);
end
toc

I’m reluctant to translate my code from Julia to Matlab just for a faster element-wise exponential. However, I’m out of ideas to speed up the Julia code, and given the number of simulations I’m running, that 5x speed-up is pretty compelling.

Any suggestions?

ADM

P.S. versioninfo() output for Julia is below, and please let me know if any additional information would help.

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = 
  JULIA_NUM_THREADS = 4
using LoopVectorization

@turbo @. exp(-U) 

It is important to also add a . to the -U.

julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(-U);
       end
  1.313019 seconds (88.98 k allocations: 4.472 GiB, 5.01% gc time)

julia> @time for n = 1:trials
           V = Base.Math.exp_fast.(.-U);
       end
  1.186579 seconds (178.58 k allocations: 2.243 GiB, 2.89% gc time, 2.46% compilation time)

julia> @time for n = 1:trials
           V = @turbo exp.(.-U);
       end
  0.256981 seconds (88.98 k allocations: 2.237 GiB, 7.41% gc time)

Around a 5x speedup, but I have AVX512

julia> versioninfo()
Julia Version 1.10.0-DEV.250
Commit 516b097bd7 (2022-12-31 17:57 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)

and the expected speedup is smaller without it.

2 Likes

Matlab uses multiple threads by default for vector operations. Julia does not.

(You can tell Julia to use multiple threads, e.g. with LoopVectorization’s @tturbo macro and starting Julia with threads, but it usually going to be more efficient to parallelize at a coarser level: parallelize loops that do more work in the loop body than a single elementary “vectorized” operation. For that matter, the traditional “vectorized” style of Matlab leaves a lot of performance on the table, because it involves allocating lots of temporary vectors and doing too little work per iteration.)

In particular, if I start Julia with julia -t 6 (to tell it to use all 6 of my cores), I get:

julia> using LoopVectorization, BenchmarkTools

julia> U = rand(3*10^4);

julia> @btime @. exp(-$U);
  97.625 μs (2 allocations: 234.42 KiB)

julia> @btime @turbo @. exp(-$U);            # single-threaded, but SIMD accelerated
  32.583 μs (2 allocations: 234.42 KiB)

julia> Threads.nthreads()    # check that we have 6 threads
6

julia> @btime @tturbo @. exp(-$U);           # multi-threaded
  7.208 μs (2 allocations: 234.42 KiB)

More than 10x speedup by using multithreading. But again, focusing on individual “elementary” vectorized functions like this is the wrong thing to do if you want to take full advantage of Julia.

Note that I’m using the wonderful BenchmarkTools.jl package’s @btime macro, which automatically runs the function many times in a loop to get reliable timing statistics. The $U is used by @btime to eliminate the performance overhead of global variables; in actual code you would not have the $, but would also hopefully be writing code in functions rather than using globals.

11 Likes

This is amazingly helpful, thanks everyone!

Amazing that changing -U to .-U makes such a big difference. I wonder where else that’s lurking in my code…

Regarding @turbo and @stevengj’s points, in practice this exponential is just one (particularly expensive) part of the model, and I’m parallelizing with Threads.@threads across different simulations of the model. Is it a bad idea to put @turbo inside of such a loop? Based on the examples below, it seems like they work fine together (although @tturbo does not).

julia> @time for n = 1:trials
           exp.(.-U);
       end
  2.707791 seconds (88.98 k allocations: 2.237 GiB, 5.96% gc time)

julia> @time for n = 1:trials
           @turbo exp.(.-U);
       end
  1.432839 seconds (88.98 k allocations: 2.237 GiB, 11.88% gc time)

julia> @time Threads.@threads for n = 1:trials
           @turbo exp.(.-U);
       end
  0.685653 seconds (76.77 k allocations: 2.237 GiB, 27.38% gc time, 2.34% compilation time)

And a follow-up question—does this also mean that the apparent advantage of Matlab might disappear in a less minimal setting?

1 Like

Yes.

Really, there’s no magic here that allows Matlab to compute exp faster than anyone else. It’s mainly a question whether you are using SIMD (enabled by @turbo) and threads (enabled by Base.@threads or @tturbo). Matlab pours a lot of engineering effort into making elementary functions like exp(vector) fast, but you can almost always beat them when doing nontrivial combinations of operations. As I wrote in my blog post:

There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.

3 Likes

Note there there is the @. macro to help you with that, when fusing many broadcasted operations:

julia> @time for n = 1:trials # missing a dot
           V = Base.Math.exp_fast.(-U);
       end
  1.559431 seconds (88.98 k allocations: 4.472 GiB, 1.76% gc time)

julia> @time for n = 1:trials # all dots
           V = Base.Math.exp_fast.(.-U);
       end
  1.353012 seconds (88.98 k allocations: 2.237 GiB, 0.72% gc time)

julia> @time for n = 1:trials # with the @. macro, and no dot
           V = @. Base.Math.exp_fast(-U);
       end
  1.359240 seconds (88.98 k allocations: 2.237 GiB, 0.99% gc time)
1 Like