Performance of Float32 exponential

@LaurentPlagne and I recently noticed the following surprising performance results:

julia> using BenchmarkTools
julia> N = 100_000
julia> x32 = rand(Float32, N);
julia> x64 = Float64.(x32);

julia> sumExpStd(x) = sum(exp, x)
sumExpStd (generic function with 1 method)
julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PROJECT = @.

julia> @btime sumExpStd($x64)
  1.141 ms (0 allocations: 0 bytes)
171535.09093746456

julia> @btime sumExpStd($x32)
  983.589 μs (0 allocations: 0 bytes)
171535.1f0

We’d have expected the Float32 exponential to be significantly faster than its Float64 counterpart.

With julia-1.3.0, the observation is mostly the same:

julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PROJECT = @.

julia> @btime sumExpStd($x64)
  1.122 ms (0 allocations: 0 bytes)
171798.45158585

julia> @btime sumExpStd($x32)
  1.010 ms (0 allocations: 0 bytes)
171798.47f0

And indeed, when changing the implementation of exp from Base.exp to SLEEF.exp, we observe significantly lower computing times for Float32 exponentials, while Float64 computing times remain in the same ballpark:

julia> import SLEEF

julia> sumExpSLEEF(x) = sum(SLEEF.exp, x)
sumExpSLEEF (generic function with 1 method)
julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PROJECT = @.

julia> @btime sumExpSLEEF($x64)
  1.145 ms (0 allocations: 0 bytes)
171535.09093746456

julia> @btime sumExpSLEEF($x32)
  205.272 μs (0 allocations: 0 bytes)
171535.08f0

Something also seems to have changed in v1.3.0 here, but Float32 exponentials remain faster than their Float64 counterparts:

julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PROJECT = @.

julia> @btime sumExpSLEEF($x64)
  1.145 ms (0 allocations: 0 bytes)
171798.45158585

julia> @btime sumExpSLEEF($x32)
  868.180 μs (0 allocations: 0 bytes)
171798.47f0

Does anyone know what is happening here? Is it a performance “bug”, in your opinion?

Julia’s exp does not emit SIMD instructions; when you use SLEEF, SIMD is probably used, making it much faster. Twice as many Float32 values fit into a SIMD register as Float64 values.

See some discussion in this answer and thread


Thanks for your answer. This might very well have to do with vectorization, but there remain a few questions:

  • if we consider the Base.exp-based versions to not be vectorized (neither in Float32 nor in Float64), then I would still expect the Float32 version to be faster (not because a SIMD register holds twice as many Float32 values as Float64 values, but because it is “simpler” to compute the exponential of a Float32 number);

  • if we think that the SLEEF versions are vectorized, then we would expect a speedup between the scalar Base.exp(Float64) computation and the vectorized SLEEF.exp(Float64) version, which the measurements above do not show.

Looking at the native code of both the Base- and SLEEF-based versions, I think neither of them is vectorized (although I agree that the SLEEF-based version could be, using the techniques mentioned in the thread linked above):

open("std.out", "w") do f
    redirect_stdout(f) do
        @code_native Base._mapreduce(exp, +, IndexLinear(), x64)
    end
end

open("SLEEF.out", "w") do f
    redirect_stdout(f) do
        @code_native Base._mapreduce(SLEEF.exp, +, IndexLinear(), x64)
    end
end

There are only marginal differences between the two implementations, and I would think that they differ only in the way they compute the exponential. Hence my conclusion that Base.exp(::Float32) is slower than it could/should be. And this might very well be related to whether/how SIMD instructions are used within the implementation of exp, but that is another story…

By the way, a small benchmark in C++ (calling the GNU libm, and without any vectorization) also tends to show that the computation of a single exponential should be faster for single-precision numbers than for double-precision ones:

sh> # double precision
sh> clang++ -O3 sumExp.cxx -o sumExp && ./sumExp 100000
17180511.2225997
715.39 µs/call
sh> # single precision
sh> clang++ -O3 sumExp.cxx -o sumExp && ./sumExp 100000
16964172
538.53 µs/call
complete C++ code
#include <iostream>
#include <sstream>
#include <vector>
#include <cmath>
#include <cstdlib>   // rand(), RAND_MAX
#include <iomanip>
#include <sys/time.h>


int main(int argc, char **argv) {
  //typedef double T;
  typedef float T;

  int N;
  std::istringstream iss(argv[1]);
  iss >> N;

  std::vector<T> v(N);
  for (size_t i=0 ; i<N ; ++i) {
    v[i] = T(rand())/RAND_MAX;
  }

  struct timeval beg; gettimeofday(&beg, NULL);

  T acc(0);
  for (int k=0 ; k<100 ; ++k) {
    for (size_t i=0 ; i<N ; ++i) {
      acc += std::exp(v[i]);
    }
  }

  struct timeval end; gettimeofday(&end, NULL);
  auto duration = 1e6*(end.tv_sec-beg.tv_sec) + (end.tv_usec-beg.tv_usec);


  std::cout << std::setprecision(15) << acc << std::endl;
  std::cout << duration/100 << " µs/call" << std::endl;
}

The implementation of exp in Julia is described in this PR (https://github.com/JuliaLang/julia/pull/19831): it was ported from the openlibm C library, and the pure-Julia implementation was found to be at least as fast as the well-optimized C code.

If you look at the implementation, you’ll see that the Float32 algorithm is only marginally simpler than the Float64 algorithm — there is one place (exp_kernel) where Float32 can use a lower-degree polynomial expansion than Float64, but most of the computation is occupied by argument reductions and other logic that is identical for the two precisions. (Of course, it’s possible that there is a better exp algorithm than the one used in openlibm.)

(Of course, Float32 arithmetic is in some sense intrinsically simpler than Float64. So you might as well argue that an operation like * or + should be faster in single precision. But the CPU designers compensate for this by allocating enough chip area to the Float64 operations in order to make them about the same speed … unless you go to SIMD operations, in which case Float32 is faster because the same register width can perform more parallel operations. Operating on Float32 data is also faster because twice as much of it fits in cache and cache lines, which may affect your sum(exp, data) benchmark.)
