Major performance boost when precaching random inputs to `exp`?

I’m writing a Monte Carlo simulation which requires exponentials of uniformly distributed random numbers. Calling exp takes a sizeable portion of my inner loop’s runtime, so I thought to cache pre-calculated values. I then found a significant performance difference between the following methods:

using Random, BenchmarkTools

function expfunc1!(vec)
    @. vec = exp(rand());
end

function expfunc2!(vec)
    rand!(vec);
    @. vec = exp(vec);
end

v = zeros(10000);
@btime expfunc1!(v) # <--- 156.500 μs
@btime expfunc2!(v) # <--- 52.700 μs

… so a 3x speedup. Note that while generating random numbers individually is indeed slower than batch generation,

@btime for i in $(eachindex(v)) rand() end # <--- 8.800 μs
@btime rand!(v) # <--- 2.678 μs

this is not enough to explain the major difference above.

Any ideas where the major performance difference might be coming from? I would hazard a guess that maybe the loop in expfunc1! isn’t being vectorized/unrolled, or that maybe expfunc2! has better cache locality?

Since my Monte Carlo simulation is not straightforwardly parallelizable, the original optimization idea I had in mind was to parallelize/multithread the precomputation of exp before continuing single-threaded. But such a 3x speedup for a trivial (I assume single-threaded?) precomputation is already quite significant!
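For concreteness, the precomputation idea could be sketched like this (a minimal sketch; `precompute_exp!` is a made-up name, and the chunking is left to `Threads.@threads`):

```julia
using Random

# Hypothetical sketch: fill a scratch buffer with exp-of-uniform samples in
# parallel, before the single-threaded Monte Carlo loop consumes them.
# rand() is safe here because each Julia task has its own Xoshiro RNG state.
function precompute_exp!(buf::AbstractVector{Float64})
    Threads.@threads for i in eachindex(buf)
        @inbounds buf[i] = exp(rand())
    end
    return buf
end

buf = Vector{Float64}(undef, 10_000)
precompute_exp!(buf)  # values lie in [1, e) since exp is applied to [0, 1)
```

Note that calling rand() per element inside the threaded loop gives up the bulk SIMD generation that rand! uses, so having each thread call rand! on its own slice may be faster still.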

Yeah, this is a somewhat accidental result: in 1.8, exp is getting automatically inlined, which allows loops over it to vectorize automatically. This wasn’t fully intentional, and other than enabling vectorization, it’s probably bad for exp to inline (it’s a lot of code).

@Oscar_Smith I should have clarified: the above was measured on 1.7.3. On 1.8, the runtime of expfunc2! drops by another factor of ~3x thanks to exp’s inlining and vectorization.

@btime expfunc1!(v);  # <-- 156.500 μs  (julia 1.7.3); 48.700 μs (julia 1.8-rc4)
@btime expfunc2!(v);  # <-- 52.800 μs  (julia 1.7.3); 15.200 μs (julia 1.8-rc4)

The same performance disparity between implementations occurs when calling another function, say log, that isn’t inlined. The numbers then remain essentially the same between 1.7 and 1.8.

(Separately, is there a way to manually force the inlining and vectorization of log or exp once its default inlining is patched out?)

Yes, it appears that from v1.8 you can use @inline at callsites: optimizer: supports callsite annotations of inlining, fixes #18773 by aviatesk Β· Pull Request #41328 Β· JuliaLang/julia Β· GitHub

Also, you can use LoopVectorization to get handwritten vectorized versions.

@DNF, @Oscar_Smith : Thanks! Trying @inline on this particular example (say with log) leads to inlining but no vectorization. But using LoopVectorization’s @turbo both inlines and vectorizes.
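For reference, here is a minimal sketch of the two variants just described (function names invented for illustration; the `@turbo` version assumes LoopVectorization is installed):

```julia
using LoopVectorization  # provides @turbo

# Callsite inlining (Julia >= 1.8): forces log to inline at this call,
# though the surrounding loop may still not auto-vectorize.
function logmap_inline!(vec)
    for i in eachindex(vec)
        @inbounds vec[i] = @inline log(vec[i])
    end
    return vec
end

# @turbo substitutes LoopVectorization's SIMD-friendly log and
# vectorizes the whole loop.
function logmap_turbo!(vec)
    @turbo for i in eachindex(vec)
        vec[i] = log(vec[i])
    end
    return vec
end
```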

Okay - so back to the original question: Any idea why expfunc1! is so much slower than expfunc2!? My naive expectation would have been that since expfunc2! needs to scan the array twice, it requires more memory accesses, and potentially leads to more cache misses. Either that’s not the case, or it is shadowed by a bigger gain.

The performance difference between expfunc1! and expfunc2! arises from the vectorization of random number generation, which currently only occurs in calls to rand!. This can be demonstrated by calling @code_typed on each of the functions. Or, to remove the extraneous details introduced by exp, see below.

julia> function randeach!(x)
           @inbounds for i ∈ eachindex(x)
               x[i] = rand()
           end
           x
       end;

Then inspect it with @code_typed:

julia> @code_typed randeach!(v)
CodeInfo(
1 ── %1  = Base.arraysize(x, 1)::Int64
│    %2  = Base.slt_int(%1, 0)::Bool
│    %3  = Core.ifelse::typeof(Core.ifelse)
│    %4  = (%3)(%2, 0, %1)::Int64
│    %5  = Base.slt_int(%4, 1)::Bool
└───       goto #3 if not %5
2 ──       goto #4
3 ──       goto #4
4 ┄─ %9  = φ (#2 => true, #3 => false)::Bool
│    %10 = φ (#3 => 1)::Int64
│    %11 = φ (#3 => 1)::Int64
│    %12 = Base.not_int(%9)::Bool
└───       goto #10 if not %12
5 ┄─ %14 = φ (#4 => %10, #9 => %66)::Int64
│    %15 = φ (#4 => %11, #9 => %67)::Int64
│    %16 = $(Expr(:foreigncall, :(:jl_get_current_task), Ref{Task}, svec(), 0, :(:ccall)))::Task
│    %17 = Base.getfield(%16, :rngState0)::UInt64
│    %18 = Base.getfield(%16, :rngState1)::UInt64
│    %19 = Base.getfield(%16, :rngState2)::UInt64
│    %20 = Base.getfield(%16, :rngState3)::UInt64
│    %21 = Base.add_int(%17, %20)::UInt64
│    %22 = Base.shl_int(%21, 0x0000000000000017)::UInt64
│    %23 = Base.lshr_int(%21, 0xffffffffffffffe9)::UInt64
│    %24 = Core.ifelse::typeof(Core.ifelse)
│    %25 = (%24)(true, %22, %23)::UInt64
│    %26 = Base.lshr_int(%21, 0x0000000000000029)::UInt64
│    %27 = Base.shl_int(%21, 0xffffffffffffffd7)::UInt64
│    %28 = Core.ifelse::typeof(Core.ifelse)
│    %29 = (%28)(true, %26, %27)::UInt64
│    %30 = Base.or_int(%25, %29)::UInt64
│    %31 = Base.add_int(%30, %17)::UInt64
│    %32 = Base.shl_int(%18, 0x0000000000000011)::UInt64
│    %33 = Base.lshr_int(%18, 0xffffffffffffffef)::UInt64
│    %34 = Core.ifelse::typeof(Core.ifelse)
│    %35 = (%34)(true, %32, %33)::UInt64
│    %36 = Base.xor_int(%19, %17)::UInt64
│    %37 = Base.xor_int(%20, %18)::UInt64
│    %38 = Base.xor_int(%18, %36)::UInt64
│    %39 = Base.xor_int(%17, %37)::UInt64
│    %40 = Base.xor_int(%36, %35)::UInt64
│    %41 = Base.shl_int(%37, 0x000000000000002d)::UInt64
│    %42 = Base.lshr_int(%37, 0xffffffffffffffd3)::UInt64
│    %43 = Core.ifelse::typeof(Core.ifelse)
│    %44 = (%43)(true, %41, %42)::UInt64
│    %45 = Base.lshr_int(%37, 0x0000000000000013)::UInt64
│    %46 = Base.shl_int(%37, 0xffffffffffffffed)::UInt64
│    %47 = Core.ifelse::typeof(Core.ifelse)
│    %48 = (%47)(true, %45, %46)::UInt64
│    %49 = Base.or_int(%44, %48)::UInt64
│          Base.setfield!(%16, :rngState0, %39)::UInt64
│          Base.setfield!(%16, :rngState1, %38)::UInt64
│          Base.setfield!(%16, :rngState2, %40)::UInt64
│          Base.setfield!(%16, :rngState3, %49)::UInt64
│    %54 = Base.lshr_int(%31, 0x000000000000000b)::UInt64
│    %55 = Base.shl_int(%31, 0xfffffffffffffff5)::UInt64
│    %56 = Core.ifelse::typeof(Core.ifelse)
│    %57 = (%56)(true, %54, %55)::UInt64
│    %58 = Base.uitofp(Float64, %57)::Float64
│    %59 = Base.mul_float(%58, 1.1102230246251565e-16)::Float64
│          Base.arrayset(false, x, %59, %14)::Vector{Float64}
│    %61 = (%15 === %4)::Bool
└───       goto #7 if not %61
6 ──       goto #8
7 ── %64 = Base.add_int(%15, 1)::Int64
└───       goto #8
8 ┄─ %66 = φ (#7 => %64)::Int64
│    %67 = φ (#7 => %64)::Int64
│    %68 = φ (#6 => true, #7 => false)::Bool
│    %69 = Base.not_int(%68)::Bool
└───       goto #10 if not %69
9 ──       goto #5
10 ┄       return x
) => Vector{Float64}

And compare with rand!:

julia> @code_typed rand!(v)
CodeInfo(
1 ─ %1  = $(Expr(:gc_preserve_begin, Core.Argument(2)))
│   %2  = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(2)))::Ptr{Float64}
│   %3  = Base.bitcast(Ptr{UInt8}, %2)::Ptr{UInt8}
│   %4  = Base.arraylen(A)::Int64
│   %5  = Base.mul_int(%4, 8)::Int64
│   %6  = Random.XoshiroSimd.Float64::Type{Float64}
│   %7  = Random.XoshiroSimd._bits2float::typeof(Random.XoshiroSimd._bits2float)
│   %8  = Base.sle_int(64, %5)::Bool
└──       goto #3 if not %8
2 ─ %10 = invoke Random.XoshiroSimd.xoshiro_bulk_simd($(QuoteNode(TaskLocalRNG()))::TaskLocalRNG, %3::Ptr{UInt8}, %5::Int64, %6::Type{Float64}, $(QuoteNode(Val{8}()))::Val{8}, %7::typeof(Random.XoshiroSimd._bits2float))::Int64
│   %11 = Base.sub_int(%5, %10)::Int64
│   %12 = Core.bitcast(Core.UInt, %3)::UInt64
│   %13 = Base.bitcast(UInt64, %10)::UInt64
│   %14 = Base.add_ptr(%12, %13)::UInt64
└── %15 = Core.bitcast(Ptr{UInt8}, %14)::Ptr{UInt8}
3 ┄ %16 = φ (#2 => %15, #1 => %3)::Ptr{UInt8}
│   %17 = φ (#2 => %11, #1 => %5)::Int64
│   %18 = (%17 === 0)::Bool
│   %19 = Base.not_int(%18)::Bool
└──       goto #5 if not %19
4 ─       invoke Random.XoshiroSimd.xoshiro_bulk_nosimd($(QuoteNode(TaskLocalRNG()))::TaskLocalRNG, %16::Ptr{UInt8}, %17::Int64, %6::Type{Float64}, %7::typeof(Random.XoshiroSimd._bits2float))::Any
5 ┄       goto #6
6 ─       $(Expr(:gc_preserve_end, :(%1)))
└──       goto #7
7 ─       goto #8
8 ─       goto #9
9 ─       return A
) => Vector{Float64}


@AndrewRadcliffe that makes sense. Benchmarking looping over rand() versus rand!(v) indeed yields a similar 3x performance boost. Yet this amounts to only a ~6 μs difference (see my original post), so I don’t yet understand why it is magnified to tens of μs when exp is introduced?

Changing exp to log also causes a similarly large performance difference, it seems.

function expfunc3!(vec)
    @. vec = log(rand());
end

function expfunc4!(vec)
    rand!(vec);
    @. vec = log(vec);
end

v = zeros(10000);
@btime expfunc3!(v)
# 236.399 μs (0 allocations: 0 bytes)

v = zeros(10000);
@btime expfunc4!(v)
# 95.041 μs (0 allocations: 0 bytes)
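Given the explanation above, the two optimizations should compose: bulk SIMD generation via rand!, then a SIMD log via @turbo. An untested sketch (name invented for illustration; assumes LoopVectorization is installed):

```julia
using Random, LoopVectorization

# Hedged sketch: combine bulk SIMD random generation (rand!) with
# LoopVectorization's SIMD log, so both passes over the array vectorize.
function logfunc_combined!(vec)
    rand!(vec)            # vectorized bulk generation (XoshiroSimd)
    @turbo for i in eachindex(vec)
        vec[i] = log(vec[i])
    end
    return vec
end
```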