Major performance boost when precaching random inputs to `exp`?

I’m writing a Monte Carlo simulation which requires exponentials of uniformly distributed random numbers. Calling exp takes a sizeable portion of my inner loop’s runtime, so I thought to cache pre-calculated values. I then found a significant performance difference between the following methods:

using Random, BenchmarkTools

function expfunc1!(vec)
    @. vec = exp(rand());
end

function expfunc2!(vec)
    rand!(vec);
    @. vec = exp(vec);
end

v = zeros(10000);
@btime expfunc1!(v) # <--- 156.500 μs
@btime expfunc2!(v) # <--- 52.700 μs

… so a 3x speedup. Note that while generating random numbers individually is indeed slower than batch generation,

@btime for i in $(eachindex(v)) rand() end # <--- 8.800 μs
@btime rand!(v) # <--- 2.678 μs

this is not enough to explain the major difference above.

Any ideas where the major performance difference might be coming from? I would hazard a guess that maybe the loop in expfunc1! isn’t being vectorized/unrolled, or that maybe expfunc2! has better cache locality?

Since my Monte Carlo simulation is not straightforwardly parallelizable, the original optimization idea I had in mind was to parallelize/multithread the precomputation of exp before continuing single-threaded. But such a 3x speedup for a trivial (I assume single-threaded?) precomputation is already quite significant!
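For concreteness, the precomputation idea could be sketched like this (a minimal sketch; `precompute_exp!` is a made-up name, and the chunking is left to `Threads.@threads`):

```julia
using Random

# Hypothetical sketch: fill a scratch buffer with exp-of-uniform samples in
# parallel, before the single-threaded Monte Carlo loop consumes them.
# rand() is safe here because each Julia task has its own Xoshiro RNG state.
function precompute_exp!(buf::AbstractVector{Float64})
    Threads.@threads for i in eachindex(buf)
        @inbounds buf[i] = exp(rand())
    end
    return buf
end

buf = Vector{Float64}(undef, 10_000)
precompute_exp!(buf)  # values lie in [1, e) since exp is applied to [0, 1)
```

Note that calling rand() per element inside the threaded loop gives up the bulk SIMD generation that rand! uses, so having each thread call rand! on its own slice may be faster still.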

Yeah, this is a somewhat accidental result: in 1.8, exp is getting automatically inlined, which allows loops over it to vectorize automatically. This wasn’t fully intentional, and other than enabling vectorization, it’s probably bad for exp to inline (it’s a lot of code).

@Oscar_Smith I should have clarified: the above was measured on 1.7.3. On 1.8, the runtime of expfunc2! drops by another factor of ~3x thanks to exp’s inlining and vectorization.

@btime expfunc1!(v);  # <-- 156.500 μs  (julia 1.7.3); 48.700 μs (julia 1.8-rc4)
@btime expfunc2!(v);  # <-- 52.800 μs  (julia 1.7.3); 15.200 μs (julia 1.8-rc4)

The same performance disparity between implementations occurs when calling another function, say log, that isn’t inlined. The numbers then remain essentially the same between 1.7 and 1.8.

(Separately, is there a way to manually force the inlining and vectorization of log or exp once its default inlining is patched out?)

Yes, it appears that from v1.8 you can use @inline at callsites: optimizer: supports callsite annotations of inlining, fixes #18773 by aviatesk Β· Pull Request #41328 Β· JuliaLang/julia Β· GitHub

Also, you can use LoopVectorization to get handwritten vectorized versions.

@DNF, @Oscar_Smith : Thanks! Trying @inline on this particular example (say with log) leads to inlining but no vectorization. But using LoopVectorization’s @turbo both inlines and vectorizes.
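For reference, here is a minimal sketch of the two variants just described (function names invented for illustration; the `@turbo` version assumes LoopVectorization is installed):

```julia
using LoopVectorization  # provides @turbo

# Callsite inlining (Julia >= 1.8): forces log to inline at this call,
# though the surrounding loop may still not auto-vectorize.
function logmap_inline!(vec)
    for i in eachindex(vec)
        @inbounds vec[i] = @inline log(vec[i])
    end
    return vec
end

# @turbo substitutes LoopVectorization's SIMD-friendly log and
# vectorizes the whole loop.
function logmap_turbo!(vec)
    @turbo for i in eachindex(vec)
        vec[i] = log(vec[i])
    end
    return vec
end
```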

Okay - so back to the original question: Any idea why expfunc1! is so much slower than expfunc2!? My naive expectation would have been that since expfunc2! needs to scan the array twice, it requires more memory accesses, and potentially leads to more cache misses. Either that’s not the case, or it is shadowed by a bigger gain.

The performance difference between expfunc1! and expfunc2! arises from the vectorization of random number generation, which currently only occurs in calls to rand!. This can be demonstrated by calling @code_typed on each of the functions. Or, to remove the extraneous details introduced by exp, see below.

julia> function randeach!(x)
           @inbounds for i ∈ eachindex(x)
               x[i] = rand()
           end
           x
       end;

Then inspect it with @code_typed:

julia> @code_typed randeach!(v)
CodeInfo(
1 ── %1  = Base.arraysize(x, 1)::Int64
│    %2  = Base.slt_int(%1, 0)::Bool
│    %3  = Core.ifelse::typeof(Core.ifelse)
│    %4  = (%3)(%2, 0, %1)::Int64
│    %5  = Base.slt_int(%4, 1)::Bool
└───       goto #3 if not %5
2 ──       goto #4
3 ──       goto #4
4 ┄─ %9  = φ (#2 => true, #3 => false)::Bool
│    %10 = φ (#3 => 1)::Int64
│    %11 = φ (#3 => 1)::Int64
│    %12 = Base.not_int(%9)::Bool
└───       goto #10 if not %12
5 ┄─ %14 = φ (#4 => %10, #9 => %66)::Int64
│    %15 = φ (#4 => %11, #9 => %67)::Int64
│    %16 = $(Expr(:foreigncall, :(:jl_get_current_task), Ref{Task}, svec(), 0, :(:ccall)))::Task
│    %17 = Base.getfield(%16, :rngState0)::UInt64
│    %18 = Base.getfield(%16, :rngState1)::UInt64
│    %19 = Base.getfield(%16, :rngState2)::UInt64
│    %20 = Base.getfield(%16, :rngState3)::UInt64
│    %21 = Base.add_int(%17, %20)::UInt64
│    %22 = Base.shl_int(%21, 0x0000000000000017)::UInt64
│    %23 = Base.lshr_int(%21, 0xffffffffffffffe9)::UInt64
│    %24 = Core.ifelse::typeof(Core.ifelse)
│    %25 = (%24)(true, %22, %23)::UInt64
│    %26 = Base.lshr_int(%21, 0x0000000000000029)::UInt64
│    %27 = Base.shl_int(%21, 0xffffffffffffffd7)::UInt64
│    %28 = Core.ifelse::typeof(Core.ifelse)
│    %29 = (%28)(true, %26, %27)::UInt64
│    %30 = Base.or_int(%25, %29)::UInt64
│    %31 = Base.add_int(%30, %17)::UInt64
│    %32 = Base.shl_int(%18, 0x0000000000000011)::UInt64
│    %33 = Base.lshr_int(%18, 0xffffffffffffffef)::UInt64
│    %34 = Core.ifelse::typeof(Core.ifelse)
│    %35 = (%34)(true, %32, %33)::UInt64
│    %36 = Base.xor_int(%19, %17)::UInt64
│    %37 = Base.xor_int(%20, %18)::UInt64
│    %38 = Base.xor_int(%18, %36)::UInt64
│    %39 = Base.xor_int(%17, %37)::UInt64
│    %40 = Base.xor_int(%36, %35)::UInt64
│    %41 = Base.shl_int(%37, 0x000000000000002d)::UInt64
│    %42 = Base.lshr_int(%37, 0xffffffffffffffd3)::UInt64
│    %43 = Core.ifelse::typeof(Core.ifelse)
│    %44 = (%43)(true, %41, %42)::UInt64
│    %45 = Base.lshr_int(%37, 0x0000000000000013)::UInt64
│    %46 = Base.shl_int(%37, 0xffffffffffffffed)::UInt64
│    %47 = Core.ifelse::typeof(Core.ifelse)
│    %48 = (%47)(true, %45, %46)::UInt64
│    %49 = Base.or_int(%44, %48)::UInt64
│          Base.setfield!(%16, :rngState0, %39)::UInt64
│          Base.setfield!(%16, :rngState1, %38)::UInt64
│          Base.setfield!(%16, :rngState2, %40)::UInt64
│          Base.setfield!(%16, :rngState3, %49)::UInt64
│    %54 = Base.lshr_int(%31, 0x000000000000000b)::UInt64
│    %55 = Base.shl_int(%31, 0xfffffffffffffff5)::UInt64
│    %56 = Core.ifelse::typeof(Core.ifelse)
│    %57 = (%56)(true, %54, %55)::UInt64
│    %58 = Base.uitofp(Float64, %57)::Float64
│    %59 = Base.mul_float(%58, 1.1102230246251565e-16)::Float64
│          Base.arrayset(false, x, %59, %14)::Vector{Float64}
│    %61 = (%15 === %4)::Bool
└───       goto #7 if not %61
6 ──       goto #8
7 ── %64 = Base.add_int(%15, 1)::Int64
└───       goto #8
8 ┄─ %66 = φ (#7 => %64)::Int64
│    %67 = φ (#7 => %64)::Int64
│    %68 = φ (#6 => true, #7 => false)::Bool
│    %69 = Base.not_int(%68)::Bool
└───       goto #10 if not %69
9 ──       goto #5
10 ┄       return x
) => Vector{Float64}

And compare with rand!:

julia> @code_typed rand!(v)
CodeInfo(
1 ─ %1  = $(Expr(:gc_preserve_begin, Core.Argument(2)))
│   %2  = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(2)))::Ptr{Float64}
│   %3  = Base.bitcast(Ptr{UInt8}, %2)::Ptr{UInt8}
│   %4  = Base.arraylen(A)::Int64
│   %5  = Base.mul_int(%4, 8)::Int64
│   %6  = Random.XoshiroSimd.Float64::Type{Float64}
│   %7  = Random.XoshiroSimd._bits2float::typeof(Random.XoshiroSimd._bits2float)
│   %8  = Base.sle_int(64, %5)::Bool
└──       goto #3 if not %8
2 ─ %10 = invoke Random.XoshiroSimd.xoshiro_bulk_simd($(QuoteNode(TaskLocalRNG()))::TaskLocalRNG, %3::Ptr{UInt8}, %5::Int64, %6::Type{Float64}, $(QuoteNode(Val{8}()))::Val{8}, %7::typeof(Random.XoshiroSimd._bits2float))::Int64
│   %11 = Base.sub_int(%5, %10)::Int64
│   %12 = Core.bitcast(Core.UInt, %3)::UInt64
│   %13 = Base.bitcast(UInt64, %10)::UInt64
│   %14 = Base.add_ptr(%12, %13)::UInt64
└── %15 = Core.bitcast(Ptr{UInt8}, %14)::Ptr{UInt8}
3 ┄ %16 = φ (#2 => %15, #1 => %3)::Ptr{UInt8}
│   %17 = φ (#2 => %11, #1 => %5)::Int64
│   %18 = (%17 === 0)::Bool
│   %19 = Base.not_int(%18)::Bool
└──       goto #5 if not %19
4 ─       invoke Random.XoshiroSimd.xoshiro_bulk_nosimd($(QuoteNode(TaskLocalRNG()))::TaskLocalRNG, %16::Ptr{UInt8}, %17::Int64, %6::Type{Float64}, %7::typeof(Random.XoshiroSimd._bits2float))::Any
5 ┄       goto #6
6 ─       $(Expr(:gc_preserve_end, :(%1)))
└──       goto #7
7 ─       goto #8
8 ─       goto #9
9 ─       return A
) => Vector{Float64}


@AndrewRadcliffe that makes sense. Benchmarking looping over rand() versus rand!(v) indeed yields a similar 3x performance boost. Yet this amounts to only a ~6 μs difference (see my original post), so I don’t yet understand why it is magnified to tens of μs when exp is introduced?

Changing exp to log also causes a similarly large performance difference, it seems.

function expfunc3!(vec)
    @. vec = log(rand());
end

function expfunc4!(vec)
    rand!(vec);
    @. vec = log(vec);
end

v = zeros(10000);
@btime expfunc3!(v)
# 236.399 μs (0 allocations: 0 bytes)

v = zeros(10000);
@btime expfunc4!(v)
# 95.041 μs (0 allocations: 0 bytes)
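Given the explanation above, the two optimizations should compose: bulk SIMD generation via rand!, then a SIMD log via @turbo. An untested sketch (name invented for illustration; assumes LoopVectorization is installed):

```julia
using Random, LoopVectorization

# Hedged sketch: combine bulk SIMD random generation (rand!) with
# LoopVectorization's SIMD log, so both passes over the array vectorize.
function logfunc_combined!(vec)
    rand!(vec)            # vectorized bulk generation (XoshiroSimd)
    @turbo for i in eachindex(vec)
        vec[i] = log(vec[i])
    end
    return vec
end
```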