Possible inefficiency in randn!

benchmark
#1

The following two functions seem to do exactly the same thing, but the built-in version (randn!) is some 60% slower. Am I missing something or could randn! be implemented more efficiently?

using Random, BenchmarkTools
broadcast_randn(x) = (x .= randn.())
inplace_randn(x) = randn!(x)

x = zeros(10_000)
@btime broadcast_randn($x)
  41.680 μs (0 allocations: 0 bytes)
@btime inplace_randn($x)
  68.003 μs (0 allocations: 0 bytes)
#2

Looking at the output of @code_llvm, it seems that the broadcasted version takes advantage of SIMD instructions. The base implementation of the in-place version is not too complicated:

function $randfun!(rng::AbstractRNG, A::AbstractArray{T}) where T
    for i in eachindex(A)
        @inbounds A[i] = $randfun(rng, T)
    end
    A
end
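One rough way to check for vector instructions without reading the whole @code_llvm dump is to capture the IR as a string and search it. The sketch below is only a heuristic: `mentions_vector_double` and `fill_randn!` are hypothetical names made up for illustration, and a match on a type like `<4 x double>` does not prove that the hot loop itself is vectorized.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical helper: true if the optimized LLVM IR for f(::Vector{Float64})
# mentions a vector-of-double type such as `<4 x double>`.
function mentions_vector_double(f)
    ir = sprint(io -> code_llvm(io, f, (Vector{Float64},); debuginfo=:none))
    return occursin(r"<\d+ x double>", ir)
end

# Same body as the broadcasted version from the first post.
fill_randn!(x) = (x .= randn.())

has_vectors = mentions_vector_double(fill_randn!)
println("IR mentions vector-of-double types: ", has_vectors)
```

The result depends on the Julia version and the CPU target, so treat it as a starting point for reading the IR rather than a verdict.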

By including a @simd annotation, I recover the same performance as the broadcasted version:

function myrandn!(x)
   @inbounds @simd for i in eachindex(x)
       x[i] = randn()
   end
   x
end

However, I am not sure @simd is safe to use with the random number generator: every call to randn reads and updates the RNG state, which looks like a memory dependency between loop iterations. The specifics are beyond me.
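To make the state dependency concrete, here is a minimal sketch using an explicit MersenneTwister (the seed 1234 is arbitrary and chosen only for illustration). It shows that each call advances the generator, so the value you get depends on how many calls came before it. It does not settle whether @simd is actually unsafe in the loop above.

```julia
using Random

rng = MersenneTwister(1234)   # arbitrary seed, for illustration only
a = randn(rng)
b = randn(rng)                # second draw: the first call advanced rng

Random.seed!(rng, 1234)       # rewind rng to the same starting state
a2 = randn(rng)

println(a == a2)   # same state in, same value out
println(a != b)    # consecutive draws come from different states
```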

#3

The @simd loop will not generate any SIMD instructions, so the macro must be changing the structure of the loop slightly, and in this case that appears to matter.

#4

Diffing the asm generated in both cases, the only difference is that the simd version calls julia_randn_unlikely_12356, whereas the non-simd version calls randn_unlikely. No idea what that means, but probably unintended?

Edit: well, that’s not the only difference. There’s also a jne that gets swapped with a jb, but that probably shouldn’t matter?

#5

I was confused in the previous post, sorry. The slowdown has nothing to do with @simd. Compare all four versions:

using Random, BenchmarkTools
broadcast_randn(x) = (x .= randn.())
inplace_randn(x) = randn!(x)

function myrandn!(x)
    @inbounds @simd for i in eachindex(x)
        x[i] = randn()
    end
    x
end

function myrandn_nosimd!(x)
    @inbounds for i in eachindex(x)
        x[i] = randn()
    end
    x
end


x = zeros(10_000)
@btime broadcast_randn($x);
@btime inplace_randn($x);
@btime myrandn!($x);
@btime myrandn_nosimd!($x);

Of these four, only inplace_randn is slow. The real difference between the fast and slow cases is that randn isn’t inlined in randn!. This is despite randn being marked as @inline, so it looks like the inlining heuristics are being too conservative here.
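One way to see whether a callee was inlined is to look for a surviving call in the output of code_typed. The sketch below uses toy functions rather than randn itself; `square_inline`, `square_noinline`, `sum_inline`, `sum_noinline`, and `count_mentions` are all hypothetical names introduced for illustration, and matching on the printed statement text is a crude check rather than a robust tool.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

@inline square_inline(v) = v * v
@noinline square_noinline(v) = v * v

function sum_inline(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += square_inline(x[i])
    end
    return s
end

function sum_noinline(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += square_noinline(x[i])
    end
    return s
end

# Count statements in the optimized typed IR whose printed form still
# mentions `name`. A surviving mention means the callee was not inlined.
function count_mentions(f, name)
    ci, _ = first(code_typed(f, (Vector{Float64},)))
    return count(stmt -> occursin(name, string(stmt)), ci.code)
end

println(count_mentions(sum_inline, "square_inline"))      # no call left
println(count_mentions(sum_noinline, "square_noinline"))  # call survives
```

When the callee is inlined, its body (here a floating-point multiply) appears directly in the loop and the function name disappears from the IR.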

#6

Opened an issue: https://github.com/JuliaLang/julia/issues/31607
