While trying to fix the FIXMEs in “base/random/RNGs.jl”, I saw performance degrade in a very unexpected way. It basically amounts to the following benchmark. I would be grateful if someone could point me to a workaround that makes the pointer version as fast as the array version!
function fillf(a::Array, n, f)
    for i = 1:n
        @inbounds a[i] = f()
    end
end

function fillf(a::Ptr, n, f)
    for i = 1:n
        unsafe_store!(a, f(), i)
    end
end
using BenchmarkTools

a = zeros(500)
pa = pointer(a)

@btime fillf($a, 500, ()->1.0)   # -> 45 ns
@btime fillf($pa, 500, ()->1.0)  # -> 45 ns  # similar
@btime fillf($a, 500, randn)     # -> 2.2 μs
@btime fillf($pa, 500, randn)    # -> 2.2 μs # similar
@btime fillf($a, 500, rand)      # -> 1.1 μs
@btime fillf($pa, 500, rand)     # -> 1.6 μs # ???
In this last case, how come the array version can be so much faster? Is the compiler clever enough to understand how rand works, to the point of being able to exploit it together with some knowledge it has of the array?
Thanks for any hint!
OK, thanks a lot. So aren’t there tricks to mitigate the pessimization? On the one hand, arrays must be avoided (in the FIXME I was referring to), but having no way to get a less safe version of them without a performance regression is really surprising.
I could not reproduce; could you post the offending @code_llvm and @code_native?
Does the problem persist for @noinline g() = rand() ?
Naively, I’d guess that the main problem is that LLVM cannot know that the unsafe_store! does not overlap the internal state of rand, which might prevent loop unrolling?
PS. You can check this theory by manually unrolling the loop (e.g. first collect 4x rand into loop-local vars, then store all four)
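Something like this (my own sketch, not tested; it assumes n is a multiple of 4 and reuses the fillf/pa setup from above):

function fillf_unrolled4(a::Ptr, n, f)
    for i = 1:4:n
        # first collect four values into loop-local vars ...
        x1 = f(); x2 = f(); x3 = f(); x4 = f()
        # ... then do all four stores, so no store sits between two f() calls
        unsafe_store!(a, x1, i)
        unsafe_store!(a, x2, i + 1)
        unsafe_store!(a, x3, i + 2)
        unsafe_store!(a, x4, i + 3)
    end
end

@btime fillf_unrolled4($pa, 500, rand)

If the theory is right, this variant should recover most of the gap to the Array version.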
About the @noinline g() = rand() test: if my theory were correct, then I would expect both fills to perform equally badly (and slower than filling with rand). I got this theory by observing that randn is not inlined, and that ()->1.0 has no internal state that can alias the unsafe_store!.
But I’m happy that the unrolling helped you. Unfortunately I don’t know how to pass aliasing info to LLVM, and manual unrolling does not yield nice code. Maybe try @simd?
I don’t think you will get simd code, but this may smuggle an “undefined if aliased” assumption into the compiler.
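Concretely, something along these lines (a sketch only; @simd merely promises the compiler that iterations are independent and may be reordered, it does not guarantee vectorized code):

function fillf_simd(a::Ptr, n, f)
    @simd for i = 1:n
        unsafe_store!(a, f(), i)
    end
end

@btime fillf_simd($pa, 500, rand)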
Wow, I would never have found it myself, thanks! There were no random benchmarks yet at that time, so Nanosoldier didn’t have a chance to notice a perf improvement.
For future reference (for someone like me who is not familiar with these questions): the lack of knowledge that two arrays don’t alias prevents the optimizer from reordering instructions (like what @foobar_lv2 proposed, i.e. “first collect 4x rand into loop-local vars, then store all four”); it’s not only the unrolling that matters, as simply unrolling by manually writing 4 unsafe_store!(a, rand(), some_indice) lines doesn’t help with the performance. So my guess is that ultimately it’s a matter of cache misses or something like that?
Thank you all for helping me understand more on this topic, and there may be hope to have a julia-level way to annotate aliasing information (https://github.com/JuliaLang/julia/issues/19658) !
I don’t think it is a cache miss, but rather that the compiler cannot know whether the unsafe_store! stores to the location you are reading from next, so you need to read that location from the stack again after the unsafe_store!. In other words, the instruction has a dependency on the previous one, so reordering instructions could change the behavior of the program.
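As a hypothetical illustration of that dependency (names are mine, not from the original code):

function step(p::Ptr{Float64}, state::Ref{Float64})
    x = state[]              # load the state
    unsafe_store!(p, x, 1)   # the compiler cannot rule out that p points at state
    return state[]           # so this load must be redone; it cannot just reuse x
end

With a plain Array destination the compiler presumably knows the store goes into that array’s buffer rather than into the RNG state, which is why the reload can be dropped there.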
as simply unrolling by manually writing 4 unsafe_store!(a, rand(), some_indice) lines doesn’t help with the performance
I think this is about the random number generation. Your loop:
(1) reads global rng state (module-level global)
(2) does some computations (mersenne twisting)
(3) writes new rng state
(4) unsafe_store! the random number.
Unroll this:
(1a) reads global rng state (module-level global)
(2a) does some computations (mersenne twisting)
(3a) writes new rng state
(4a) unsafe_store! the random number.
(1b) reads global rng state (module-level global)
(2b) does some computations (mersenne twisting)
(3b) writes new rng state
(4b) unsafe_store! the random number.
Now, if we know that the unsafe_store! does not point to the module-level global RNG state, then we can skip (3a) and (1b). These are in cache anyway, but register is faster than L1.
Re kristoffer: I think that this would not be a big problem for local variables. The reason is that many CPUs have some hardware support to keep stack variables in registers (the code_native is lying!) and do some panicked roll-back / pipeline flushing if an unexpected alias happens. But don’t hold me to this; I have not read my CPU’s optimization manual. Oh, but remembering register renaming, I would not be terribly surprised if some weird CPU somehow manages to magically make the badly-unrolled loop fast.
Hence my advice to separate the rand() calls and the unsafe_store! calls in the unrolled loop.
PS. AFAIK, the lying code_native is the only reason why the register-starved x86 is performant at all: In reality, it is not register-starved, the registers are just not user-addressable and instead used by the final opaque optimization process that the CPU does. This even makes sense, because (1) many compilers are not that smart, and (2) this allows code written for old compilers to make use of new CPUs.
In other words, we should expect the split_view API to be more performant than fastview because of better aliasing info. And while I would not be opposed to implementing the fastview in LLVM, I see no way in https://llvm.org/docs/LangRef.html#noalias of expressing the needed info (because we can’t access the names of the aliasing scopes of surrounding code if we are inlined). Maybe someone else here is better at LLVM?
I still think that fastview now, and hopefully view for later julia versions, is a more pragmatic approach, even though @yuyichao would probably disagree.
PS. At least in my testing, the split_view was not faster; but whether the aliasing info helps is very context-dependent (and probably also CPU and specific julia-version dependent)
Since the view problem seems fixed on 0.7 (as long as evaluate is inlined), which will be released within a timespan on the order of months, it might be best to just scrap the fastview and wait for the optimizations.
Are you sure that the 4x performance gain in evaluate for 0.6 users is not worth the change? It is somewhat hacky, true (or was my testing wrong? Apparently there were two issues: the allocating view and the missing @inline for the fast path).
I don’t have such a strong opinion anymore. I can use the fastview branch myself for computations and profiling, and since I now know that evaluate will become fast soon, it’s OK for my stuff to depend on Distances.jl.