Unsafe_store! sometimes slower than arrays' setindex!?

While trying to fix the FIXMEs in “base/random/RNGs.jl”, I saw performance degrade in a very unexpected way. Basically it amounts to the following benchmark. I would be grateful if someone could point me to a workaround which makes the pointer version as fast as the array version!

function fillf(a::Array, n, f)
    for i=1:n
        @inbounds a[i] =  f()
    end
end

function fillf(a::Ptr, n, f)
    for i=1:n
        unsafe_store!(a, f(), i)
    end
end

using BenchmarkTools  # for @btime

a = zeros(500)
pa = pointer(a)

@btime fillf($a,  500, ()->1.0) # -> 45 ns
@btime fillf($pa, 500, ()->1.0) # -> 45 ns  # similar
@btime fillf($a,  500, randn)   # -> 2.2 μs
@btime fillf($pa, 500, randn)   # -> 2.2 μs # similar
@btime fillf($a,  500, rand)    # -> 1.1 μs
@btime fillf($pa, 500, rand)    # -> 1.6 μs # ???

In this last case, how come the array version can be so much faster? Is the compiler clever enough to understand how rand works, to the point of being able to exploit that together with some knowledge it has of the array?
Thanks for any hint!

Using unsafe functions forces the compiler to pessimize significantly. It is certainly possible that this is the cause of the timing difference.

Ok, thanks a lot. So are there no tricks to mitigate the pessimization? On the one hand, arrays must be avoided (in the FIXME I was referring to), but on the other, having no way to get a less safe version of them without a performance regression is really surprising.

I think that code can now use the new reinterpret (returning a ReinterpretArray) to get correct results.
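
For instance, something along these lines (a rough sketch on 0.7/master, assuming the goal is to reuse a Float64 buffer as UInt64 storage without copying; the actual code in RNGs.jl will differ):

a  = zeros(Float64, 4)
ua = reinterpret(UInt64, a)    # lazy ReinterpretArray sharing a's memory, no raw pointers
ua[1] = 0x3ff0000000000000     # bit pattern of 1.0
a[1]                           # == 1.0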

I could not reproduce; could you post the offending @code_llvm and @code_native?

Does the problem persist for @noinline g() = rand()?
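
I.e. something along these lines (sketch, reusing fillf, a and pa from the first post):

@noinline g() = rand()   # opaque call: the optimizer can no longer reason about rand's state

@btime fillf($a,  500, g)
@btime fillf($pa, 500, g)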

Naively, I’d guess that the main problem is that llvm cannot know that the unsafe_store! does not overlap the internal state of rand, which might prevent loop unrolling?

PS. You can check this theory by manually unrolling the loop (e.g. first collect 4x rand into loop-local vars, then store all four)
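
Roughly like this (untested sketch; fillf_unrolled is just an illustrative name):

function fillf_unrolled(a::Ptr, n, f)
    # draw four values into locals first, then store them, so no
    # unsafe_store! sits between two accesses of the RNG state
    i = 1
    while i + 3 <= n
        x1 = f(); x2 = f(); x3 = f(); x4 = f()
        unsafe_store!(a, x1, i)
        unsafe_store!(a, x2, i + 1)
        unsafe_store!(a, x3, i + 2)
        unsafe_store!(a, x4, i + 3)
        i += 4
    end
    while i <= n             # remainder when n is not a multiple of 4
        unsafe_store!(a, f(), i)
        i += 1
    end
end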

Are you using a recent master, or version 0.6?

Yes, and the timings are worse.

That’s great, indeed this solves the performance problem! I must admit I don’t understand the reason why unrolling helps here.

Since you asked, here is the code_llvm:

; Function fillf
; Location: REPL[3]:2
define void @julia_fillf_64598(i64, i64) {
top:
  %gcframe2 = alloca [3 x %jl_value_t addrspace(10)*], align 8
  %gcframe2.sub = getelementptr inbounds [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %gcframe2, i64 0, i64 0
  %2 = getelementptr inbounds [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %gcframe2, i64 0, i64 1
  %3 = bitcast %jl_value_t addrspace(10)** %2 to i8*
  call void @llvm.memset.p0i8.i32(i8* %3, i8 0, i32 16, i32 8, i1 false)
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #2
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -10920
  %4 = bitcast [3 x %jl_value_t addrspace(10)*]* %gcframe2 to i64*
  store i64 2, i64* %4, align 8
  %5 = getelementptr [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %gcframe2, i64 0, i64 1
  %6 = bitcast i8* %ptls_i8 to i64*
  %7 = load i64, i64* %6, align 8
  %8 = bitcast %jl_value_t addrspace(10)** %5 to i64*
  store i64 %7, i64* %8, align 8
  %9 = bitcast i8* %ptls_i8 to %jl_value_t addrspace(10)***
  store %jl_value_t addrspace(10)** %gcframe2.sub, %jl_value_t addrspace(10)*** %9, align 8
  %10 = icmp slt i64 %1, 1
  br i1 %10, label %L50, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %11 = getelementptr [3 x %jl_value_t addrspace(10)*], [3 x %jl_value_t addrspace(10)*]* %gcframe2, i64 0, i64 2
; Location: REPL[3]:3
  %12 = inttoptr i64 %0 to double*
; Location: REPL[3]:2
  br label %if

if:                                               ; preds = %if.lr.ph, %L37
  %"#temp#.03" = phi i64 [ 1, %if.lr.ph ], [ %25, %L37 ]
; Location: REPL[3]:3
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/RNGs.jl:239
; Function reserve_1; {
; Location: random/RNGs.jl:151
; Function mt_empty; {
; Location: random/RNGs.jl:141
  %13 = load i64, i64 addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 24) to i64 addrspace(11)*), align 8
;}
  %14 = icmp eq i64 %13, 382
  br i1 %14, label %if1, label %L37

L50.loopexit:                                     ; preds = %L37
;}}}}
  br label %L50

L50:                                              ; preds = %L50.loopexit, %top
  %15 = load i64, i64* %8, align 8
  store i64 %15, i64* %6, align 8
  ret void

if1:                                              ; preds = %if
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/RNGs.jl:239
; Function reserve_1; {
; Location: random/RNGs.jl:151
; Function gen_rand; {
; Location: random/RNGs.jl:147
; Function macro expansion; {
; Location: pointer.jl:169
  %16 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 8) to %jl_value_t addrspace(10)* addrspace(11)*), align 8
  %17 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 16) to %jl_value_t addrspace(10)* addrspace(11)*), align 16
  %18 = addrspacecast %jl_value_t addrspace(10)* %17 to %jl_value_t addrspace(11)*
  %19 = bitcast %jl_value_t addrspace(11)* %18 to i64 addrspace(11)*
  %20 = load i64, i64 addrspace(11)* %19, align 8
  %21 = bitcast %jl_value_t addrspace(11)* %18 to %jl_array_t addrspace(11)*
  %22 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %21, i64 0, i32 1
  %23 = load i64, i64 addrspace(11)* %22, align 8
  store %jl_value_t addrspace(10)* %16, %jl_value_t addrspace(10)** %11, align 8
  call void @"jlsys_dsfmt_fill_array_close1_open2!_36940"(%jl_value_t addrspace(10)* nonnull %16, i64 %20, i64 %23)
;}
; Location: random/RNGs.jl:148
; Function mt_setfull!; {
; Location: random/RNGs.jl:142
  store i64 0, i64 addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 24) to i64 addrspace(11)*), align 8
;}}
  br label %L37
L37:                                              ; preds = %if, %if1
;}
; Function rand_inbounds; {
; Location: random/RNGs.jl:216
; Function rand_inbounds; {
; Location: random/RNGs.jl:215
; Function mt_pop!; {
; Location: random/RNGs.jl:144
  %24 = phi i64 [ %13, %if ], [ 0, %if1 ]
;}}}}}}
; Location: REPL[3]:2
  %25 = add i64 %"#temp#.03", 1
; Location: REPL[3]:3
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/random.jl:106
; Function rand; {
; Location: random/RNGs.jl:239
; Function rand_inbounds; {
; Location: random/RNGs.jl:216
; Function rand_inbounds; {
; Location: random/RNGs.jl:215
; Function mt_pop!; {
; Location: random/RNGs.jl:144
  %26 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 16) to %jl_value_t addrspace(10)* addrspace(11)*), align 16
  %27 = add i64 %24, 1
  store i64 %27, i64 addrspace(11)* bitcast (i8 addrspace(11)* getelementptr (i8, i8 addrspace(11)* addrspacecast (i8* inttoptr (i64 140294998674480 to i8*) to i8 addrspace(11)*), i64 24) to i64 addrspace(11)*), align 8
;}}}}}}
  %28 = addrspacecast %jl_value_t addrspace(10)* %26 to %jl_value_t addrspace(11)*
  %29 = bitcast %jl_value_t addrspace(11)* %28 to double* addrspace(11)*
  %30 = load double*, double* addrspace(11)* %29, align 8
  %31 = getelementptr double, double* %30, i64 %24
  %32 = load double, double* %31, align 8
  %33 = fadd double %32, -1.000000e+00
  %34 = add i64 %"#temp#.03", -1
  %35 = getelementptr double, double* %12, i64 %34
  store double %33, double* %35, align 1
; Location: REPL[3]:2
  %36 = icmp eq i64 %"#temp#.03", %1
  br i1 %36, label %L50.loopexit, label %if
}

And code_native:

       .text
; Function fillf {
; Location: REPL[3]:2
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        subq    $40, %rsp
        movq    %rsi, %r14
        movq    %rdi, %rbx
        vxorpd  %xmm0, %xmm0, %xmm0
        vmovupd %xmm0, 24(%rsp)
        movq    %fs:0, %rcx
        movq    $2, 16(%rsp)
        movq    -10920(%rcx), %rax
        movq    %rax, 24(%rsp)
        leaq    16(%rsp), %rax
        movq    %rcx, (%rsp)
        movq    %rax, -10920(%rcx)
        testq   %r14, %r14
        jle     L245
; Location: REPL[3]:3
; Function rand; {
; Location: random.jl:106
; Function rand; {
; Location: random.jl:106
; Function rand; {
; Location: RNGs.jl:239
; Function reserve_1; {
; Location: RNGs.jl:151
; Function mt_empty; {
; Location: RNGs.jl:141
        movabsq $jl_system_image_data, %r12
;}
; Function gen_rand; {
; Location: RNGs.jl:147
; Function macro expansion; {
; Location: pointer.jl:169
        movabsq $jl_system_image_data, %r13
        movabsq $jl_system_image_data, %rbp
        movabsq $"dsfmt_fill_array_close1_open2!", %r15
        movabsq $140294849439648, %rax  # imm = 0x7F98F0AA27A0
        vmovsd  (%rax), %xmm1           # xmm1 = mem[0],zero
        vmovsd  %xmm1, 8(%rsp)
        nopw    %cs:(%rax,%rax)
;}}
; Function mt_empty; {
; Location: RNGs.jl:141
L160:
        movq    (%r12), %rax
;}
        cmpq    $382, %rax              # imm = 0x17E
        jne     L211
; Function gen_rand; {
; Location: RNGs.jl:147
; Function macro expansion; {
; Location: pointer.jl:169
        movq    (%r13), %rdi
        movq    (%rbp), %rax
        movq    (%rax), %rsi
        movq    8(%rax), %rdx
        movq    %rdi, 32(%rsp)
        callq   *%r15
        vmovsd  8(%rsp), %xmm1          # xmm1 = mem[0],zero
;}
; Location: RNGs.jl:148
; Function mt_setfull!; {
; Location: RNGs.jl:142
        movq    $0, (%r12)
;}}}
; Function rand_inbounds; {
; Location: RNGs.jl:216
; Function rand_inbounds; {
; Location: RNGs.jl:215
; Function mt_pop!; {
; Location: RNGs.jl:144
        xorl    %eax, %eax
L211:
        movq    (%rbp), %rcx
        leaq    1(%rax), %rdx
        movq    %rdx, (%r12)
;}}}}}}
        movq    (%rcx), %rcx
        vaddsd  (%rcx,%rax,8), %xmm1, %xmm0
        vmovsd  %xmm0, (%rbx)
; Location: REPL[3]:2
        addq    $8, %rbx
        addq    $-1, %r14
        jne     L160
; Location: REPL[3]:3
L245:
        movq    24(%rsp), %rax
        movq    (%rsp), %rcx
        movq    %rax, -10920(%rcx)
        addq    $40, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)
;}

I’m mostly using 0.6.

About the @noinline g() = rand() test: if my theory were correct, I would expect both fills to perform equally badly (and slower than filling with rand). I arrived at this theory by observing that randn is not inlined, and that ()->1.0 has no internal state that could alias the unsafe_store!.

But I’m happy that the unrolling helped you. Unfortunately I don’t know how to pass aliasing info to llvm, and manual unrolling does not yield nice code. Maybe try @simd?

I don’t think you will get simd code, but this may smuggle an “undefined if aliased” assumption into the compiler.
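
E.g. (sketch; fillf_simd is just an illustrative name):

function fillf_simd(a::Ptr, n, f)
    # @simd asserts that iterations may be reordered; the hope is that this
    # lets the compiler assume the stores don't interfere with f's state
    # (no guarantee that anything actually gets vectorized here)
    @simd for i = 1:n
        unsafe_store!(a, f(), i)
    end
end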

Actually you are right about that too (I think I mixed up the function to benchmark with the unrolled one).

It doesn’t seem to help.

Anyway, thanks for your help; unrolled code is indeed not nice, but it’s good to know it’s possible if you need the last bit of performance.

So, do you believe that the julia implementation of builtin arrays passes aliasing info to llvm?

It does. See e.g. https://github.com/JuliaLang/julia/pull/21262

Wow, I would never have found it myself, thanks! There were no random benchmarks yet at that time, so Nanosoldier didn’t have a chance to notice a perf improvement.

For future reference (for someone like me not familiar with these questions): the lack of knowledge that two arrays don’t alias prevents the optimizer from reordering instructions (like what @foobar_lv2 proposed, i.e. “first collect 4x rand into loop-local vars, then store all four”); it’s not only the unrolling that matters, as simply unrolling by manually writing 4 unsafe_store!(a, rand(), some_indice) lines doesn’t help with the performance. So my guess is that ultimately it’s a matter of cache misses or something like that?
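
Concretely, the interleaved unrolling I mean looked roughly like this (illustrative sketch, assuming 4 divides n):

function fillf_interleaved(a::Ptr, n)
    # each unsafe_store! still sits between two rand() calls, so the
    # possible aliasing with the RNG state is still there
    for i = 1:4:n
        unsafe_store!(a, rand(), i)
        unsafe_store!(a, rand(), i + 1)
        unsafe_store!(a, rand(), i + 2)
        unsafe_store!(a, rand(), i + 3)
    end
end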

Thank you all for helping me understand more about this topic, and there may be hope for a julia-level way to annotate aliasing information (https://github.com/JuliaLang/julia/issues/19658)!

I don’t think it is a cache miss, but rather that the compiler cannot know whether the unsafe_store! writes to the location you are reading from next, so that location has to be read from memory again after the unsafe_store!. In other words, the instruction has a dependency on the previous one, so reordering instructions could change the behavior of the program.

as simply unrolling by manually writing 4 unsafe_store!(a, rand(), some_indice) lines doesn’t help with the performance

I think this is about the random number generation. Your loop:
(1) reads global rng state (module-level global)
(2) does some computations (mersenne twisting)
(3) writes new rng state
(4) unsafe_store! the random number.

Unroll this:

(1a) reads global rng state (module-level global)
(2a) does some computations (mersenne twisting)
(3a) writes new rng state
(4a) unsafe_store! the random number.
(1b) reads global rng state (module-level global)
(2b) does some computations (mersenne twisting)
(3b) writes new rng state
(4b) unsafe_store! the random number.

Now, if we know that the unsafe_store! does not point to the module-level global RNG state, then we can skip (3a) and (1b). These are in cache anyway, but a register is still faster than L1.
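
In pseudo-Julia, with a toy generator standing in for the real MersenneTwister state (ToyRNG, lcg_step and toy_float are made-up names, not the actual implementation):

mutable struct ToyRNG
    s::UInt64
end

lcg_step(s::UInt64) = s * 0x5851f42d4c957f2d + 0x14057b7ef767814f   # toy LCG, not dSFMT
toy_float(s::UInt64) = (s >> 11) * (1.0 / 9007199254740992)         # 53 bits -> [0, 1)

function store_two!(a::Ptr{Float64}, i, rng::ToyRNG)
    s = rng.s                    # (1a) read the state once
    s = lcg_step(s)              # (2a)
    x1 = toy_float(s)
    s = lcg_step(s)              # (2b) -- (3a) and (1b) elided, state stays in a register
    x2 = toy_float(s)
    rng.s = s                    # (3b) single write-back
    unsafe_store!(a, x1, i)      # (4a) safe to do last once we know it can't hit rng.s
    unsafe_store!(a, x2, i + 1)  # (4b)
    return nothing
end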

Re kristoffer: I think that this would not be a big problem for local variables. The reason is that many CPUs have some hardware support for keeping stack variables in registers (the code_native is lying!) and do some panicked roll-back / pipeline flushing if an unexpected alias happens. But don’t nail me on this, I have not read my CPU’s optimization manual. Oh, but remembering register renaming, I would not be terribly surprised if some weird CPU somehow managed to magically make the badly-unrolled loop fast.

Hence my advice to separate the rand() calls and the unsafe_store! calls in the unrolled loop.

PS. AFAIK, the lying code_native is the only reason the register-starved x86 is performant at all: in reality it is not register-starved, the registers are just not user-addressable and are instead used by the final, opaque optimization pass that the CPU itself performs. This even makes sense, because (1) many compilers are not that smart, and (2) it allows code written for old compilers to make use of new CPUs.

[sorry for double posting]
@kristoffer.carlsson

I did not think of this, but this does possibly affect the fastview in https://github.com/JuliaStats/Distances.jl/pull/86.

In other words, we should expect the split_view API to be more performant than fastview because of better aliasing info. And while I would not be opposed to implementing the fastview in llvm, I see no way in https://llvm.org/docs/LangRef.html#noalias of expressing the needed info (because we can’t access the names of the aliasing scopes of surrounding code if we are inlined). Maybe someone else here is better at llvm?

I still think that fastview now, and hopefully view for later julia versions, is a more pragmatic approach, even though @yuyichao would probably disagree.

PS. At least in my testing, the split_view was not faster; but whether the aliasing info helps is very context-dependent (and probably also CPU- and julia-version-dependent).

Since the view problem seems fixed on 0.7 (as long as evaluate is inlined), which will be released on a timescale of months, it might be best just to scrap the fastview and wait for the optimizations.

Are you sure that the 4x performance gain in evaluate for 0.6 users is not worth the change? It is somewhat hacky, true (or was my testing wrong? Apparently there were two issues: the allocating view and the missing @inline for the fast path).

I don’t have such a strong opinion anymore. I myself can use the fastview branch for computations and profiling, and since I now know that evaluate will become fast soon, it’s OK for my stuff to depend on Distances.jl.