@inbounds code slower than one without

Hello everyone,
I was just noticing that this simple code with @inbounds is slower than the one without:

using BenchmarkTools

iterations = 10^7
simple_vec = zeros(Float64, iterations)

function test1(v::Vector{Float64}, iter::Int64)
    for i = 1 : iter
        v[i] = Float64(i)
    end
end

function test2(v::Vector{Float64}, iter::Int64)
    @inbounds for i = 1 : iter
        v[i] = Float64(i)
    end
end

julia> @btime test1(simple_vec, iterations)
  5.902 ms (0 allocations: 0 bytes)

julia> @btime test2(simple_vec, iterations)
  7.053 ms (0 allocations: 0 bytes)

Why is the code with @inbounds slower? Am I using it wrong?

Interesting. test2 is using SIMD instructions whereas test1 isn’t. That’s often precisely what you want to happen with @inbounds. And in fact this behaves as you’d expect on smaller arrays:

julia> iterations = 10^3
1000

julia> @btime test1(simple_vec, iterations)
  813.767 ns (0 allocations: 0 bytes)

julia> @btime test2(simple_vec, iterations)
  117.695 ns (0 allocations: 0 bytes)

That’s a near-perfect 8x speedup, as I’d expect from this AVX512 machine. But once I start exceeding my CPU cache sizes, the gains from SIMD-ification shrink. This makes sense: both functions become limited by memory speed, and since the cost of the bounds check is negligible I’d naively expect the two functions to have exactly the same speed with very large arrays like this.
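
If you want to check whether a given loop actually vectorized on your own machine, you can inspect the generated code and look for packed <4 x double> (or <8 x double>) operations; for example (the output will of course depend on your CPU and Julia version):

julia> @code_llvm test2(simple_vec, 1000)    # vectorized loops contain <4 x double> / <8 x double> stores

julia> @code_native test2(simple_vec, 1000)  # or look for vmovupd / packed vcvt instructions here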

I’m not a hardware expert, but I’d guess that the “cost” of a pipeline stall (as it waits for memory to be delivered) is more expensive on the SIMD unit.

On my machine I get:

julia> iterations = 10^3
1000

julia> @btime test1(simple_vec, iterations)
  535.317 ns (0 allocations: 0 bytes)

julia> @btime test2(simple_vec, iterations)
  658.112 ns (0 allocations: 0 bytes)

Apparently, SIMD instructions are giving me a noticeable performance hit here. Is there a way to turn off only the bounds checking, without @inbounds also enabling SIMD instructions?

I’m noticing a huge expansion of the Float64(Int64) conversion in the @inbounds version. I don’t know x86 assembly well at all, but it looks like it’s doing a vector convert from Int to Float, and then extracting the Int back from the SIMD register for the indexing?

Going from

; ┌ @ REPL[34]:3 within `test1'
; │┌ @ float.jl:60 within `Type'
        leaq    1(%rax), %r8
        vcvtsi2sdq      %r8, %xmm1, %xmm0

to

; │ @ REPL[35]:3 within `test2'
; │┌ @ float.jl:60 within `Type'
        vextracti128    $1, %ymm0, %xmm6
        vpextrq $1, %xmm6, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm1
        vmovq   %xmm6, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm6
        vmovlhps        %xmm1, %xmm6, %xmm1 # xmm1 = xmm6[0],xmm1[0]
        vpextrq $1, %xmm0, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm6
        vmovq   %xmm0, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm2
        vmovlhps        %xmm6, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm6[0]
        vinsertf128     $1, %xmm1, %ymm2, %ymm6
        vextracti128    $1, %ymm8, %xmm1
        vpextrq $1, %xmm1, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm2
        vmovq   %xmm1, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm1
        vmovlhps        %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
        vpextrq $1, %xmm8, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm2
        vmovq   %xmm8, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm3
        vmovlhps        %xmm2, %xmm3, %xmm2 # xmm2 = xmm3[0],xmm2[0]
        vinsertf128     $1, %xmm1, %ymm2, %ymm1
        vextracti128    $1, %ymm7, %xmm2
        vpextrq $1, %xmm2, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm3
        vmovq   %xmm2, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm2
        vmovlhps        %xmm3, %xmm2, %xmm2 # xmm2 = xmm2[0],xmm3[0]
        vpextrq $1, %xmm7, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm3
        vmovq   %xmm7, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm7
        vmovlhps        %xmm3, %xmm7, %xmm3 # xmm3 = xmm7[0],xmm3[0]
        vinsertf128     $1, %xmm2, %ymm3, %ymm2
        vextracti128    $1, %ymm5, %xmm3
        vpextrq $1, %xmm3, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm7
        vmovq   %xmm3, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm3
        vmovlhps        %xmm7, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm7[0]
        vpextrq $1, %xmm5, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm7
        vmovq   %xmm5, %rax
        vcvtsi2sdq      %rax, %xmm12, %xmm5
        vmovlhps        %xmm7, %xmm5, %xmm5 # xmm5 = xmm5[0],xmm7[0]

That’s repeating the same scalar conversion many times, since x86 (without AVX-512) doesn’t have a vector instruction to convert Int64 to Float64.

This has been discussed before here. AVX512, which @mbauman’s machine supports, does have an instruction to convert Int64 to Float64, which is why it is fast for him.

AVX2 and earlier do not have the instruction. Using Int32 instead should work for these architectures, assuming

julia> typemax(Int32)
2147483647

is big enough.
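
For example, a small wrapper along these lines (just an untested sketch; the name fill_indices! is arbitrary) would take the fast Int32 path whenever the length allows it and fall back to Int64 otherwise:

function fill_indices!(v::Vector{Float64}, n::Integer)
    if n <= typemax(Int32)
        # Int32 counter: AVX2 can vectorize the Int32 -> Float64 conversion
        @inbounds for i = Int32(1):Int32(n)
            v[i] = i
        end
    else
        # fallback for lengths that don't fit in an Int32
        @inbounds for i = 1:n
            v[i] = i
        end
    end
    return v
end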

As for the effect that @mbauman points out, that memory speed becomes the limiting factor as the array grows, there’s a way to mitigate that by using non-temporal memory writes. First observe the “fixed” loops for non-AVX512 CPUs:

function test1(a, n)
    for i = Int32(1):Int32(n)
        a[i] = i
    end
end

function test2(a, n)
    @inbounds for i = Int32(1):Int32(n)
        a[i] = i
    end
end

Performance for small arrays on AVX2 is now good:

julia> a = zeros(Float64, 10^7);

julia> @btime test1($a, 1000)
  563.775 ns (0 allocations: 0 bytes)

julia> @btime test2($a, 1000)
  81.377 ns (0 allocations: 0 bytes)

This corresponds to 0.56 and 0.08 ns / element. For larger arrays, performance deteriorates:

julia> @btime test1($a, 10^7)
  8.189 ms (0 allocations: 0 bytes)

julia> @btime test2($a, 10^7)
  5.930 ms (0 allocations: 0 bytes)

This is 0.82 and 0.59 ns / element. Now, the non-temporal AVX2 version:

using SIMD

function test_nt!(a, n)
    v = Vec{4,Float64}((0, 1, 2, 3))
    @simd for i = 1:4:n-3
        vstorent(i + v, a, i)
    end
    # handle remaining 0-3 elements
    @inbounds for i = (n&~3)+1:n
        a[i] = i
    end
end

Verification:

julia> a = rand(10^7); test2(a, 10^7)

julia> b = rand(10^7); test_nt!(b, 10^7);

julia> a == b
true

Performance:

julia> @btime test_nt!($a, 10^7)
  2.470 ms (0 allocations: 0 bytes)

0.25 ns / element; more than 2x as fast as the previous SIMD version.

It’s worth pointing out, though, that this is only true when n is large:

julia> @btime test1($a, 200)
  236.358 ns (0 allocations: 0 bytes)

julia> @btime test2($a, 200)
  66.030 ns (0 allocations: 0 bytes)

julia> @btime test_nt!($a, 200)
  162.201 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.2.0-DEV.377
Commit 4b01e6a60f (2019-02-26 16:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

(The non-temporal version is slower here because it evicts the stored data from the cache.)

Yes, maybe I should have made that clearer: the non-temporal write bypasses the cache, which is useful if you’re writing large amounts of data. I presume it will also be beneficial when writing smaller amounts of data that you don’t need or want cached (i.e. you won’t be using it again soon).

The @btime above for 200 elements ends up running the same test millions of times, writing to the same memory space, so caching is helpful. If instead we write to a large range of memory, but still n = 200 elements at a time, the non-temporal version is faster:

julia> A = [zeros(Float64, 256) for n = 1:10_000];

julia> @btime for i=1:length($A); test2($A[i], 200); end
  1.438 ms (0 allocations: 0 bytes)

julia> @btime for i=1:length($A); test_nt!($A[i], 200); end
  665.409 μs (0 allocations: 0 bytes)

We can see the effect of caching by instead creating A as a repeated (aliased) vector. Note the roughly 8x improvement for test2, while the non-temporal version stays about the same:

julia> A = fill(zeros(Float64, 256), 10_000);

julia> @btime for i=1:length($A); test2($A[i], 200); end
  186.484 μs (0 allocations: 0 bytes)

julia> @btime for i=1:length($A); test_nt!($A[i], 200); end
  625.263 μs (0 allocations: 0 bytes)

Great benchmarks!

I bet that could help a native-Julia BLAS implementation, and be easier than messing with prefetch statements (which seemed very finicky with respect to timing between architectures).
Elements from one of the matrices are normally broadcast, so perhaps non-temporal prefetches would also be appropriate.

I never realized (or imagined) how much that could help.
Whenever you’re working with lots of data, there are probably many situations where non-temporal stores would be appropriate.

In the test_nt! function:

  1. Is the @simd macro needed? If so, why?

  2. Should array a be allocated using the SIMD.jl allocator valloc so the memory is aligned? I was having some problems using vstorent when I didn’t do this.

Thanks!

  1. @simd is unnecessary.
  2. Yes, it has to be aligned. See the definition here. If you’d rather not valloc, you could instead use vstore(v, arr, i, Val{false}, Val{true}) (see the sketch at the end of this post). From what I have read, unaligned vs aligned makes no performance difference on recent architectures, but it did on older ones. I don’t like segfaults, so I stick to unaligned loads/stores.

Arrays above a certain size are 64-byte aligned. I don’t recall the exact size, but

alignment(N) = UInt(pointer(Array{Float64}(undef, N))) % 64

always returns 0 for N = 250, yet returns random multiples of 10 for N = 200.
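
For example, applied to the loop from earlier in the thread, the unaligned non-temporal variant from point 2 would look roughly like this (just a sketch on my part, and as discussed below it may not perform as well as the aligned version):

using SIMD

function test_nt_unaligned!(a, n)
    v = Vec{4,Float64}((0, 1, 2, 3))
    @inbounds for i = 1:4:n-3
        # Val{false} = unaligned, Val{true} = non-temporal
        vstore(i + v, a, i, Val{false}, Val{true})
    end
    # handle the remaining 0-3 elements with scalar stores
    @inbounds for i = (n & ~3) + 1:n
        a[i] = i
    end
end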

Great. Thanks!

I didn’t think so, but @simd produces more efficient code on my machine – consistently about 15% faster. Looking at the native code, it’s basically doing the same thing, but the @simd version shuffles around the code a bit. I haven’t analyzed it in detail, but I’m guessing that it manages to separate dependent instructions for more efficient pipelining.

Good point about aligning and valloc! I didn’t have success with vstore(i + v, a, i, Val{false}, Val{true}) though; for me (on an AVX2 Skylake) it produces code that is just as slow as the ordinary temporal version (test2 above).

Alright, for posterity, and since the documentation for valloc seems quite sparse, here’s the version I ended up with. After changing to use a pointer, @simd no longer makes a difference for me. I also optimized the iterator according to our recent discussion. Finally, with valloc there’s no need to handle the remaining elements separately, since there’s extra room at the end to write one more full vector.

function test_nt!(a::SubArray, n)
    v = Vec{4,Float64}((0, 1, 2, 3))
    for i = 0:(n-1)>>2
        vstorent(4i + 1 + v, pointer(a, 4i + 1))
    end
end

a = valloc(Float64, 8, 199)
test_nt!(a, 199)

Wouldn’t this be the correct upper bound calculation?

julia> u(n) = (n-1) >> 2
u (generic function with 1 method)

julia> 4 .* u.(195:201) .+ 1
7-element Array{Int64,1}:
 193
 193
 197
 197
 197
 197
 201

197 would result in calculating indices (197,198,199,200).

Interestingly, I can confirm that the non-temporal stores do seem to require alignment for performance (I also made sure that vstorea results in the same performance as test2 on an AVX2 computer).

You’re absolutely correct, thanks for pointing that out! Updated above.

Thank you all for the insightful responses! It is all much clearer to me now. For my machine, I will then use the Int32 → Float64 conversion.