Iterating over range is slower than while loop

using BenchmarkTools, SIMD

function test(x)
    N = 8

    c = Vec{N, Float32}(0)
    lane = VecRange{N}(0)

    @inbounds @fastmath for i in 1:N:length(x)
        c += x[lane + i]
    end
    return sum(c)
end

function test4(x)
    N = 8

    c = Vec{N, Float32}(0)
    lane = VecRange{N}(0)

    i = 1

    @inbounds @fastmath while i ≀ length(x)
        c += x[lane + i]
        i += N
    end
    
    return sum(c)
end

x = rand(Float32, 800)

On Julia 1.9 and 1.8.5 the while loop takes 70ns and the for loop 82ns. Looking at the llvm/native code the main difference seems to be that the for loop code calls steprange_last and it doesn’t get inlined whilst the while loop doesn’t:

top:
  %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
  %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1 
  %3 = load i64, i64* %2, align 8
  %4 = call i64 @j_steprange_last_2487(i64 signext 1, i64 signext 8, i64 signext %3) #0
  %5 = icmp slt i64 %4, 1
  br i1 %5, label %L84, label %L18.preheader

top:
  %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
  %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1 
  %3 = load i64, i64* %2, align 8
  %.not9 = icmp eq i64 %3, 0
  br i1 %.not9, label %L64, label %L10.lr.ph

If I replace length(x) with 800 in the for loop, the performance matches the while loop and we can see that there is no call to steprange_last made in the function.
I also tested using Float64 instead but the gap remained.

So the problem seems to be that the call to steprange_last is not inlined, is there any way to fix this?

Β 

Here’s an example not using SIMD (using eachindex(x) or 1:length(x) seem equally bad)

function test7(x)
    c = 0

    i = 1

    while i ≀ length(x)
        c += x[i]

        i += 1
    end

    return c
end

function test8(x)
    c = 0

    for i in eachindex(x)
        c += x[i]
    end

    return c
end

Without inbounds the gap is even bigger with the while loop taking 630ns and the for loop 2.1 microseconds. Adding @inbounds to the for loop brings it down to 630ns.
I don’t understand why the llvm code for the non inbounds for loop is so different or why the while loop doesn’t have the same problem.

For loop without @inbounds (the actual loop part)

L43:                                              ; preds = %idxend  %7 = add nuw nsw i64 %value_phi333, 1
  %8 = bitcast double %value_phi9 to i64
  br label %idxend

idxend:                                           ; preds = %L43, %idxend.lr.ph
  %9 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi333, %L43 ]       
  %value_phi534 = phi i1 [ true, %idxend.lr.ph ], [ false, %L43 ]  
  %value_phi333 = phi i64 [ 1, %idxend.lr.ph ], [ %7, %L43 ]
  %.sroa.021.032 = phi i64 [ 0, %idxend.lr.ph ], [ %8, %L43 ]      
  %10 = getelementptr inbounds double, double* %6, i64 %9
  %11 = load double, double* %10, align 8
  %12 = bitcast i64 %.sroa.021.032 to double
  %13 = sitofp i64 %.sroa.021.032 to double
  %.pn = select i1 %value_phi534, double %13, double %12
  %value_phi9 = fadd double %.pn, %11
  %.not = icmp eq i64 %value_phi333, %4
  br i1 %.not, label %union_move, label %L43

With @inbounds

L43:                                              ; preds = %L43, %L13.preheader
  %value_phi935 = phi double [ %value_phi9, %L43 ], [ %value_phi932, %L13.preheader ]
  %value_phi334 = phi i64 [ %8, %L43 ], [ 1, %L13.preheader ]      
  %8 = add i64 %value_phi334, 1
  %9 = getelementptr inbounds double, double* %6, i64 %value_phi334  %10 = load double, double* %9, align 8
  %value_phi9 = fadd double %value_phi935, %10
  %.not.not25 = icmp eq i64 %8, %4
  br i1 %.not.not25, label %union_move, label %L43

In the first example, it’s not an issue of inlining, it’s an issue of compile-time execution. N=8 is in the scope, so if you changed length(x) to 800, you have a constant 1:8:800, so the construction of that StepRange along with the steprange_last call can occur at compile-time. 1:8:length(x) on the other hand must wait until runtime to get the necessary inputs, so the constructor code is showing up. This necessarily takes longer than the while loop version because steprange_last is additionally computing the inclusive upper bound e.g. 1:3:9 is computed to 1:3:7; I assume this is for comparisons. I wouldn’t worry about this though, the construction is not occurring in the loop, so the overhead should not scale with x.

I’m prefacing this with saying that I’m not entirely sure if I’m right. The 2nd example seems to be suggesting that the compiler is automatically recognizing the while loop doesn’t need bounds checking, and if that’s the case I wouldn’t know how it’s deciding that. I suppose you could check by annotating @inbounds c += x[i] there and see if it doesn’t make a difference.

You should also provide the exact @code_llvm line, the function call’s input types absolutely affect what code is produced. So far I’m assuming something like rand(1000).

1 Like

The overhead of steprange_last is a problem for me because I plan on calling a similar function trillions of times. While the length isn’t changing it would be annoying to precompute this iterator and then pass it into a bunch of functions.

I should have made it clearer but I’m using x = rand(Float32, 800) everywhere. Using @inbounds on the while loop made no difference to the speed but looking at the llvm it doesn’t seem to be skipping the bounds checking

Without @inbounds the loop part is this

L9:                                               ; preds = %idxend, %idxend.peel
  %value_phi25 = phi i64 [ %14, %idxend ], [ 2, %idxend.peel ]
  %8 = phi float [ %value_phi5, %idxend ], [ %value_phi5.peel, %idxend.peel ]
  %9 = add nsw i64 %value_phi25, -1
  %10 = icmp ult i64 %9, %4
  br i1 %10, label %idxend, label %oob

oob:                                              ; preds = %L9
  %11 = alloca i64, align 8
  store i64 %value_phi25, i64* %11, align 8
  call void @ijl_bounds_error_ints({}* %1, i64* nonnull %11, i64 1)
  unreachable

idxend:                                           ; preds = %L9
  %12 = getelementptr inbounds float, float* %6, i64 %9
  %13 = load float, float* %12, align 4
  %value_phi5 = fadd float %8, %13
  %14 = add nuw nsw i64 %value_phi25, 1
  %.not.not = icmp ult i64 %value_phi25, %4
  br i1 %.not.not, label %L9, label %union_move

With @inbounds

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi22 = phi i64 [ %12, %L9 ], [ 2, %L9.lr.ph 
]
  %8 = phi float [ %value_phi5, %L9 ], [ %value_phi5.peel, %L9.lr.ph ]
  %9 = add nsw i64 %value_phi22, -1
  %10 = getelementptr inbounds float, float* %6, i64 %9
  %11 = load float, float* %10, align 4
  %value_phi5 = fadd float %8, %11
  %12 = add nuw nsw i64 %value_phi22, 1
  %exitcond = icmp eq i64 %value_phi22, %4
  br i1 %exitcond, label %union_move, label %L9 

Here is the full @code_llvm for x=rand(Float32, 800) on Julia 1.9.0-rc1

function test(x)
    N = 8

    c = Vec{N, Float32}(0)
    lane = VecRange{N}(0)

    @inbounds @fastmath for i in 1:N:length(x)
        c += x[lane + i]
    end
    return sum(c)
end
;  @ h:\Code\Renderer\simdVload.jl:3 within `test`
; Function Attrs: uwtable
define float @julia_test_3492({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ h:\Code\Renderer\simdVload.jl:9 within `test`
; β”Œ @ essentials.jl:10 within `length`
   %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
   %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
   %3 = load i64, i64* %2, align 8
; β””
; β”Œ @ range.jl:22 within `Colon`
; β”‚β”Œ @ range.jl:24 within `_colon`
; β”‚β”‚β”Œ @ range.jl:373 within `StepRange` @ range.jl:320
     %4 = call i64 @j_steprange_last_3495(i64 signext 1, i64 signext 8, i64 signext %3) #0
; β””β””β””
; β”Œ @ range.jl:887 within `iterate`
; β”‚β”Œ @ range.jl:659 within `isempty`
; β”‚β”‚β”Œ @ bool.jl:38 within `&`
     %5 = icmp slt i64 %4, 1
; β””β””β””
  br i1 %5, label %L84, label %L18.preheader

L18.preheader:                                    ; preds = %top
  %6 = bitcast {}* %0 to i8**
  %7 = load i8*, i8** %6, align 8
;  @ h:\Code\Renderer\simdVload.jl:11 within `test`
  br label %L18

L18:                                              ; preds = %L18, %L18.preheader
  %value_phi3 = phi i64 [ %12, %L18 ], [ 1, %L18.preheader ]
  %value_phi5 = phi <8 x float> [ %11, %L18 ], [ zeroinitializer, %L18.preheader ]
;  @ h:\Code\Renderer\simdVload.jl:10 within `test`
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:302 within `getindex`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:286 within `_pointer`
; β”‚β”‚β”Œ @ abstractarray.jl:1243 within `pointer`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:1247 within `_memory_offset`
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:88 within `*`
       %8 = shl i64 %value_phi3, 2
       %9 = add nsw i64 %8, -4
; β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”Œ @ pointer.jl:167 within `+`
      %10 = getelementptr i8, i8* %7, i64 %9
; β”‚β””β””β””
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 within `vload` @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:50
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:462 within `load`
; β”‚β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:471 within `macro expansion`
      %ptr.i = bitcast i8* %10 to <8 x float>*
      %res.i16 = load <8 x float>, <8 x float>* %ptr.i, align 4
; β””β””β””β””
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:259 within `add_fast`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:212 within `fadd`
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:221 within `macro expansion`
     %11 = fadd fast <8 x float> %res.i16, %value_phi5
; β””β””β””
;  @ h:\Code\Renderer\simdVload.jl:11 within `test`
; β”Œ @ range.jl:891 within `iterate`
; β”‚β”Œ @ promotion.jl:499 within `==`
    %.not = icmp eq i64 %value_phi3, %4
; β”‚β””
   %12 = add i64 %value_phi3, 8
; β””
  br i1 %.not, label %L84, label %L18

L84:                                              ; preds = %L18, %top
  %value_phi10 = phi <8 x float> [ zeroinitializer, %top ], [ %11, %L18 ]
;  @ h:\Code\Renderer\simdVload.jl:12 within `test`
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:475 within `sum`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:826 within `reduce_fadd`
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:842 within `macro expansion`
     %res.i = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %value_phi10)
; β””β””β””
  ret float %res.i
}
function test4(x)
    N = 8

    c = Vec{N, Float32}(0)
    lane = VecRange{N}(0)

    i = 1

    @inbounds @fastmath while i ≀ length(x)
        c += x[lane + i]
        i += N
    end
    
    return sum(c)
end
;  @ h:\Code\Renderer\simdVload.jl:26 within `test4`
; Function Attrs: uwtable
define float @julia_test4_3502({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ h:\Code\Renderer\simdVload.jl:34 within `test4`
; β”Œ @ essentials.jl:10 within `length`
   %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
   %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
   %3 = load i64, i64* %2, align 8
; β””
; β”Œ @ int.jl:488 within `<=`
   %.not9 = icmp eq i64 %3, 0
; β””
  br i1 %.not9, label %L62, label %L8.lr.ph

L8.lr.ph:                                         ; preds = %top
  %4 = bitcast {}* %0 to i8**
  %5 = load i8*, i8** %4, align 8
  br label %L8

L8:                                               ; preds = %L8, %L8.lr.ph
  %value_phi111 = phi <8 x float> [ zeroinitializer, %L8.lr.ph ], [ %9, %L8 ]
  %value_phi10 = phi i64 [ 1, %L8.lr.ph ], [ %10, %L8 ]
;  @ h:\Code\Renderer\simdVload.jl:35 within `test4`
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:302 within `getindex`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:286 within `_pointer`
; β”‚β”‚β”Œ @ abstractarray.jl:1243 within `pointer`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:1247 within `_memory_offset`
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:88 within `*`
       %6 = shl i64 %value_phi10, 2
       %7 = add i64 %6, -4
; β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”Œ @ pointer.jl:167 within `+`
      %8 = getelementptr i8, i8* %5, i64 %7
; β”‚β””β””β””
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 within `vload` @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:50
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:462 within `load`
; β”‚β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:471 within `macro expansion`
      %ptr.i = bitcast i8* %8 to <8 x float>*
      %res.i8 = load <8 x float>, <8 x float>* %ptr.i, align 4
; β””β””β””β””
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:259 within `add_fast`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:212 within `fadd`
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:221 within `macro expansion`
     %9 = fadd fast <8 x float> %res.i8, %value_phi111
; β””β””β””
;  @ h:\Code\Renderer\simdVload.jl:36 within `test4`
; β”Œ @ fastmath.jl:270 within `add_fast`
; β”‚β”Œ @ int.jl:87 within `+`
    %10 = add nuw nsw i64 %value_phi10, 1
; β””β””
;  @ h:\Code\Renderer\simdVload.jl:34 within `test4`
; β”Œ @ int.jl:488 within `<=`
   %exitcond = icmp eq i64 %value_phi10, %3
; β””
  br i1 %exitcond, label %L62, label %L8

L62:                                              ; preds = %L8, %top
  %value_phi1.lcssa = phi <8 x float> [ zeroinitializer, %top ], [ %9, %L8 ]
;  @ h:\Code\Renderer\simdVload.jl:39 within `test4`
; β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:475 within `sum`
; β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:826 within `reduce_fadd`
; β”‚β”‚β”Œ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:842 within `macro expansion`
     %res.i = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %value_phi1.lcssa)
; β””β””β””
  ret float %res.i
}
function test7(x)
    c = 0

    i = 1

    while i ≀ length(x)
        c += x[i]

        i += 1
    end

    return c
end
;  @ h:\Code\Renderer\simdVload.jl:122 within `test7`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test7_3511([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ essentials.jl:10 within `length`
   %2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
   %3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
   %4 = load i64, i64* %3, align 8
; β””
; β”Œ @ int.jl:488 within `<=`
   %.not23 = icmp eq i64 %4, 0
; β””
  br i1 %.not23, label %union_move6, label %idxend.peel

idxend.peel:                                      ; preds = %top
  %5 = bitcast {}* %1 to float**
  %6 = load float*, float** %5, align 8
;  @ h:\Code\Renderer\simdVload.jl:128 within `test7`
; β”Œ @ essentials.jl:13 within `getindex`
   %7 = load float, float* %6, align 4
; β””
;  @ h:\Code\Renderer\simdVload.jl within `test7`
  %value_phi5.peel = fadd float %7, 0.000000e+00
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ int.jl:488 within `<=`
   %.not.peel = icmp eq i64 %4, 1
; β””
  br i1 %.not.peel, label %union_move, label %L9

L9:                                               ; preds = %idxend, %idxend.peel
  %value_phi25 = phi i64 [ %14, %idxend ], [ 2, %idxend.peel ]
  %8 = phi float [ %value_phi5, %idxend ], [ %value_phi5.peel, %idxend.peel ]
;  @ h:\Code\Renderer\simdVload.jl:128 within `test7`
; β”Œ @ essentials.jl:13 within `getindex`
   %9 = add nsw i64 %value_phi25, -1
   %10 = icmp ult i64 %9, %4
   br i1 %10, label %idxend, label %oob

oob:                                              ; preds = %L9
   %11 = alloca i64, align 8
   store i64 %value_phi25, i64* %11, align 8
   call void @ijl_bounds_error_ints({}* %1, i64* nonnull %11, i64 1)
   unreachable

idxend:                                           ; preds = %L9
   %12 = getelementptr inbounds float, float* %6, i64 %9
   %13 = load float, float* %12, align 4
; β””
;  @ h:\Code\Renderer\simdVload.jl within `test7`
  %value_phi5 = fadd float %8, %13
;  @ h:\Code\Renderer\simdVload.jl:130 within `test7`
; β”Œ @ int.jl:87 within `+`
   %14 = add nuw nsw i64 %value_phi25, 1
; β””
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ int.jl:488 within `<=`
   %.not.not = icmp ult i64 %value_phi25, %4
; β””
  br i1 %.not.not, label %L9, label %union_move

post_union_move:                                  ; preds = %union_move6, %union_move
  %tindex_phi.lcssa37 = phi i8 [ 2, %union_move6 ], [ 1, %union_move ]
;  @ h:\Code\Renderer\simdVload.jl:133 within `test7`
  %15 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi.lcssa37, 1
  ret { {}*, i8 } %15

union_move:                                       ; preds = %idxend, %idxend.peel
  %.lcssa.in = phi float [ %value_phi5.peel, %idxend.peel ], [ %value_phi5, %idxend ]
  %16 = bitcast [8 x i8]* %0 to float*
  store float %.lcssa.in, float* %16, align 8
  br label %post_union_move

union_move6:                                      ; preds = %top
  %.sroa_cast15 = bitcast [8 x i8]* %0 to i32*
  store i32 0, i32* %.sroa_cast15, align 8
  %.sroa_idx = getelementptr inbounds [8 x i8], [8 x i8]* %0, i64 0, i64 4
  %.sroa_cast16 = bitcast i8* %.sroa_idx to i32*
  store i32 0, i32* %.sroa_cast16, align 4
  br label %post_union_move
}
function test7(x)
    c = 0

    i = 1

    while i ≀ length(x)
        @inbounds c += x[i]

        i += 1
    end

    return c
end
;  @ h:\Code\Renderer\simdVload.jl:122 within `test7`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test7_3517([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ essentials.jl:10 within `length`
   %2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
   %3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
   %4 = load i64, i64* %3, align 8
; β””
; β”Œ @ int.jl:488 within `<=`
   %.not20 = icmp eq i64 %4, 0
; β””
  br i1 %.not20, label %union_move6, label %L9.lr.ph

L9.lr.ph:                                         ; preds = %top
  %5 = bitcast {}* %1 to float**
  %6 = load float*, float** %5, align 8
;  @ h:\Code\Renderer\simdVload.jl:128 within `test7`
; β”Œ @ essentials.jl:13 within `getindex`
   %7 = load float, float* %6, align 4
; β””
;  @ h:\Code\Renderer\simdVload.jl within `test7`
  %value_phi5.peel = fadd float %7, 0.000000e+00
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ int.jl:488 within `<=`
   %exitcond.peel = icmp eq i64 %4, 1
; β””
  br i1 %exitcond.peel, label %union_move, label %L9

L9:                                               ; preds = %L9, %L9.lr.ph
  %value_phi22 = phi i64 [ %12, %L9 ], [ 2, %L9.lr.ph ]
  %8 = phi float [ %value_phi5, %L9 ], [ %value_phi5.peel, %L9.lr.ph ]
;  @ h:\Code\Renderer\simdVload.jl:128 within `test7`
; β”Œ @ essentials.jl:13 within `getindex`
   %9 = add nsw i64 %value_phi22, -1
   %10 = getelementptr inbounds float, float* %6, i64 %9
   %11 = load float, float* %10, align 4
; β””
;  @ h:\Code\Renderer\simdVload.jl within `test7`
  %value_phi5 = fadd float %8, %11
;  @ h:\Code\Renderer\simdVload.jl:130 within `test7`
; β”Œ @ int.jl:87 within `+`
   %12 = add nuw nsw i64 %value_phi22, 1
; β””
;  @ h:\Code\Renderer\simdVload.jl:127 within `test7`
; β”Œ @ int.jl:488 within `<=`
   %exitcond = icmp eq i64 %value_phi22, %4
; β””
  br i1 %exitcond, label %union_move, label %L9

post_union_move:                                  ; preds = %union_move6, %union_move
  %tindex_phi.lcssa31 = phi i8 [ 2, %union_move6 ], [ 1, %union_move ]
;  @ h:\Code\Renderer\simdVload.jl:133 within `test7`
  %13 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi.lcssa31, 1
  ret { {}*, i8 } %13

union_move:                                       ; preds = %L9, %L9.lr.ph
  %.lcssa.in = phi float [ %value_phi5.peel, %L9.lr.ph ], [ %value_phi5, %L9 ]
  %14 = bitcast [8 x i8]* %0 to float*
  store float %.lcssa.in, float* %14, align 8
  br label %post_union_move

union_move6:                                      ; preds = %top
  %.sroa_cast15 = bitcast [8 x i8]* %0 to i32*
  store i32 0, i32* %.sroa_cast15, align 8
  %.sroa_idx = getelementptr inbounds [8 x i8], [8 x i8]* %0, i64 0, i64 4
  %.sroa_cast16 = bitcast i8* %.sroa_idx to i32*
  store i32 0, i32* %.sroa_cast16, align 4
  br label %post_union_move
}
function test8(x)
    c = 0

    for i in eachindex(x)
        c += x[i]
    end

    return c
end
;  @ h:\Code\Renderer\simdVload.jl:132 within `test8`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test8_3514([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
;  @ h:\Code\Renderer\simdVload.jl:135 within `test8`
; β”Œ @ abstractarray.jl:314 within `eachindex`
; β”‚β”Œ @ abstractarray.jl:133 within `axes1`
; β”‚β”‚β”Œ @ abstractarray.jl:98 within `axes`
; β”‚β”‚β”‚β”Œ @ array.jl:149 within `size`
      %2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
      %3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
      %4 = load i64, i64* %3, align 8
; β””β””β””β””
; β”Œ @ range.jl:887 within `iterate`
; β”‚β”Œ @ range.jl:662 within `isempty`
; β”‚β”‚β”Œ @ operators.jl:369 within `>`
; β”‚β”‚β”‚β”Œ @ int.jl:83 within `<`
      %.not.not = icmp eq i64 %4, 0
; β””β””β””β””
  br i1 %.not.not, label %union_move15, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %top
  %5 = bitcast {}* %1 to float**
  %6 = load float*, float** %5, align 8
;  @ h:\Code\Renderer\simdVload.jl:136 within `test8`
; β”Œ @ essentials.jl:13 within `getindex`
   br label %idxend

L43:                                              ; preds = %idxend
; β””
;  @ h:\Code\Renderer\simdVload.jl:137 within `test8`
; β”Œ @ range.jl:891 within `iterate`
   %7 = add nuw nsw i64 %value_phi347, 1
; β””
;  @ h:\Code\Renderer\simdVload.jl:135 within `test8`
; β”Œ @ range.jl:887 within `iterate`
   %8 = bitcast float %value_phi9 to i32
; β””
;  @ h:\Code\Renderer\simdVload.jl:136 within `test8`
; β”Œ @ essentials.jl:13 within `getindex`
   br label %idxend

idxend:                                           ; preds = %L43, %idxend.lr.ph
   %9 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi347, %L43 ]
   %value_phi548 = phi i1 [ true, %idxend.lr.ph ], [ false, %L43 ]
   %value_phi347 = phi i64 [ 1, %idxend.lr.ph ], [ %7, %L43 ]
   %.sroa.030.046 = phi i32 [ 0, %idxend.lr.ph ], [ %8, %L43 ]
   %10 = getelementptr inbounds float, float* %6, i64 %9
   %11 = load float, float* %10, align 4
; β””
  %12 = bitcast i32 %.sroa.030.046 to float
  %13 = uitofp i32 %.sroa.030.046 to float
  %.pn = select i1 %value_phi548, float %13, float %12
;  @ h:\Code\Renderer\simdVload.jl within `test8`
  %value_phi9 = fadd float %.pn, %11
;  @ h:\Code\Renderer\simdVload.jl:137 within `test8`
; β”Œ @ range.jl:891 within `iterate`
; β”‚β”Œ @ promotion.jl:499 within `==`
    %.not = icmp eq i64 %value_phi347, %4
; β””β””
  br i1 %.not, label %union_move, label %L43

post_union_move:                                  ; preds = %union_move15, %union_move
  %tindex_phi1344 = phi i8 [ 2, %union_move15 ], [ 1, %union_move ]
;  @ h:\Code\Renderer\simdVload.jl:139 within `test8`
  %14 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi1344, 1
  ret { {}*, i8 } %14

union_move:                                       ; preds = %idxend
  %15 = bitcast [8 x i8]* %0 to float*
  store float %value_phi9, float* %15, align 8
  br label %post_union_move

union_move15:                                     ; preds = %top
  %16 = bitcast [8 x i8]* %0 to i64*
  store i64 0, i64* %16, align 8
  br label %post_union_move
}

It makes more sense that the while loop has bounds-checking without the @inbounds, but now I’m confused.

I would have expected the inbounds check to make a timing difference in test7, I mean the only difference in the @code_llvm is that it adds a comparison and branch to possibly throwing an error.

I’m not sure why the test8 for loop version would be so much slower, the @code_llvm not only looks shorter it weirdly does not have bounds check code like test7 does (maybe eachindex is smart?). It does look like test8 has a integer-to-float type conversion in the loop that test7 doesn’t (uitop) before the fadd. I think this might be rooted in your c=0, which makes c type-unstable because you may or may not add Float32s to an Int64. But I could be mistaken because I can’t read LLVM well at all (like why all the i32 if c starts at Int64) and the test7 while loop also starts with c=0, which I would expect to also be a type instability requiring type conversions. If you want to fix that, you just do c = zero(eltype(x)), that should compile to the stable type of zero.

Could you edit your post to include your timing calls and results too? I assume you’re using @btime and interpolating a precomputed input array so it isn’t part of the benchmark.

To clarify, when I said

Using @inbounds on the while loop made no difference to the speed but looking at the llvm it doesn’t seem to be skipping the bounds checking

I meant that the version not using @inbounds wasn’t skipping bounds checking, the version using @inbounds does seem to be skiping it.

I think the lack of a timing difference is probably because it’s quite cheap to do the branching especially when it should be able to predict that we get the same result every time.
The problem is due to c = 0, changing to c = zero(eltype(x)) fixes the slowness.

Whatever is causing the for loop to be slow with c=0 isn’t relevant to the case I care about which is iterating over 1:N:length(x) so I’m not going to spend any more time thinking about it.

Now when iterating over 1:N:length(x), I’ve managed to get it running 1ns faster than the while loop by adding @inline to steprange_last and to the constructors of StepRange. I’m not sure if there’s a downside to this but if not I might make a pr to fix this.

1 Like

It seems I can’t edit my old posts, but here are the benchmarks where x = rand(Float32, 800) (with c=0 i.e. same code for which I posted the llvm)

julia> @benchmark test($x)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  84.091 ns … 287.397 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     84.194 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   86.212 ns Β±  14.460 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–‚                                                           ▁
  β–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–‡β–†β–‡β–…β–…β–…β–ƒβ–„β–ƒβ–„β–β–ƒβ–β–ƒβ–β–„β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–ƒβ–β–ƒβ–β–β–β–β–β–β–β–ƒβ–β–β–β–β–ƒβ–β–β–… β–ˆ
  84.1 ns       Histogram: log(frequency) by time       210 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test4($x)
BenchmarkTools.Trial: 10000 samples with 978 evaluations.
 Range (min … max):  70.552 ns … 276.074 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     71.268 ns               β”Š GC (median):    0.00%        
 Time  (mean Β± Οƒ):   72.845 ns Β±  11.458 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–…β–‡ ▆▁▁▁                                                     ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–†β–†β–†β–‡β–‡β–‡β–…β–…β–†β–†β–†β–„β–ƒβ–„β–…β–…β–„β–ƒβ–„β–ƒβ–‚β–ƒβ–‚β–„β–‚β–ƒβ–‚β–ƒβ–„β–‚β–ƒβ–„β–ƒβ–‚β–‚β–ƒβ–…β–‡β–‡β–†β–†β–ƒβ–„β–„β–‚β–ƒβ–…β–„β–‚ β–ˆ
  70.6 ns       Histogram: log(frequency) by time      95.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark test7_noinbounds($x)
BenchmarkTools.Trial: 10000 samples with 171 evaluations.
 Range (min … max):  630.409 ns …  4.875 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     630.409 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   635.315 ns Β± 67.407 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–…      ▃▁▃▁▁                                                ▁
  β–ˆβ–ˆβ–…β–…β–„β–„β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–†β–†β–†β–…β–†β–†β–…β–…β–†β–…β–…β–…β–…β–„β–†β–†β–…β–…β–…β–„β–…β–…β–…β–„β–„β–…β–„β–„β–ƒβ–„β–…β–„β–β–„β–„β–„β–„β–„β–„β–β–…β–…β–„ β–ˆ
  630 ns        Histogram: log(frequency) by time       697 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test7_inbounds($x)
BenchmarkTools.Trial: 10000 samples with 173 evaluations.
 Range (min … max):  628.902 ns …  1.970 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     630.058 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   644.193 ns Β± 97.365 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–„β–‚β–                                                         ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–‡β–†β–†β–†β–†β–†β–†β–†β–†β–…β–†β–†β–…β–„β–…β–„β–ƒβ–…β–„β–†β–„β–ƒβ–„β–„β–ƒβ–„β–ƒβ–ƒβ–β–ƒβ–ƒβ–β–β–β–β–ƒβ–β–β–„β–β–β–β–β–β–β–β–β–β–β–ƒ β–ˆ
  629 ns        Histogram: log(frequency) by time      1.01 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test8($x)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.111 ΞΌs …  10.022 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.111 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.192 ΞΌs Β± 471.679 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆ  β–‚                                                      ▁ ▁
  β–ˆβ–†β–„β–ˆβ–†β–„β–„β–…β–ƒβ–†β–ƒβ–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–ƒβ–β–β–ƒβ–β–β–β–ƒβ–ƒβ–ƒβ–β–„β–ƒβ–„β–ƒβ–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ β–ˆ
  2.11 ΞΌs      Histogram: log(frequency) by time      5.29 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

To summarise, the construction of a steprange doesn’t get inlined and so adds a couple nanoseconds of overhead. I made a pr proposing a fix here, Inline StepRange construction by Zentrik Β· Pull Request #49270 Β· JuliaLang/julia Β· GitHub.