# Iterating over range is slower than while loop

``````using BenchmarkTools, SIMD

function test(x)
N = 8

c = Vec{N, Float32}(0)
lane = VecRange{N}(0)

@inbounds @fastmath for i in 1:N:length(x)
c += x[lane + i]
end
return sum(c)
end

function test4(x)
N = 8

c = Vec{N, Float32}(0)
lane = VecRange{N}(0)

i = 1

@inbounds @fastmath while i β€ length(x)
c += x[lane + i]
i += N
end

return sum(c)
end

x = rand(Float32, 800)
``````

On Julia 1.9 and 1.8.5 the while loop takes 70ns and the for loop 82ns. Looking at the llvm/native code the main difference seems to be that the for loop code calls `steprange_last` and it doesnβt get inlined whilst the while loop doesnβt:

``````top:
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
%4 = call i64 @j_steprange_last_2487(i64 signext 1, i64 signext 8, i64 signext %3) #0
%5 = icmp slt i64 %4, 1
br i1 %5, label %L84, label %L18.preheader

top:
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
%.not9 = icmp eq i64 %3, 0
br i1 %.not9, label %L64, label %L10.lr.ph
``````

If I replace `length(x)` with 800 in the for loop, the performance matches the while loop and we can see that there is no call to `steprange_last` made in the function.
I also tested using Float64 instead but the gap remained.

So the problem seems to be that the call to `steprange_last` is not inlined, is there any way to fix this?

Β

Hereβs an example not using SIMD (using `eachindex(x)` or `1:length(x)` seem equally bad)

``````function test7(x)
c = 0

i = 1

while i β€ length(x)
c += x[i]

i += 1
end

return c
end

function test8(x)
c = 0

for i in eachindex(x)
c += x[i]
end

return c
end
``````

Without inbounds the gap is even bigger with the while loop taking 630ns and the for loop 2.1 microseconds. Adding `@inbounds` to the for loop brings it down to 630ns.
I donβt understand why the llvm code for the non inbounds for loop is so different or why the while loop doesnβt have the same problem.

For loop without `@inbounds` (the actual loop part)

``````L43:                                              ; preds = %idxend  %7 = add nuw nsw i64 %value_phi333, 1
%8 = bitcast double %value_phi9 to i64
br label %idxend

idxend:                                           ; preds = %L43, %idxend.lr.ph
%9 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi333, %L43 ]
%value_phi534 = phi i1 [ true, %idxend.lr.ph ], [ false, %L43 ]
%value_phi333 = phi i64 [ 1, %idxend.lr.ph ], [ %7, %L43 ]
%.sroa.021.032 = phi i64 [ 0, %idxend.lr.ph ], [ %8, %L43 ]
%10 = getelementptr inbounds double, double* %6, i64 %9
%11 = load double, double* %10, align 8
%12 = bitcast i64 %.sroa.021.032 to double
%13 = sitofp i64 %.sroa.021.032 to double
%.pn = select i1 %value_phi534, double %13, double %12
%value_phi9 = fadd double %.pn, %11
%.not = icmp eq i64 %value_phi333, %4
br i1 %.not, label %union_move, label %L43
``````

With `@inbounds`

``````L43:                                              ; preds = %L43, %L13.preheader
%value_phi935 = phi double [ %value_phi9, %L43 ], [ %value_phi932, %L13.preheader ]
%value_phi334 = phi i64 [ %8, %L43 ], [ 1, %L13.preheader ]
%8 = add i64 %value_phi334, 1
%9 = getelementptr inbounds double, double* %6, i64 %value_phi334  %10 = load double, double* %9, align 8
%value_phi9 = fadd double %value_phi935, %10
%.not.not25 = icmp eq i64 %8, %4
br i1 %.not.not25, label %union_move, label %L43
``````

In the first example, itβs not an issue of inlining, itβs an issue of compile-time execution. `N=8` is in the scope, so if you changed `length(x)` to `800`, you have a constant `1:8:800`, so the construction of that `StepRange` along with the `steprange_last` call can occur at compile-time. `1:8:length(x)` on the other hand must wait until runtime to get the necessary inputs, so the constructor code is showing up. This necessarily takes longer than the `while` loop version because `steprange_last` is additionally computing the inclusive upper bound e.g. `1:3:9` is computed to `1:3:7`; I assume this is for comparisons. I wouldnβt worry about this though, the construction is not occurring in the loop, so the overhead should not scale with `x`.

Iβm prefacing this with saying that Iβm not entirely sure if Iβm right. The 2nd example seems to be suggesting that the compiler is automatically recognizing the `while` loop doesnβt need bounds checking, and if thatβs the case I wouldnβt know how itβs deciding that. I suppose you could check by annotating `@inbounds c += x[i]` there and see if it doesnβt make a difference.

You should also provide the exact `@code_llvm` line, the function callβs input types absolutely affect what code is produced. So far Iβm assuming something like `rand(1000)`.

1 Like

The overhead of `steprange_last` is a problem for me because I plan on calling a similar function trillions of times. While the length isnβt changing it would be annoying to precompute this iterator and then pass it into a bunch of functions.

I should have made it clearer but Iβm using `x = rand(Float32, 800)` everywhere. Using `@inbounds` on the while loop made no difference to the speed but looking at the llvm it doesnβt seem to be skipping the bounds checking

Without `@inbounds` the loop part is this

``````L9:                                               ; preds = %idxend, %idxend.peel
%value_phi25 = phi i64 [ %14, %idxend ], [ 2, %idxend.peel ]
%8 = phi float [ %value_phi5, %idxend ], [ %value_phi5.peel, %idxend.peel ]
%9 = add nsw i64 %value_phi25, -1
%10 = icmp ult i64 %9, %4
br i1 %10, label %idxend, label %oob

oob:                                              ; preds = %L9
%11 = alloca i64, align 8
store i64 %value_phi25, i64* %11, align 8
call void @ijl_bounds_error_ints({}* %1, i64* nonnull %11, i64 1)
unreachable

idxend:                                           ; preds = %L9
%12 = getelementptr inbounds float, float* %6, i64 %9
%13 = load float, float* %12, align 4
%value_phi5 = fadd float %8, %13
%14 = add nuw nsw i64 %value_phi25, 1
%.not.not = icmp ult i64 %value_phi25, %4
br i1 %.not.not, label %L9, label %union_move
``````

With `@inbounds`

``````L9:                                               ; preds = %L9, %L9.lr.ph
%value_phi22 = phi i64 [ %12, %L9 ], [ 2, %L9.lr.ph
]
%8 = phi float [ %value_phi5, %L9 ], [ %value_phi5.peel, %L9.lr.ph ]
%9 = add nsw i64 %value_phi22, -1
%10 = getelementptr inbounds float, float* %6, i64 %9
%11 = load float, float* %10, align 4
%value_phi5 = fadd float %8, %11
%12 = add nuw nsw i64 %value_phi22, 1
%exitcond = icmp eq i64 %value_phi22, %4
br i1 %exitcond, label %union_move, label %L9
``````

Here is the full `@code_llvm` for `x=rand(Float32, 800)` on Julia 1.9.0-rc1

``````function test(x)
N = 8

c = Vec{N, Float32}(0)
lane = VecRange{N}(0)

@inbounds @fastmath for i in 1:N:length(x)
c += x[lane + i]
end
return sum(c)
end
``````
``````;  @ h:\Code\Renderer\simdVload.jl:3 within `test`
; Function Attrs: uwtable
define float @julia_test_3492({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; β @ essentials.jl:10 within `length`
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
; β
; β @ range.jl:22 within `Colon`
; ββ @ range.jl:24 within `_colon`
; βββ @ range.jl:373 within `StepRange` @ range.jl:320
%4 = call i64 @j_steprange_last_3495(i64 signext 1, i64 signext 8, i64 signext %3) #0
; βββ
; β @ range.jl:887 within `iterate`
; ββ @ range.jl:659 within `isempty`
; βββ @ bool.jl:38 within `&`
%5 = icmp slt i64 %4, 1
; βββ
br i1 %5, label %L84, label %L18.preheader

%6 = bitcast {}* %0 to i8**
%7 = load i8*, i8** %6, align 8
br label %L18

L18:                                              ; preds = %L18, %L18.preheader
%value_phi3 = phi i64 [ %12, %L18 ], [ 1, %L18.preheader ]
%value_phi5 = phi <8 x float> [ %11, %L18 ], [ zeroinitializer, %L18.preheader ]
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:302 within `getindex`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:286 within `_pointer`
; βββ @ abstractarray.jl:1243 within `pointer`
; ββββ @ abstractarray.jl:1247 within `_memory_offset`
; βββββ @ int.jl:88 within `*`
%8 = shl i64 %value_phi3, 2
%9 = add nsw i64 %8, -4
; βββββ
; ββββ @ pointer.jl:167 within `+`
%10 = getelementptr i8, i8* %7, i64 %9
; ββββ
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 within `vload` @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:50
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:462 within `load`
; ββββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:471 within `macro expansion`
%ptr.i = bitcast i8* %10 to <8 x float>*
%res.i16 = load <8 x float>, <8 x float>* %ptr.i, align 4
; ββββ
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:259 within `add_fast`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:212 within `fadd`
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:221 within `macro expansion`
%11 = fadd fast <8 x float> %res.i16, %value_phi5
; βββ
; β @ range.jl:891 within `iterate`
; ββ @ promotion.jl:499 within `==`
%.not = icmp eq i64 %value_phi3, %4
; ββ
%12 = add i64 %value_phi3, 8
; β
br i1 %.not, label %L84, label %L18

L84:                                              ; preds = %L18, %top
%value_phi10 = phi <8 x float> [ zeroinitializer, %top ], [ %11, %L18 ]
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:475 within `sum`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:826 within `reduce_fadd`
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:842 within `macro expansion`
%res.i = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %value_phi10)
; βββ
ret float %res.i
}
``````
``````function test4(x)
N = 8

c = Vec{N, Float32}(0)
lane = VecRange{N}(0)

i = 1

@inbounds @fastmath while i β€ length(x)
c += x[lane + i]
i += N
end

return sum(c)
end
``````
``````;  @ h:\Code\Renderer\simdVload.jl:26 within `test4`
; Function Attrs: uwtable
define float @julia_test4_3502({}* noundef nonnull align 16 dereferenceable(40) %0) #0 {
top:
; β @ essentials.jl:10 within `length`
%1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
%2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%3 = load i64, i64* %2, align 8
; β
; β @ int.jl:488 within `<=`
%.not9 = icmp eq i64 %3, 0
; β
br i1 %.not9, label %L62, label %L8.lr.ph

L8.lr.ph:                                         ; preds = %top
%4 = bitcast {}* %0 to i8**
%5 = load i8*, i8** %4, align 8
br label %L8

L8:                                               ; preds = %L8, %L8.lr.ph
%value_phi111 = phi <8 x float> [ zeroinitializer, %L8.lr.ph ], [ %9, %L8 ]
%value_phi10 = phi i64 [ 1, %L8.lr.ph ], [ %10, %L8 ]
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:302 within `getindex`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:286 within `_pointer`
; βββ @ abstractarray.jl:1243 within `pointer`
; ββββ @ abstractarray.jl:1247 within `_memory_offset`
; βββββ @ int.jl:88 within `*`
%6 = shl i64 %value_phi10, 2
%7 = add i64 %6, -4
; βββββ
; ββββ @ pointer.jl:167 within `+`
%8 = getelementptr i8, i8* %5, i64 %7
; ββββ
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 within `vload` @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:49 @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\arrayops.jl:50
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:462 within `load`
; ββββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:471 within `macro expansion`
%ptr.i = bitcast i8* %8 to <8 x float>*
%res.i8 = load <8 x float>, <8 x float>* %ptr.i, align 4
; ββββ
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:259 within `add_fast`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:212 within `fadd`
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:221 within `macro expansion`
%9 = fadd fast <8 x float> %res.i8, %value_phi111
; βββ
; β @ fastmath.jl:270 within `add_fast`
; ββ @ int.jl:87 within `+`
%10 = add nuw nsw i64 %value_phi10, 1
; ββ
; β @ int.jl:488 within `<=`
%exitcond = icmp eq i64 %value_phi10, %3
; β
br i1 %exitcond, label %L62, label %L8

L62:                                              ; preds = %L8, %top
%value_phi1.lcssa = phi <8 x float> [ zeroinitializer, %top ], [ %9, %L8 ]
; β @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\simdvec.jl:475 within `sum`
; ββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:826 within `reduce_fadd`
; βββ @ C:\Users\rag\.julia\packages\SIMD\7eukp\src\LLVM_intrinsics.jl:842 within `macro expansion`
%res.i = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> %value_phi1.lcssa)
; βββ
ret float %res.i
}
``````
``````function test7(x)
c = 0

i = 1

while i β€ length(x)
c += x[i]

i += 1
end

return c
end
``````
``````;  @ h:\Code\Renderer\simdVload.jl:122 within `test7`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test7_3511([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
; β @ essentials.jl:10 within `length`
%2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
%3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
%4 = load i64, i64* %3, align 8
; β
; β @ int.jl:488 within `<=`
%.not23 = icmp eq i64 %4, 0
; β
br i1 %.not23, label %union_move6, label %idxend.peel

idxend.peel:                                      ; preds = %top
%5 = bitcast {}* %1 to float**
%6 = load float*, float** %5, align 8
; β @ essentials.jl:13 within `getindex`
%7 = load float, float* %6, align 4
; β
%value_phi5.peel = fadd float %7, 0.000000e+00
; β @ int.jl:488 within `<=`
%.not.peel = icmp eq i64 %4, 1
; β
br i1 %.not.peel, label %union_move, label %L9

L9:                                               ; preds = %idxend, %idxend.peel
%value_phi25 = phi i64 [ %14, %idxend ], [ 2, %idxend.peel ]
%8 = phi float [ %value_phi5, %idxend ], [ %value_phi5.peel, %idxend.peel ]
; β @ essentials.jl:13 within `getindex`
%9 = add nsw i64 %value_phi25, -1
%10 = icmp ult i64 %9, %4
br i1 %10, label %idxend, label %oob

oob:                                              ; preds = %L9
%11 = alloca i64, align 8
store i64 %value_phi25, i64* %11, align 8
call void @ijl_bounds_error_ints({}* %1, i64* nonnull %11, i64 1)
unreachable

idxend:                                           ; preds = %L9
%12 = getelementptr inbounds float, float* %6, i64 %9
%13 = load float, float* %12, align 4
; β
%value_phi5 = fadd float %8, %13
; β @ int.jl:87 within `+`
%14 = add nuw nsw i64 %value_phi25, 1
; β
; β @ int.jl:488 within `<=`
%.not.not = icmp ult i64 %value_phi25, %4
; β
br i1 %.not.not, label %L9, label %union_move

post_union_move:                                  ; preds = %union_move6, %union_move
%tindex_phi.lcssa37 = phi i8 [ 2, %union_move6 ], [ 1, %union_move ]
%15 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi.lcssa37, 1
ret { {}*, i8 } %15

union_move:                                       ; preds = %idxend, %idxend.peel
%.lcssa.in = phi float [ %value_phi5.peel, %idxend.peel ], [ %value_phi5, %idxend ]
%16 = bitcast [8 x i8]* %0 to float*
store float %.lcssa.in, float* %16, align 8
br label %post_union_move

union_move6:                                      ; preds = %top
%.sroa_cast15 = bitcast [8 x i8]* %0 to i32*
store i32 0, i32* %.sroa_cast15, align 8
%.sroa_idx = getelementptr inbounds [8 x i8], [8 x i8]* %0, i64 0, i64 4
%.sroa_cast16 = bitcast i8* %.sroa_idx to i32*
store i32 0, i32* %.sroa_cast16, align 4
br label %post_union_move
}
``````
``````function test7(x)
c = 0

i = 1

while i β€ length(x)
@inbounds c += x[i]

i += 1
end

return c
end
``````
``````;  @ h:\Code\Renderer\simdVload.jl:122 within `test7`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test7_3517([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
; β @ essentials.jl:10 within `length`
%2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
%3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
%4 = load i64, i64* %3, align 8
; β
; β @ int.jl:488 within `<=`
%.not20 = icmp eq i64 %4, 0
; β
br i1 %.not20, label %union_move6, label %L9.lr.ph

L9.lr.ph:                                         ; preds = %top
%5 = bitcast {}* %1 to float**
%6 = load float*, float** %5, align 8
; β @ essentials.jl:13 within `getindex`
%7 = load float, float* %6, align 4
; β
%value_phi5.peel = fadd float %7, 0.000000e+00
; β @ int.jl:488 within `<=`
%exitcond.peel = icmp eq i64 %4, 1
; β
br i1 %exitcond.peel, label %union_move, label %L9

L9:                                               ; preds = %L9, %L9.lr.ph
%value_phi22 = phi i64 [ %12, %L9 ], [ 2, %L9.lr.ph ]
%8 = phi float [ %value_phi5, %L9 ], [ %value_phi5.peel, %L9.lr.ph ]
; β @ essentials.jl:13 within `getindex`
%9 = add nsw i64 %value_phi22, -1
%10 = getelementptr inbounds float, float* %6, i64 %9
%11 = load float, float* %10, align 4
; β
%value_phi5 = fadd float %8, %11
; β @ int.jl:87 within `+`
%12 = add nuw nsw i64 %value_phi22, 1
; β
; β @ int.jl:488 within `<=`
%exitcond = icmp eq i64 %value_phi22, %4
; β
br i1 %exitcond, label %union_move, label %L9

post_union_move:                                  ; preds = %union_move6, %union_move
%tindex_phi.lcssa31 = phi i8 [ 2, %union_move6 ], [ 1, %union_move ]
%13 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi.lcssa31, 1
ret { {}*, i8 } %13

union_move:                                       ; preds = %L9, %L9.lr.ph
%.lcssa.in = phi float [ %value_phi5.peel, %L9.lr.ph ], [ %value_phi5, %L9 ]
%14 = bitcast [8 x i8]* %0 to float*
store float %.lcssa.in, float* %14, align 8
br label %post_union_move

union_move6:                                      ; preds = %top
%.sroa_cast15 = bitcast [8 x i8]* %0 to i32*
store i32 0, i32* %.sroa_cast15, align 8
%.sroa_idx = getelementptr inbounds [8 x i8], [8 x i8]* %0, i64 0, i64 4
%.sroa_cast16 = bitcast i8* %.sroa_idx to i32*
store i32 0, i32* %.sroa_cast16, align 4
br label %post_union_move
}
``````
``````function test8(x)
c = 0

for i in eachindex(x)
c += x[i]
end

return c
end
``````
``````;  @ h:\Code\Renderer\simdVload.jl:132 within `test8`
; Function Attrs: uwtable
define { {}*, i8 } @julia_test8_3514([8 x i8]* noalias nocapture noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 16 dereferenceable(40) %1) #0 {
top:
; β @ abstractarray.jl:314 within `eachindex`
; ββ @ abstractarray.jl:133 within `axes1`
; βββ @ abstractarray.jl:98 within `axes`
; ββββ @ array.jl:149 within `size`
%2 = bitcast {}* %1 to { i8*, i64, i16, i16, i32 }*
%3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
%4 = load i64, i64* %3, align 8
; ββββ
; β @ range.jl:887 within `iterate`
; ββ @ range.jl:662 within `isempty`
; βββ @ operators.jl:369 within `>`
; ββββ @ int.jl:83 within `<`
%.not.not = icmp eq i64 %4, 0
; ββββ
br i1 %.not.not, label %union_move15, label %idxend.lr.ph

idxend.lr.ph:                                     ; preds = %top
%5 = bitcast {}* %1 to float**
%6 = load float*, float** %5, align 8
; β @ essentials.jl:13 within `getindex`
br label %idxend

L43:                                              ; preds = %idxend
; β
; β @ range.jl:891 within `iterate`
%7 = add nuw nsw i64 %value_phi347, 1
; β
; β @ range.jl:887 within `iterate`
%8 = bitcast float %value_phi9 to i32
; β
; β @ essentials.jl:13 within `getindex`
br label %idxend

idxend:                                           ; preds = %L43, %idxend.lr.ph
%9 = phi i64 [ 0, %idxend.lr.ph ], [ %value_phi347, %L43 ]
%value_phi548 = phi i1 [ true, %idxend.lr.ph ], [ false, %L43 ]
%value_phi347 = phi i64 [ 1, %idxend.lr.ph ], [ %7, %L43 ]
%.sroa.030.046 = phi i32 [ 0, %idxend.lr.ph ], [ %8, %L43 ]
%10 = getelementptr inbounds float, float* %6, i64 %9
%11 = load float, float* %10, align 4
; β
%12 = bitcast i32 %.sroa.030.046 to float
%13 = uitofp i32 %.sroa.030.046 to float
%.pn = select i1 %value_phi548, float %13, float %12
%value_phi9 = fadd float %.pn, %11
; β @ range.jl:891 within `iterate`
; ββ @ promotion.jl:499 within `==`
%.not = icmp eq i64 %value_phi347, %4
; ββ
br i1 %.not, label %union_move, label %L43

post_union_move:                                  ; preds = %union_move15, %union_move
%tindex_phi1344 = phi i8 [ 2, %union_move15 ], [ 1, %union_move ]
%14 = insertvalue { {}*, i8 } { {}* null, i8 undef }, i8 %tindex_phi1344, 1
ret { {}*, i8 } %14

union_move:                                       ; preds = %idxend
%15 = bitcast [8 x i8]* %0 to float*
store float %value_phi9, float* %15, align 8
br label %post_union_move

union_move15:                                     ; preds = %top
%16 = bitcast [8 x i8]* %0 to i64*
store i64 0, i64* %16, align 8
br label %post_union_move
}
``````

It makes more sense that the while loop has bounds-checking without the `@inbounds`, but now Iβm confused.

I would have expected the inbounds check to make a timing difference in `test7`, I mean the only difference in the `@code_llvm` is that it adds a comparison and branch to possibly throwing an error.

Iβm not sure why the `test8` for loop version would be so much slower, the `@code_llvm` not only looks shorter it weirdly does not have bounds check code like `test7` does (maybe `eachindex` is smart?). It does look like `test8` has a integer-to-float type conversion in the loop that `test7` doesnβt (`uitop`) before the `fadd`. I think this might be rooted in your `c=0`, which makes `c` type-unstable because you may or may not add `Float32`s to an `Int64`. But I could be mistaken because I canβt read LLVM well at all (like why all the i32 if `c` starts at `Int64`) and the `test7` while loop also starts with `c=0`, which I would expect to also be a type instability requiring type conversions. If you want to fix that, you just do `c = zero(eltype(x))`, that should compile to the stable type of zero.

Could you edit your post to include your timing calls and results too? I assume youβre using `@btime` and interpolating a precomputed input array so it isnβt part of the benchmark.

To clarify, when I said

Using `@inbounds` on the while loop made no difference to the speed but looking at the llvm it doesnβt seem to be skipping the bounds checking

I meant that the version not using `@inbounds` wasnβt skipping bounds checking, the version using `@inbounds` does seem to be skiping it.

I think the lack of a timing difference is probably because itβs quite cheap to do the branching especially when it should be able to predict that we get the same result every time.
The problem is due to `c = 0`, changing to `c = zero(eltype(x))` fixes the slowness.

Whatever is causing the for loop to be slow with `c=0` isnβt relevant to the case I care about which is iterating over `1:N:length(x)` so Iβm not going to spend any more time thinking about it.

Now when iterating over `1:N:length(x)`, Iβve managed to get it running 1ns faster than the while loop by adding `@inline` to `steprange_last` and to the constructors of `StepRange`. Iβm not sure if thereβs a downside to this but if not I might make a pr to fix this.

1 Like

It seems I canβt edit my old posts, but here are the benchmarks where `x = rand(Float32, 800)` (with `c=0` i.e. same code for which I posted the llvm)

``````julia> @benchmark test(\$x)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
Range (min β¦ max):  84.091 ns β¦ 287.397 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     84.194 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   86.212 ns Β±  14.460 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββ                                                           β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
84.1 ns       Histogram: log(frequency) by time       210 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test4(\$x)
BenchmarkTools.Trial: 10000 samples with 978 evaluations.
Range (min β¦ max):  70.552 ns β¦ 276.074 ns  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     71.268 ns               β GC (median):    0.00%
Time  (mean Β± Ο):   72.845 ns Β±  11.458 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

βββ ββββ                                                     β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
70.6 ns       Histogram: log(frequency) by time      95.3 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

@benchmark test7_noinbounds(\$x)
BenchmarkTools.Trial: 10000 samples with 171 evaluations.
Range (min β¦ max):  630.409 ns β¦  4.875 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     630.409 ns              β GC (median):    0.00%
Time  (mean Β± Ο):   635.315 ns Β± 67.407 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββ      βββββ                                                β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
630 ns        Histogram: log(frequency) by time       697 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test7_inbounds(\$x)
BenchmarkTools.Trial: 10000 samples with 173 evaluations.
Range (min β¦ max):  628.902 ns β¦  1.970 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     630.058 ns              β GC (median):    0.00%
Time  (mean Β± Ο):   644.193 ns Β± 97.365 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

ββββ                                                         β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
629 ns        Histogram: log(frequency) by time      1.01 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark test8(\$x)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min β¦ max):  2.111 ΞΌs β¦  10.022 ΞΌs  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     2.111 ΞΌs               β GC (median):    0.00%
Time  (mean Β± Ο):   2.192 ΞΌs Β± 471.679 ns  β GC (mean Β± Ο):  0.00% Β± 0.00%

β  β                                                      β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
2.11 ΞΌs      Histogram: log(frequency) by time      5.29 ΞΌs <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

To summarise, the construction of a steprange doesnβt get inlined and so adds a couple nanoseconds of overhead. I made a pr proposing a fix here, Inline StepRange construction by Zentrik Β· Pull Request #49270 Β· JuliaLang/julia Β· GitHub.