Missing iteration invariant optimization?

llvm
optimization
loops

#1

I have a feeling that iterators are harmful. Consider:

function baz1(n)
    for i=1:n
        i <1 && return 1
        i>n &&  return 2
    end
    0
end

function baz2(n)
    I = 1:n
    state = start(I)
    while !done(I, state)
        (i, state) = next(I, state)
        i <1 && return 1
        i>n &&  return 2
    end
    0
end

function baz3(n)
    i=1
    while i<n
        i <1 && return 1
        i>n &&  return 2
        i +=1
    end
    0
end

Now, baz1 and baz2 are equivalent according to docs (and tested with @code_llvm on both 0.6 and 0.7 nightly). Let’s look at the code:

julia> @code_llvm baz1(10)

; Function baz1
; Location: REPL[7]:2
define i64 @julia_baz1_63294(i64) {
top:
  %1 = icmp sgt i64 %0, 0
  %2 = select i1 %1, i64 %0, i64 0
  %3 = add i64 %2, 1
  br label %L3

L3:                                               ; preds = %L12, %top
  %"#temp#.0" = phi i64 [ 1, %top ], [ %6, %L12 ]
  %4 = icmp eq i64 %"#temp#.0", %3
  br i1 %4, label %L18, label %if

if:                                               ; preds = %L3
; Location: REPL[7]:3
  %5 = icmp sgt i64 %"#temp#.0", 0
  br i1 %5, label %L12, label %L18

L18:                                              ; preds = %L12, %if, %L3
  %merge = phi i64 [ 0, %L3 ], [ 1, %if ], [ 2, %L12 ]
; Location: REPL[7]:6
  ret i64 %merge

L12:                                              ; preds = %if
; Location: REPL[7]:2
  %6 = add nuw i64 %"#temp#.0", 1
; Location: REPL[7]:4
  %7 = icmp sgt i64 %"#temp#.0", %0
  br i1 %7, label %L18, label %L3
}

On the other hand,

julia> @code_llvm baz3(10)

; Function baz3
; Location: REPL[3]:2
define i64 @julia_baz3_63295(i64) {
top:
; Location: REPL[3]:3
  br i1 undef, label %L17, label %L17

L17:                                              ; preds = %top, %top
; Location: REPL[3]:8
  ret i64 0
}

The relevance to bounds checks, and hence general performance, is obvious. I think something must be done to help llvm in “for i=a:b” loops, be it more annotations, a special case for iterators over unit ranges, or a new llvm pass.

Is there a way to obtain a better code_llvm that contains all optimization metadata (all the align / noalias / etc annotations)? Is there a way to obtain code_llvm at each pass?


#2

Big oops. baz3 is of course not equivalent to baz1 and baz2, the correct code is

julia> function baz4(n)
           i=1
           while i<=n
               i <1 && return 1
               i>n &&  return 2
               i +=1
           end
           0
       end

julia> @code_llvm baz4(10)

; Function baz4
; Location: REPL[16]:2
define i64 @julia_baz4_63322(i64) {
top:
; Location: REPL[16]:3
  %1 = icmp slt i64 %0, 1
  br i1 %1, label %L17, label %if.preheader

if.preheader:                                     ; preds = %top
; Location: REPL[16]:4
  br label %if

L3:                                               ; preds = %if
; Location: REPL[16]:6
  %2 = add nuw i64 %i.03, 1
; Location: REPL[16]:3
  %3 = icmp sgt i64 %2, %0
  br i1 %3, label %L17.loopexit, label %if

if:                                               ; preds = %if.preheader, %L3
  %i.03 = phi i64 [ %2, %L3 ], [ 1, %if.preheader ]
; Location: REPL[16]:4
  %4 = icmp sgt i64 %i.03, 0
  br i1 %4, label %L3, label %L17.loopexit

L17.loopexit:                                     ; preds = %if, %L3
  %merge.ph = phi i64 [ 1, %if ], [ 0, %L3 ]
; Location: REPL[16]:8
  br label %L17

L17:                                              ; preds = %L17.loopexit, %top
  %merge = phi i64 [ 0, %top ], [ %merge.ph, %L17.loopexit ]
  ret i64 %merge
}

which manages to remove only case 2. Still, there should be a way to get these simple loop invariants figured out.

Edit: “gcc -O3” figures this out in C; “clang -O3” doesn’t (tested clang 5.0 and gcc 7.2.0).


#3

The title was rather overblown so I’ve edited it to be a bit more restrained. Since clang does not do this optimization it’s not too surprising that Julia doesn’t either – it’s LLVM either way. This isn’t a fundamental issue with iterators though, it’s just a missing optimization in LLVM. My guess is that since this kind of optimization is quite natural to do “by hand” (i.e. notice that the branch conditions can’t be met and delete them entirely) it hasn’t been a pressing matter for LLVM. Do you have a use case where this ends up being a problem?


#4

Re title: sorry for the bad vibes; thanks for correcting this.

using BenchmarkTools
function enumsumtest(V)
    s=0
    for i=1:length(V)
        for (j,v) = enumerate(V)
            s+=v
        end
    end
    s
end

function ib_enumsumtest(V)
    s=0
    @inbounds for i=1:length(V)
        for (j,v) in enumerate(V)
            s+=v
        end
    end
    s
end
V = Vector{Int}(5_000);
@btime enumsumtest(V)
  21.564 ms (1 allocation: 16 bytes)

@btime ib_enumsumtest(V)
  3.009 ms (1 allocation: 16 bytes)

#5

In order to optimize out the bounds-checks, there are actually three issues: First, the enumerate confuses llvm (baz1/baz2 vs baz4); second, llvm somehow fails at something (baz3 vs baz4), and third llvm fails to track the length of the vector (it thinks the length is volatile?).

Consider:

function enumsumtest_wrong_hint(V)
    s=0
    n=length(V)
    for i=1:n
        j=1
        while j<n
            j < 1 && return 1
            j > n && return 2
            @inbounds s+=V[j]
            j += 1
        end
    end
    s
end
@btime enumsumtest_wrong_hint(V)
  3.052 ms (1 allocation: 16 bytes)

which is tantalizingly close to what we want!

On the other hand,

function enumsumtest_wrong_nohint(V)
    s=0
    n=length(V)
    for i=1:n
        j=1
        while j<length(V)
            j < 1 && return 1
            j > length(V) && return 2
            @inbounds s+=V[j]
            j += 1
        end
    end
    s
end
@btime enumsumtest_wrong_nohint(V)
  19.277 ms (1 allocation: 16 bytes)

Edit: I rewrote this and opened an issue https://github.com/JuliaLang/julia/issues/24925


#6

Something that should give us hope for getting rid of most boundschecks:

The following correctly vectorizes on 0.6 and emits no boundschecks (even though there is no @inbounds):

function foo_x(V::Vector{T}) where {T}
    s=zero(T)
    state = 0
    while reinterpret(UInt64,state) <  reinterpret(UInt64, size(V,1))
        state += 1
        v =V[state]
        s += v
    end
    return s
end