Exponentially long compile times in generated functions that disappear after manual inlining

performance
simd
array

#1

I have been using TensorOperations.jl for fast N-dimensional Array operations, which are written using generated functions that produce N nested @simd for loops. I discovered that while the runtime of these functions is very fast (TensorOperations.tensorcopy! is much faster than the equivalent Base.permutedims! for computing tensor transposes of large arrays), the compile time was very long for large (N >= 15) rank arrays. While the application of arrays with ranks >= 15 is somewhat niche due to the large memory requirements, 2x2x2x…x2 arrays are useful for numerical quantum physics with dimensions as large as N=30.

For comparison, here are the compile times and runtimes for a transpose! function (the full code is given in the example below) which is a drop-in replacement for Base.permutedims!. In the following, the arrays used are dimension N with size 2x2x2x…x2.

N=16

  • Base.permutedims! compile time: 0.65 sec
  • Base.permutedims! run time: 2.20 sec
  • transpose! compile time: 29.1 sec
  • transpose! run time: 0.00069 sec

N=17

  • Base.permutedims! compile time: 0.5 sec
  • Base.permutedims! run time: 4.8 sec
  • transpose! compile time: 56.8 sec
  • transpose! run time: 0.00107 sec

The compile time of transpose! is independent of the size of the arrays and approximately doubles with an increment in N by 1. The time to compute the transposes of the arrays is then approximately linear in the number of elements in that array, and so the runtime of transpose for a 2x2x2x…x2 array of dimension N also approximately doubles with each increment of N by 1 – but as you see in the table above, the compile time is vastly larger than the runtime.
Scaling this up to N=30, transpose! will take over 5 days to compile but 9 seconds to run, while permutedims! will run in approximately 11 hours.

In order to get the benefit of the fast @simd loops, it would be nice to not have compile times that are so large. To my surprise, I discovered that a simple one line change made the long compile times go away, and I’d like to understand why.

A simplified working example with the long compile times is here:

macro stridedloops(N, dims, args...)
    esc(_stridedloops(N, dims, args...))
end

function _stridedloops(N::Int, dims::Symbol, args...)
    mod(length(args),2)==1 || error("Wrong number of arguments")
    argiter = 1:2:length(args)-1
    body = args[end]
    pre = [Expr(:(=), args[i], Symbol(args[i],0)) for i in argiter]
    ex = Expr(:block, pre..., body)
    for d = 1:N
        pre = [Expr(:(=), Symbol(args[i], d-1), Symbol(args[i], d)) for i in argiter]
        post = [Expr(:(+=), Symbol(args[i], d), Expr(:ref, args[i+1], d)) for i in argiter]
        ex = Expr(:block, pre..., ex, post...)
        rangeex = Expr(:(:), 1, Expr(:ref, dims, d))
        forex = Expr(:(=), gensym(), rangeex)
        ex = Expr(:for, forex, ex)
        if d==1
            ex = Expr(:macrocall, Symbol("@simd"), ex)
        end
    end
    pre = [Expr(:(=),Symbol(args[i],N),1) for i in argiter]
    ex = Expr(:block, pre..., ex)
end

immutable StridedData{N,T}
    data::Vector{T}
    strides::NTuple{N,Int}
end
Base.getindex(a::StridedData,i) = a.data[i]
Base.setindex!(a::StridedData,v,i) = (@inbounds a.data[i] = v)

@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N, Int})
    quote
        stridesA = A.strides
        stridesC = C.strides
        @stridedloops($N, dims, indA, stridesA, indC, stridesC, @inbounds C[indC]=A[indA])
        return C
    end
end

function transpose!(C::StridedArray, A::StridedArray, perm)
    N = ndims(A)
    stridesA = ntuple(i->stride(A, i), N)
    stridesC = ntuple(i->stride(C, i), N)
    stridesA = stridesA[perm]
    minstrides = ntuple(i->min(stridesA[i], stridesC[i]), N)
    p = sortperm(collect(minstrides))
    dims = size(C)
    dims = dims[p]
    stridesA = stridesA[p]
    stridesC = stridesC[p]
    minstrides = minstrides[p]
    dataA = StridedData(vec(A), stridesA)
    dataC = StridedData(vec(C), stridesC)
    transpose_micro!(dataC, dataA, dims)
    return C
end

for N in 10:17
    A = rand(Complex128, fill(2, N)...)
    C1 = zeros(Complex128, fill(2, N)...)
    C2 = zeros(Complex128, fill(2, N)...)
    perm = randperm(N)
    println("N=$N")
    println("transpose!")
    @time transpose!(C1, A, perm)
    @time transpose!(C1, A, perm)
    println("permutedims!")
    @time permutedims!(C2, A, perm)
    @time permutedims!(C2, A, perm)
    @assert C1 ≈ C2
end

Oddly, the compile times are drastically improved by changing the body of the @generated function from

@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N, Int})
    quote
        stridesA = A.strides
        stridesC = C.strides
        @stridedloops($N, dims, indA, stridesA, indC, stridesC, @inbounds C[indC]=A[indA])
        return C
    end
end

to

@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N, Int})
    quote
        @stridedloops($N, dims, indA, A.strides, indC, C.strides, @inbounds C[indC]=A[indA])
        return C
    end
end

which results in the times
N=17

  • transpose! compile time: 0.56 sec
  • transpose! run time: 0.00198 sec.
    Additionally, the compile time scales linearly in N instead of exponentially, so the operation becomes fast for larger N.

When looking at the bodies of these generated functions, there isn’t an obvious indicator for why the compile times should be large in one version but not the other: the difference is a simple variable substitution with instances of A.strides in the body replaced by stridesA, preceded by the line stridesA = A.strides.

I’m wondering if there is a simple way to understand and improve the compile-time performance of these generated functions, or if this is a performance bug of some sort.


#2

Personally, any function that takes 5 days to compile and 9 seconds to run counts as a “bug” in my book.


#3

Could you post the expanded code for these functions code_typed(transpose_micro!, (argtypes...), optimize=false)? Since @stridedloops is a macro, it’s pretty hard to determine at a glance what it will do. Also, is the time spent during @time code_typed or during @time code_llvm or during @time code_native?


#4

The time is spent about equally between code_llvm and code_native. Almost no time is spent in code_typed.
Here is the output of

N=5
code_typed(transpose_micro!, (StridedData{N, Float64}, StridedData{N, Float64}, NTuple{N, Int}), optimize=false)

with the slow compiling transpose_micro!:

1-element Array{Any,1}:
 CodeInfo(:(begin  # line 34:
        # meta: location /home/brayden/Code/ncon/test/mwe.jl # line 35:
        stridesA = (Core.getfield)(A, :strides)::NTuple{5,Int64} # line 36:
        stridesC = (Core.getfield)(C, :strides)::NTuple{5,Int64} # line 37:
        indA5 = 1
        indC5 = 1
        SSAValue(0) = (Main.colon)(1, (Main.getindex)(dims, 5)::Int64)::UnitRange{Int64}
        #temp#@_31 = (Base.start)(SSAValue(0))::Int64
        12: 
        unless !((Base.done)(SSAValue(0), #temp#@_31)::Bool)::Bool goto 119
        SSAValue(1) = (Base.next)(SSAValue(0), #temp#@_31)::Tuple{Int64,Int64}
        780 = (Core.getfield)(SSAValue(1), 1)::Int64
        #temp#@_31 = (Core.getfield)(SSAValue(1), 2)::Int64
        indA4 = indA5
        indC4 = indC5
        SSAValue(2) = (Main.colon)(1, (Main.getindex)(dims, 4)::Int64)::UnitRange{Int64}
        #temp#@_27 = (Base.start)(SSAValue(2))::Int64
        21: 
        unless !((Base.done)(SSAValue(2), #temp#@_27)::Bool)::Bool goto 114
        SSAValue(3) = (Base.next)(SSAValue(2), #temp#@_27)::Tuple{Int64,Int64}
        779 = (Core.getfield)(SSAValue(3), 1)::Int64
        #temp#@_27 = (Core.getfield)(SSAValue(3), 2)::Int64
        indA3 = indA4
        indC3 = indC4
        SSAValue(4) = (Main.colon)(1, (Main.getindex)(dims, 3)::Int64)::UnitRange{Int64}
        #temp#@_23 = (Base.start)(SSAValue(4))::Int64
        30: 
        unless !((Base.done)(SSAValue(4), #temp#@_23)::Bool)::Bool goto 109
        SSAValue(5) = (Base.next)(SSAValue(4), #temp#@_23)::Tuple{Int64,Int64}
        778 = (Core.getfield)(SSAValue(5), 1)::Int64
        #temp#@_23 = (Core.getfield)(SSAValue(5), 2)::Int64
        indA2 = indA3
        indC2 = indC3
        SSAValue(6) = (Main.colon)(1, (Main.getindex)(dims, 2)::Int64)::UnitRange{Int64}
        #temp#@_19 = (Base.start)(SSAValue(6))::Int64
        39: 
        unless !((Base.done)(SSAValue(6), #temp#@_19)::Bool)::Bool goto 104
        SSAValue(7) = (Base.next)(SSAValue(6), #temp#@_19)::Tuple{Int64,Int64}
        777 = (Core.getfield)(SSAValue(7), 1)::Int64
        #temp#@_19 = (Core.getfield)(SSAValue(7), 2)::Int64
        indA1 = indA2
        indC1 = indC2
        # meta: location simdloop.jl # line 65:
        r#781 = (Main.colon)(1, (Main.getindex)(dims, 1)::Int64)::UnitRange{Int64} # line 66:
        SSAValue(8) = (Base.simd_outer_range)(r#781)::UnitRange{Int64}
        #temp#@_14 = (Base.start)(SSAValue(8))::Int64
        52: 
        unless !((Base.done)(SSAValue(8), #temp#@_14)::Bool)::Bool goto 96
        SSAValue(9) = (Base.next)(SSAValue(8), #temp#@_14)::Tuple{Int64,Int64}
        i#782 = (Core.getfield)(SSAValue(9), 1)::Int64
        #temp#@_14 = (Core.getfield)(SSAValue(9), 2)::Int64 # line 67:
        NewvarNode(:(776))
        n#783 = (Base.simd_inner_length)(r#781, i#782)::Int64 # line 68:
        unless ((Main.zero)(n#783)::Int64 < n#783)::Bool goto 94 # line 70:
        i#784 = (Main.zero)(n#783)::Int64 # line 71:
        NewvarNode(:(indA0))
        NewvarNode(:(indC0))
        NewvarNode(:(indA))
        NewvarNode(:(indC))
        NewvarNode(:(##776))
        70: 
        unless (i#784 < n#783)::Bool goto 91 # line 72:
        ##776 = (Base.simd_index)(r#781, i#782, i#784)::Int64 # line 73:
        indA0 = indA1
        indC0 = indC1
        indA = indA0
        indC = indC0
        $(Expr(:inbounds, true))
        SSAValue(10) = (Main.getindex)(A, indA)::Float64
        (Main.setindex!)(C, SSAValue(10), indC)::Void
        $(Expr(:inbounds, :pop))
        indA1 = (indA1 + (Main.getindex)(stridesA, 1)::Int64)::Int64
        indC1 = (indC1 + (Main.getindex)(stridesC, 1)::Int64)::Int64 # line 74:
        i#784 = (i#784 + 1)::Int64 # line 75:
        $(Expr(:simdloop))
        89: 
        goto 70
        91:  # line 79:
        776 = (Main.last)(r#781)::Int64
        94: 
        goto 52
        96:  # line 84:
        Main.nothing
        # meta: pop location
        indA2 = (indA2 + (Main.getindex)(stridesA, 2)::Int64)::Int64
        indC2 = (indC2 + (Main.getindex)(stridesC, 2)::Int64)::Int64
        102: 
        goto 39
        104: 
        indA3 = (indA3 + (Main.getindex)(stridesA, 3)::Int64)::Int64
        indC3 = (indC3 + (Main.getindex)(stridesC, 3)::Int64)::Int64
        107: 
        goto 30
        109: 
        indA4 = (indA4 + (Main.getindex)(stridesA, 4)::Int64)::Int64
        indC4 = (indC4 + (Main.getindex)(stridesC, 4)::Int64)::Int64
        112: 
        goto 21
        114: 
        indA5 = (indA5 + (Main.getindex)(stridesA, 5)::Int64)::Int64
        indC5 = (indC5 + (Main.getindex)(stridesC, 5)::Int64)::Int64
        117: 
        goto 12
        119:  # line 38:
        SSAValue(11) = C
        # meta: pop location
        return SSAValue(11)
    end))=>StridedData{5,Float64}

and here is the same with the fast compiling transpose_micro!:

1-element Array{Any,1}:
 CodeInfo(:(begin  # line 43:
        # meta: location /home/brayden/Code/ncon/test/mwe.jl # line 44:
        indA5 = 1
        indC5 = 1
        SSAValue(0) = (Main.colon)(1, (Main.getindex)(dims, 5)::Int64)::UnitRange{Int64}
        #temp#@_31 = (Base.start)(SSAValue(0))::Int64
        8: 
        unless !((Base.done)(SSAValue(0), #temp#@_31)::Bool)::Bool goto 115
        SSAValue(1) = (Base.next)(SSAValue(0), #temp#@_31)::Tuple{Int64,Int64}
        789 = (Core.getfield)(SSAValue(1), 1)::Int64
        #temp#@_31 = (Core.getfield)(SSAValue(1), 2)::Int64
        indA4 = indA5
        indC4 = indC5
        SSAValue(2) = (Main.colon)(1, (Main.getindex)(dims, 4)::Int64)::UnitRange{Int64}
        #temp#@_27 = (Base.start)(SSAValue(2))::Int64
        17: 
        unless !((Base.done)(SSAValue(2), #temp#@_27)::Bool)::Bool goto 110
        SSAValue(3) = (Base.next)(SSAValue(2), #temp#@_27)::Tuple{Int64,Int64}
        788 = (Core.getfield)(SSAValue(3), 1)::Int64
        #temp#@_27 = (Core.getfield)(SSAValue(3), 2)::Int64
        indA3 = indA4
        indC3 = indC4
        SSAValue(4) = (Main.colon)(1, (Main.getindex)(dims, 3)::Int64)::UnitRange{Int64}
        #temp#@_23 = (Base.start)(SSAValue(4))::Int64
        26: 
        unless !((Base.done)(SSAValue(4), #temp#@_23)::Bool)::Bool goto 105
        SSAValue(5) = (Base.next)(SSAValue(4), #temp#@_23)::Tuple{Int64,Int64}
        787 = (Core.getfield)(SSAValue(5), 1)::Int64
        #temp#@_23 = (Core.getfield)(SSAValue(5), 2)::Int64
        indA2 = indA3
        indC2 = indC3
        SSAValue(6) = (Main.colon)(1, (Main.getindex)(dims, 2)::Int64)::UnitRange{Int64}
        #temp#@_19 = (Base.start)(SSAValue(6))::Int64
        35: 
        unless !((Base.done)(SSAValue(6), #temp#@_19)::Bool)::Bool goto 100
        SSAValue(7) = (Base.next)(SSAValue(6), #temp#@_19)::Tuple{Int64,Int64}
        786 = (Core.getfield)(SSAValue(7), 1)::Int64
        #temp#@_19 = (Core.getfield)(SSAValue(7), 2)::Int64
        indA1 = indA2
        indC1 = indC2
        # meta: location simdloop.jl # line 65:
        r#790 = (Main.colon)(1, (Main.getindex)(dims, 1)::Int64)::UnitRange{Int64} # line 66:
        SSAValue(8) = (Base.simd_outer_range)(r#790)::UnitRange{Int64}
        #temp#@_14 = (Base.start)(SSAValue(8))::Int64
        48: 
        unless !((Base.done)(SSAValue(8), #temp#@_14)::Bool)::Bool goto 92
        SSAValue(9) = (Base.next)(SSAValue(8), #temp#@_14)::Tuple{Int64,Int64}
        i#791 = (Core.getfield)(SSAValue(9), 1)::Int64
        #temp#@_14 = (Core.getfield)(SSAValue(9), 2)::Int64 # line 67:
        NewvarNode(:(785))
        n#792 = (Base.simd_inner_length)(r#790, i#791)::Int64 # line 68:
        unless ((Main.zero)(n#792)::Int64 < n#792)::Bool goto 90 # line 70:
        i#793 = (Main.zero)(n#792)::Int64 # line 71:
        NewvarNode(:(indA0))
        NewvarNode(:(indC0))
        NewvarNode(:(indA))
        NewvarNode(:(indC))
        NewvarNode(:(##785))
        66: 
        unless (i#793 < n#792)::Bool goto 87 # line 72:
        ##785 = (Base.simd_index)(r#790, i#791, i#793)::Int64 # line 73:
        indA0 = indA1
        indC0 = indC1
        indA = indA0
        indC = indC0
        $(Expr(:inbounds, true))
        SSAValue(10) = (Main.getindex)(A, indA)::Float64
        (Main.setindex!)(C, SSAValue(10), indC)::Void
        $(Expr(:inbounds, :pop))
        indA1 = (indA1 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 1)::Int64)::Int64
        indC1 = (indC1 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 1)::Int64)::Int64 # line 74:
        i#793 = (i#793 + 1)::Int64 # line 75:
        $(Expr(:simdloop))
        85: 
        goto 66
        87:  # line 79:
        785 = (Main.last)(r#790)::Int64
        90: 
        goto 48
        92:  # line 84:
        Main.nothing
        # meta: pop location
        indA2 = (indA2 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 2)::Int64)::Int64
        indC2 = (indC2 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 2)::Int64)::Int64
        98: 
        goto 35
        100: 
        indA3 = (indA3 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 3)::Int64)::Int64
        indC3 = (indC3 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 3)::Int64)::Int64
        103: 
        goto 26
        105: 
        indA4 = (indA4 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 4)::Int64)::Int64
        indC4 = (indC4 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 4)::Int64)::Int64
        108: 
        goto 17
        110: 
        indA5 = (indA5 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 5)::Int64)::Int64
        indC5 = (indC5 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 5)::Int64)::Int64
        113: 
        goto 8
        115:  # line 45:
        SSAValue(11) = C
        # meta: pop location
        return SSAValue(11)
    end))=>StridedData{5,Float64}

As you can see, the only difference is a variable substitution. The single call stridesA = (Core.getfield)(A, :strides)::NTuple{5,Int64} is repeated everywhere stridesA is needed in the 5 nested for loops in the second version. Perhaps this is why the run time of the second version is slightly slower, but it doesn’t seem to explain the drastic difference in the compile time.


#5

I should also mention which Julia version I’m using.
The behavior I describe above is in Julia 0.5.2, specifically:

Julia Version 0.5.2
Commit f4c6c9d4bb (2017-05-06 16:34 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)

I just had a chance to test the same on julia 0.6.0 with different results. Now both versions of transpose_micro! take as long to compile as the slow version of transpose_micro! in julia 0.5.2! The version of llvm is also different: specifically

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

I’ll try a more recent julia from the master branch to see what happens there.