# Exponentially long compile times in generated functions that disappear after manual inlining

I have been using TensorOperations.jl for fast N-dimensional array operations; its kernels are written with generated functions that produce N nested for loops, with @simd on the innermost one. I discovered that while the runtime of these functions is very fast (TensorOperations.tensorcopy! is much faster than the equivalent Base.permutedims! for computing tensor transposes of large arrays), the compile time is very long for arrays of large rank (N >= 15). While arrays of rank >= 15 are somewhat niche due to their memory requirements, 2x2x2x…x2 arrays with rank as large as N=30 are useful in numerical quantum physics.

For comparison, here are the compile times and runtimes of a transpose! function (the full code is given in the example below), which is a drop-in replacement for Base.permutedims!. In the following, the arrays have rank N and size 2x2x2x…x2.

N=16

• Base.permutedims! compile time: 0.65 sec
• Base.permutedims! run time: 2.20 sec
• transpose! compile time: 29.1 sec
• transpose! run time: 0.00069 sec

N=17

• Base.permutedims! compile time: 0.5 sec
• Base.permutedims! run time: 4.8 sec
• transpose! compile time: 56.8 sec
• transpose! run time: 0.00107 sec

The compile time of transpose! is independent of the size of the arrays and approximately doubles with each increment of N by 1. The time to compute the transpose is approximately linear in the number of elements of the array, so the runtime of transpose! for a 2x2x2x…x2 array of rank N also approximately doubles with each increment of N by 1 – but as you can see in the numbers above, the compile time is vastly larger than the runtime.
Scaling this up to N=30, transpose! will take over 5 days to compile but 9 seconds to run, while permutedims! will run in approximately 11 hours.
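(The N=30 figure is just the measured doubling extrapolated forward; as a sanity check, assuming the compile time exactly doubles per unit of N starting from the measured N=17 value:)

```julia
# Extrapolate the measured N=17 compile time (56.8 s), assuming it
# exactly doubles with each increment of N.
compile_n17 = 56.8                         # seconds, measured above
compile_n30 = compile_n17 * 2.0^(30 - 17)  # ≈ 465,306 s
println(compile_n30 / 86400)               # ≈ 5.4 days
```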

In order to get the benefit of the fast @simd loops, it would be nice to avoid such long compile times. To my surprise, I discovered that a simple one-line change makes the long compile times go away, and I’d like to understand why.

A simplified working example with the long compile times is here:

``````
macro stridedloops(N, dims, args...)
    esc(_stridedloops(N, dims, args...))
end

function _stridedloops(N::Int, dims::Symbol, args...)
    mod(length(args), 2) == 1 || error("Wrong number of arguments")
    argiter = 1:2:length(args)-1
    body = args[end]
    pre = [Expr(:(=), args[i], Symbol(args[i], 0)) for i in argiter]
    ex = Expr(:block, pre..., body)
    for d = 1:N
        pre = [Expr(:(=), Symbol(args[i], d-1), Symbol(args[i], d)) for i in argiter]
        post = [Expr(:(+=), Symbol(args[i], d), Expr(:ref, args[i+1], d)) for i in argiter]
        ex = Expr(:block, pre..., ex, post...)
        rangeex = Expr(:(:), 1, Expr(:ref, dims, d))
        forex = Expr(:(=), gensym(), rangeex)
        ex = Expr(:for, forex, ex)
        if d == 1
            ex = Expr(:macrocall, Symbol("@simd"), ex)
        end
    end
    pre = [Expr(:(=), Symbol(args[i], N), 1) for i in argiter]
    ex = Expr(:block, pre..., ex)
end

immutable StridedData{N,T}
    data::Vector{T}
    strides::NTuple{N,Int}
end
Base.getindex(a::StridedData, i) = a.data[i]
Base.setindex!(a::StridedData, v, i) = (@inbounds a.data[i] = v)

@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N,Int})
    quote
        stridesA = A.strides
        stridesC = C.strides
        @stridedloops($N, dims, indA, stridesA, indC, stridesC, @inbounds C[indC] = A[indA])
        return C
    end
end

function transpose!(C::StridedArray, A::StridedArray, perm)
    N = ndims(A)
    stridesA = ntuple(i -> stride(A, i), N)
    stridesC = ntuple(i -> stride(C, i), N)
    stridesA = stridesA[perm]
    minstrides = ntuple(i -> min(stridesA[i], stridesC[i]), N)
    p = sortperm(collect(minstrides))
    dims = size(C)
    dims = dims[p]
    stridesA = stridesA[p]
    stridesC = stridesC[p]
    minstrides = minstrides[p]
    dataA = StridedData(vec(A), stridesA)
    dataC = StridedData(vec(C), stridesC)
    transpose_micro!(dataC, dataA, dims)
    return C
end

for N in 10:17
    A = rand(Complex128, fill(2, N)...)
    C1 = zeros(Complex128, fill(2, N)...)
    C2 = zeros(Complex128, fill(2, N)...)
    perm = randperm(N)
    println("N=$N")
    println("transpose!")
    @time transpose!(C1, A, perm)
    @time transpose!(C1, A, perm)
    println("permutedims!")
    @time permutedims!(C2, A, perm)
    @time permutedims!(C2, A, perm)
    @assert C1 ≈ C2
end
``````
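Since `@stridedloops` is hard to read at a glance, here is roughly what it expands to for N=2, hand-written with simplified loop-variable names (the real expansion uses gensym'd counters) and wrapped in a hypothetical helper function `transpose2!` so it runs standalone:

```julia
# Hand expansion of
#   @stridedloops(2, dims, indA, stridesA, indC, stridesC, @inbounds C[indC] = A[indA])
# Variable names simplified; the actual expansion uses gensym'd loop counters.
function transpose2!(C, A, dims, stridesA, stridesC)
    indA2 = 1; indC2 = 1                      # outermost offsets start at 1
    for i2 in 1:dims[2]
        indA1 = indA2; indC1 = indC2          # carry offsets into dimension 1
        @simd for i1 in 1:dims[1]
            indA0 = indA1; indC0 = indC1
            indA = indA0; indC = indC0
            @inbounds C[indC] = A[indA]       # the loop body
            indA1 += stridesA[1]; indC1 += stridesC[1]
        end
        indA2 += stridesA[2]; indC2 += stridesC[2]
    end
    return C
end

# A 2x2 transpose expressed through linear indices and strides:
A = [1.0, 2.0, 3.0, 4.0]                      # column-major [1 3; 2 4]
C = zeros(4)
transpose2!(C, A, (2, 2), (1, 2), (2, 1))     # C == [1.0, 3.0, 2.0, 4.0]
```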

Oddly, the compile times are drastically improved by changing the body of the @generated function from

``````
@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N,Int})
    quote
        stridesA = A.strides
        stridesC = C.strides
        @stridedloops($N, dims, indA, stridesA, indC, stridesC, @inbounds C[indC] = A[indA])
        return C
    end
end
``````

to

``````
@generated function transpose_micro!{N}(C::StridedData{N}, A::StridedData{N}, dims::NTuple{N,Int})
    quote
        @stridedloops($N, dims, indA, A.strides, indC, C.strides, @inbounds C[indC] = A[indA])
        return C
    end
end
``````

which results in the times

N=17

• transpose! compile time: 0.56 sec
• transpose! run time: 0.00198 sec

Additionally, the compile time now scales linearly in N instead of exponentially, so the operation remains fast for larger N.

When looking at the bodies of these generated functions, there is no obvious indicator of why the compile times should be large in one version but not the other: the difference is a simple variable substitution, with instances of A.strides in the body replaced by stridesA, preceded by the line stridesA = A.strides.

I’m wondering if there is a simple way to understand and improve the compile-time performance of these generated functions, or if this is a performance bug of some sort.


Personally, any function that takes 5 days to compile and 9 seconds to run counts as a “bug” in my book.


Could you post the expanded code for these functions `code_typed(transpose_micro!, (argtypes...), optimize=false)`? Since `@stridedloops` is a macro, it’s pretty hard to determine at a glance what it will do. Also, is the time spent during `@time code_typed` or during `@time code_llvm` or during `@time code_native`?

The time is spent about equally between `code_llvm` and `code_native`. Almost no time is spent in `code_typed`.
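(For anyone wanting to reproduce the per-stage split, I timed it with something like the following; `DevNull` is the 0.5/0.6 spelling, and each later stage implicitly includes the earlier ones:)

```julia
# Rough per-stage compile timing (Julia 0.5/0.6 era API; the IO constant
# and signatures differ on newer versions).
N = 16
argtypes = (StridedData{N,Float64}, StridedData{N,Float64}, NTuple{N,Int})
@time code_typed(transpose_micro!, argtypes)            # type inference
@time code_llvm(DevNull, transpose_micro!, argtypes)    # + LLVM IR generation
@time code_native(DevNull, transpose_micro!, argtypes)  # + native code generation
```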
Here is the output of

``````
N = 5
code_typed(transpose_micro!, (StridedData{N, Float64}, StridedData{N, Float64}, NTuple{N, Int}), optimize=false)
``````

with the slow-compiling transpose_micro!:

``````
1-element Array{Any,1}:
CodeInfo(:(begin  # line 34:
# meta: location /home/brayden/Code/ncon/test/mwe.jl # line 35:
stridesA = (Core.getfield)(A, :strides)::NTuple{5,Int64} # line 36:
stridesC = (Core.getfield)(C, :strides)::NTuple{5,Int64} # line 37:
indA5 = 1
indC5 = 1
SSAValue(0) = (Main.colon)(1, (Main.getindex)(dims, 5)::Int64)::UnitRange{Int64}
#temp#@_31 = (Base.start)(SSAValue(0))::Int64
12:
unless !((Base.done)(SSAValue(0), #temp#@_31)::Bool)::Bool goto 119
SSAValue(1) = (Base.next)(SSAValue(0), #temp#@_31)::Tuple{Int64,Int64}
780 = (Core.getfield)(SSAValue(1), 1)::Int64
#temp#@_31 = (Core.getfield)(SSAValue(1), 2)::Int64
indA4 = indA5
indC4 = indC5
SSAValue(2) = (Main.colon)(1, (Main.getindex)(dims, 4)::Int64)::UnitRange{Int64}
#temp#@_27 = (Base.start)(SSAValue(2))::Int64
21:
unless !((Base.done)(SSAValue(2), #temp#@_27)::Bool)::Bool goto 114
SSAValue(3) = (Base.next)(SSAValue(2), #temp#@_27)::Tuple{Int64,Int64}
779 = (Core.getfield)(SSAValue(3), 1)::Int64
#temp#@_27 = (Core.getfield)(SSAValue(3), 2)::Int64
indA3 = indA4
indC3 = indC4
SSAValue(4) = (Main.colon)(1, (Main.getindex)(dims, 3)::Int64)::UnitRange{Int64}
#temp#@_23 = (Base.start)(SSAValue(4))::Int64
30:
unless !((Base.done)(SSAValue(4), #temp#@_23)::Bool)::Bool goto 109
SSAValue(5) = (Base.next)(SSAValue(4), #temp#@_23)::Tuple{Int64,Int64}
778 = (Core.getfield)(SSAValue(5), 1)::Int64
#temp#@_23 = (Core.getfield)(SSAValue(5), 2)::Int64
indA2 = indA3
indC2 = indC3
SSAValue(6) = (Main.colon)(1, (Main.getindex)(dims, 2)::Int64)::UnitRange{Int64}
#temp#@_19 = (Base.start)(SSAValue(6))::Int64
39:
unless !((Base.done)(SSAValue(6), #temp#@_19)::Bool)::Bool goto 104
SSAValue(7) = (Base.next)(SSAValue(6), #temp#@_19)::Tuple{Int64,Int64}
777 = (Core.getfield)(SSAValue(7), 1)::Int64
#temp#@_19 = (Core.getfield)(SSAValue(7), 2)::Int64
indA1 = indA2
indC1 = indC2
# meta: location simdloop.jl # line 65:
r#781 = (Main.colon)(1, (Main.getindex)(dims, 1)::Int64)::UnitRange{Int64} # line 66:
SSAValue(8) = (Base.simd_outer_range)(r#781)::UnitRange{Int64}
#temp#@_14 = (Base.start)(SSAValue(8))::Int64
52:
unless !((Base.done)(SSAValue(8), #temp#@_14)::Bool)::Bool goto 96
SSAValue(9) = (Base.next)(SSAValue(8), #temp#@_14)::Tuple{Int64,Int64}
i#782 = (Core.getfield)(SSAValue(9), 1)::Int64
#temp#@_14 = (Core.getfield)(SSAValue(9), 2)::Int64 # line 67:
NewvarNode(:(776))
n#783 = (Base.simd_inner_length)(r#781, i#782)::Int64 # line 68:
unless ((Main.zero)(n#783)::Int64 < n#783)::Bool goto 94 # line 70:
i#784 = (Main.zero)(n#783)::Int64 # line 71:
NewvarNode(:(indA0))
NewvarNode(:(indC0))
NewvarNode(:(indA))
NewvarNode(:(indC))
NewvarNode(:(##776))
70:
unless (i#784 < n#783)::Bool goto 91 # line 72:
##776 = (Base.simd_index)(r#781, i#782, i#784)::Int64 # line 73:
indA0 = indA1
indC0 = indC1
indA = indA0
indC = indC0
$(Expr(:inbounds, true))
SSAValue(10) = (Main.getindex)(A, indA)::Float64
(Main.setindex!)(C, SSAValue(10), indC)::Void
$(Expr(:inbounds, :pop))
indA1 = (indA1 + (Main.getindex)(stridesA, 1)::Int64)::Int64
indC1 = (indC1 + (Main.getindex)(stridesC, 1)::Int64)::Int64 # line 74:
i#784 = (i#784 + 1)::Int64 # line 75:
$(Expr(:simdloop))
89:
goto 70
91:  # line 79:
776 = (Main.last)(r#781)::Int64
94:
goto 52
96:  # line 84:
Main.nothing
# meta: pop location
indA2 = (indA2 + (Main.getindex)(stridesA, 2)::Int64)::Int64
indC2 = (indC2 + (Main.getindex)(stridesC, 2)::Int64)::Int64
102:
goto 39
104:
indA3 = (indA3 + (Main.getindex)(stridesA, 3)::Int64)::Int64
indC3 = (indC3 + (Main.getindex)(stridesC, 3)::Int64)::Int64
107:
goto 30
109:
indA4 = (indA4 + (Main.getindex)(stridesA, 4)::Int64)::Int64
indC4 = (indC4 + (Main.getindex)(stridesC, 4)::Int64)::Int64
112:
goto 21
114:
indA5 = (indA5 + (Main.getindex)(stridesA, 5)::Int64)::Int64
indC5 = (indC5 + (Main.getindex)(stridesC, 5)::Int64)::Int64
117:
goto 12
119:  # line 38:
SSAValue(11) = C
# meta: pop location
return SSAValue(11)
end))=>StridedData{5,Float64}
``````

and here is the same with the fast-compiling transpose_micro!:

``````
1-element Array{Any,1}:
CodeInfo(:(begin  # line 43:
# meta: location /home/brayden/Code/ncon/test/mwe.jl # line 44:
indA5 = 1
indC5 = 1
SSAValue(0) = (Main.colon)(1, (Main.getindex)(dims, 5)::Int64)::UnitRange{Int64}
#temp#@_31 = (Base.start)(SSAValue(0))::Int64
8:
unless !((Base.done)(SSAValue(0), #temp#@_31)::Bool)::Bool goto 115
SSAValue(1) = (Base.next)(SSAValue(0), #temp#@_31)::Tuple{Int64,Int64}
789 = (Core.getfield)(SSAValue(1), 1)::Int64
#temp#@_31 = (Core.getfield)(SSAValue(1), 2)::Int64
indA4 = indA5
indC4 = indC5
SSAValue(2) = (Main.colon)(1, (Main.getindex)(dims, 4)::Int64)::UnitRange{Int64}
#temp#@_27 = (Base.start)(SSAValue(2))::Int64
17:
unless !((Base.done)(SSAValue(2), #temp#@_27)::Bool)::Bool goto 110
SSAValue(3) = (Base.next)(SSAValue(2), #temp#@_27)::Tuple{Int64,Int64}
788 = (Core.getfield)(SSAValue(3), 1)::Int64
#temp#@_27 = (Core.getfield)(SSAValue(3), 2)::Int64
indA3 = indA4
indC3 = indC4
SSAValue(4) = (Main.colon)(1, (Main.getindex)(dims, 3)::Int64)::UnitRange{Int64}
#temp#@_23 = (Base.start)(SSAValue(4))::Int64
26:
unless !((Base.done)(SSAValue(4), #temp#@_23)::Bool)::Bool goto 105
SSAValue(5) = (Base.next)(SSAValue(4), #temp#@_23)::Tuple{Int64,Int64}
787 = (Core.getfield)(SSAValue(5), 1)::Int64
#temp#@_23 = (Core.getfield)(SSAValue(5), 2)::Int64
indA2 = indA3
indC2 = indC3
SSAValue(6) = (Main.colon)(1, (Main.getindex)(dims, 2)::Int64)::UnitRange{Int64}
#temp#@_19 = (Base.start)(SSAValue(6))::Int64
35:
unless !((Base.done)(SSAValue(6), #temp#@_19)::Bool)::Bool goto 100
SSAValue(7) = (Base.next)(SSAValue(6), #temp#@_19)::Tuple{Int64,Int64}
786 = (Core.getfield)(SSAValue(7), 1)::Int64
#temp#@_19 = (Core.getfield)(SSAValue(7), 2)::Int64
indA1 = indA2
indC1 = indC2
# meta: location simdloop.jl # line 65:
r#790 = (Main.colon)(1, (Main.getindex)(dims, 1)::Int64)::UnitRange{Int64} # line 66:
SSAValue(8) = (Base.simd_outer_range)(r#790)::UnitRange{Int64}
#temp#@_14 = (Base.start)(SSAValue(8))::Int64
48:
unless !((Base.done)(SSAValue(8), #temp#@_14)::Bool)::Bool goto 92
SSAValue(9) = (Base.next)(SSAValue(8), #temp#@_14)::Tuple{Int64,Int64}
i#791 = (Core.getfield)(SSAValue(9), 1)::Int64
#temp#@_14 = (Core.getfield)(SSAValue(9), 2)::Int64 # line 67:
NewvarNode(:(785))
n#792 = (Base.simd_inner_length)(r#790, i#791)::Int64 # line 68:
unless ((Main.zero)(n#792)::Int64 < n#792)::Bool goto 90 # line 70:
i#793 = (Main.zero)(n#792)::Int64 # line 71:
NewvarNode(:(indA0))
NewvarNode(:(indC0))
NewvarNode(:(indA))
NewvarNode(:(indC))
NewvarNode(:(##785))
66:
unless (i#793 < n#792)::Bool goto 87 # line 72:
##785 = (Base.simd_index)(r#790, i#791, i#793)::Int64 # line 73:
indA0 = indA1
indC0 = indC1
indA = indA0
indC = indC0
$(Expr(:inbounds, true))
SSAValue(10) = (Main.getindex)(A, indA)::Float64
(Main.setindex!)(C, SSAValue(10), indC)::Void
$(Expr(:inbounds, :pop))
indA1 = (indA1 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 1)::Int64)::Int64
indC1 = (indC1 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 1)::Int64)::Int64 # line 74:
i#793 = (i#793 + 1)::Int64 # line 75:
$(Expr(:simdloop))
85:
goto 66
87:  # line 79:
785 = (Main.last)(r#790)::Int64
90:
goto 48
92:  # line 84:
Main.nothing
# meta: pop location
indA2 = (indA2 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 2)::Int64)::Int64
indC2 = (indC2 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 2)::Int64)::Int64
98:
goto 35
100:
indA3 = (indA3 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 3)::Int64)::Int64
indC3 = (indC3 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 3)::Int64)::Int64
103:
goto 26
105:
indA4 = (indA4 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 4)::Int64)::Int64
indC4 = (indC4 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 4)::Int64)::Int64
108:
goto 17
110:
indA5 = (indA5 + (Main.getindex)((Core.getfield)(A, :strides)::NTuple{5,Int64}, 5)::Int64)::Int64
indC5 = (indC5 + (Main.getindex)((Core.getfield)(C, :strides)::NTuple{5,Int64}, 5)::Int64)::Int64
113:
goto 8
115:  # line 45:
SSAValue(11) = C
# meta: pop location
return SSAValue(11)
end))=>StridedData{5,Float64}
``````

As you can see, the only difference is a variable substitution: in the second version, the single call stridesA = (Core.getfield)(A, :strides)::NTuple{5,Int64} is gone, and the (Core.getfield) call is instead repeated everywhere stridesA is needed in the 5 nested for loops. Perhaps this is why the run time of the second version is slightly slower, but it doesn’t seem to explain the drastic difference in the compile time.

I should also mention which Julia version I’m using.
The behavior I describe above is in Julia 0.5.2, specifically:

``````
Julia Version 0.5.2
Commit f4c6c9d4bb (2017-05-06 16:34 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
WORD_SIZE: 64
BLAS: libopenblas (NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: liblapack.so.3
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, broadwell)
``````

I just had a chance to test the same on Julia 0.6.0, with different results: now both versions of transpose_micro! take as long to compile as the slow version of transpose_micro! did on Julia 0.5.2! The LLVM version is also different, specifically:

``````
Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-6300HQ CPU @ 2.30GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
``````

I’ll try a more recent Julia from the master branch to see what happens there.