Strange issue with Julia hanging

Hi all,

I am posting a strange issue I am currently encountering with TensorOperations.jl here in the development category, because I think it’s pretty advanced. Consider the following code:

using TensorOperations
const rows=10;
const cols=10;
const Sx=ComplexF64[0 1;1 0];
const Sz=ComplexF64[1 0;0 -1];
const Sy=ComplexF64[0 -1im;1im 0];
const Id=one(Sz);
@time begin
@tensor eplaq[-1,-2,-3,-4,-5,-6,-7,-8]:=Id[-1,-2]*Id[-3,-4]*Id[-5,-6]*Id[-7,-8]
end; # 27 seconds
plaq=Array{typeof(eplaq),2}(undef,rows,cols)
for r in 1:rows
   for c in 1:cols
      plaq[r,c]=zero(eplaq)
   end
end;
r = 1;
c = 1;
@time begin
@tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end; # 0.005 seconds
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 4}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 2}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 4}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 2}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 6}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{Int64, Int64, Int64, Int64, Int64, Int64}, Tuple{}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 4}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 6}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{Int64, Int64, Int64, Int64, Int64, Int64}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 4}, Symbol, Symbol});
precompile(Tuple{typeof(Base.:(*)), Base.Complex{Float64}, Int64});

t0 = time();
@time for r in 1:rows-1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end; # 0.0002 seconds
t = time() - t0 # actually, that was 85 seconds

t0 = time();
@time for r in 1:rows-1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end; # 0.0002 seconds
t = time() - t0 # 0.03 seconds, fair enough

I run this just by copy-pasting into the REPL of Julia 1.1, with the --trace-compile option to monitor what is being compiled. If I start by copying the first block of code, up to the first empty line, everything is fine. The first @tensor call takes some time, but things are being compiled in the meantime (I can monitor the precompile statements being generated), and I am probably doing some fishy things over in TensorOperations. The second @time statement runs almost instantaneously, even though a few more functions are being compiled.
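For reference, the session was started along these lines (the file name here is just an example, not from the actual session):

julia --trace-compile=stderr      # print precompile statements as they are generated
# or, to collect them in a file:
julia --trace-compile=trace.jl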

Now copy-paste the next block of code, the one after the empty line. Note that the body of the loop is exactly the same as the contents of the second @time statement, which ran instantaneously. No new functions are being compiled, i.e. no new precompile statements are being generated (up to maybe one or two REPL functions, but nothing from my packages). I specifically added the explicit precompile statements to ensure that everything was already compiled.

The time that the @time macro reports is also very small. Nonetheless, the actual run time is more than 80 seconds. I also tried @profile, but it says it did not collect any samples. Copy-pasting that block a second time, everything is fast.
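For completeness, the profiling attempt was roughly like this (a reconstruction, not the literal session):

using Profile
Profile.clear()
@profile for r in 1:rows-1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end
Profile.print()   # only reports that no samples were collected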

So my question: where are these 80 seconds going? What is Julia doing that it had not already done, does not need to do the second time, and does not show up in @time? I used every trick up my sleeve, but did not get any further insight.

This could be due to the automatic interpreter heuristics choosing to interpret parts of the first @time block, while the version with the for loop gets compiled.

Try running with julia --compile=min vs julia --compile=all?

I don’t think it’s due to a difference between compiled and interpreted code. Or at the very least, the problem persists when using julia --compile=all (huge difference in timings), but everything is very fast when using --compile=min.

Agreed it’s not clear-cut with --compile=all, but I’m not sure whether “all” really means all yet. Given that --compile is an undocumented option, I think it’s still a work in progress.

Could there be some issue with functions that take many arguments? In TensorOperations.jl, I have several internal functions with 12 or more arguments. Could it be that Julia is no longer specializing on those, and therefore has to decide dynamically in the loop which functions to compile? But still, it is not compiling new functions, nor should it be, as everything was already compiled before.

I am still very much struggling. There is one specific line which seems to be a major culprit: a line in a function which is itself never called, because it appears in a branch that is never taken. But replacing that line with something else with the same return type makes all the difference.

More weirdness, replace the slow block with

t0 = time();
@time let x = 1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end
@show t = time() - t0

t0 = time();
@time for x = 1:1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end
@show t = time() - t0

The let block executes rapidly, as expected; everything is compiled. The for block, which I believed was in every possible way equivalent to the let block and in no way affects the body, still faces the same problem.

Also, the fact that whatever Julia is doing does not show up in @time or @profile seems to indicate that it is not compilation, type inference, or that kind of thing. It’s also not the macro expansion:

macroexpand(Main, quote
for r in 1:rows-1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end
end)

executes just fine.

And more weirdness. I reduced the test case dimensionality to 6, to make it slightly more enjoyable to run. I also took out some non-essential ingredients, and now define a function to make sure global scope is not doing strange things:

using TensorOperations

const Sx=ComplexF64[0 1;1 0];
const Sz=ComplexF64[1 0;0 -1];
const Sy=ComplexF64[0 -1im;1im 0];
const Id=one(Sz);

t0 = time();
@time @tensor eplaq[-1,-2,-3,-4,-5,-6]:=Id[-1,-2]*Id[-3,-4]*Id[-5,-6];
@show t = time() - t0

t0 = time();
@time @tensor eplaq[-1,-2,-3,-4,-5,-6]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6];
@show t = time() - t0
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 4}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 2}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 4}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 2}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 6}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{Int64, Int64, Int64, Int64, Int64, Int64}, Tuple{}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 4}, Symbol, Symbol});
precompile(Tuple{typeof(TensorOperations.checked_similar_from_indices), Array{Base.Complex{Float64}, 6}, Type{Base.Complex{Float64}}, Tuple{Int64, Int64}, Tuple{Int64, Int64, Int64, Int64}, Tuple{Int64, Int64, Int64, Int64, Int64, Int64}, Array{Base.Complex{Float64}, 2}, Array{Base.Complex{Float64}, 4}, Symbol, Symbol});
precompile(Tuple{typeof(Base.:(*)), Base.Complex{Float64}, Int64});

println("run in macroexpand")
t0 = time();
@time macroexpand(Main, quote
for r in 1:rows-1
   @tensor eplaq[-1,-2,-3,-4,-5,-6]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]
end
end)
@show t = time() - t0

println("run in let block")
t0 = time();
@time let x = 1
   @tensor eplaq[-1,-2,-3,-4,-5,-6]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]
end
@show t = time() - t0

println("define function")
t0 = time();
@time function f!(eplaq, Sz, Id)
	@tensor eplaq[-1,-2,-3,-4,-5,-6]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]
end
@show t = time() - t0

println("run function")
t0 = time();
@time f!(eplaq, Sz, Id)
@show t = time() - t0

println("run function in for loop")
t0 = time();
@time for x in 1:1
	f!(eplaq, Sz, Id)
end
@show t = time() - t0

spits out

 26.860512 seconds (99.85 M allocations: 4.602 GiB, 8.66% gc time)
t = time() - t0 = 30.359145879745483
  0.002502 seconds (67 allocations: 4.203 KiB)
t = time() - t0 = 0.5456759929656982
run in macroexpand
  0.000806 seconds (1.08 k allocations: 66.234 KiB)
t = time() - t0 = 0.0018911361694335938
run in let block
  0.000036 seconds (25 allocations: 1.672 KiB)
t = time() - t0 = 0.0030088424682617188
define function
  0.000044 seconds (50 allocations: 4.358 KiB)
t = time() - t0 = 0.002705097198486328
run function
  0.012294 seconds (46.92 k allocations: 2.503 MiB)
t = time() - t0 = 0.013460159301757812
run function in for loop
  0.000032 seconds (17 allocations: 1.063 KiB)
t = time() - t0 = 12.017101049423218

So running the function, which had already run and is thus compiled, in a for loop with the exact same arguments suddenly yields 12 seconds of inexplicable time, which does not show up in the @time macro or in @profile.

A slight modification to that code

T1=typeof(eplaq);
T2=typeof(Sz);
T3=typeof(Id);

@time function f!(eplaq::T1, Sz::T2, Id::T3)
	@tensor eplaq[-1,-2,-3,-4,-5,-6]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]
end

removes the difference in timing. Can we find out how much time is spent doing type inference?
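One thing that might help here, though I believe it requires a newer Julia than 1.1, is SnoopCompile’s @snoopi macro, which attributes time to inference. A rough sketch (the @eval is there so that the loop itself is only compiled while @snoopi is measuring):

using SnoopCompile
inf_timings = @snoopi @eval for x in 1:1
   f!(eplaq, Sz, Id)
end
display(inf_timings)   # pairs of (inference time, method instance), if any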

Just to explain some of the phenomena seen here: when we see a top-level expression containing a loop, we infer and optimize it before running it. That time is not reflected by @time or @profile, since the code that starts the timer or profiler has not started running yet; we’re still compiling. Doing @time @eval <code> would probably include the compilation time. When we infer top-level expressions like this, if they refer to global variables, then we end up inferring code on abstract types, which is usually slower. That is a problem even if the function called inside has run before: while we may have executed the function, we haven’t necessarily inferred it on the abstract types used by the top-level expression we’re compiling.
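Concretely, something along these lines should then make the hidden time visible (a sketch of the suggestion above, untested):

@time @eval for r in 1:rows-1
   @tensor plaq[r,c][-1,-2,-3,-4,-5,-6,-7,-8]+=Sz[-1,-2]*Sz[-3,-4]*Id[-5,-6]*Id[-7,-8]
end
# with @eval, the loop is only lowered, inferred and compiled once @time is
# already running, so the previously invisible seconds should now be reported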

Perhaps we should adjust this heuristic; e.g. if the loop refers to a non-constant global then don’t try to optimize it?


Thanks for the response; that makes a lot of sense. I find that the code generated by the @tensor macro in my TensorOperations.jl package does not play well with type inference on abstract types. My colleague @maarten_van_damme has a codebase, not abnormally large, where the first function call (a chain of functions all relying on several of my packages) takes two and a half hours to compile, whereas the second call then only takes a fraction of a second.

That is indeed impressive :smile:

Looking into this, it seems to be hitting what I’ve been calling a “type inference wild goose chase”, where it’s inferring an explosion of methods that aren’t actually needed. The same thing happens in Plots.jl. I’m exploring some ways to improve the situation.
