I have a function that runs very fast but is slow to compile. I have tried using SnoopCompile functions @snoop_inference and @snoop_llvm, but they don’t seem to show where the large latency is residing. I’m new to compilation profiling, so I would appreciate the advice of those with more knowledge.
To be concrete: for a length-200k input vector, the first run takes 4.1s, while each subsequent call runs in sub-millisecond time. I therefore conclude that essentially all of the 4.1s is compile time.
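For reference, a minimal sketch of the kind of timing I did (`f` and the input are placeholders for the actual code):

```julia
# Hypothetical measurement; `f` stands in for the real function.
x = rand(200_000)

@time f(x)   # first call: ~4.1 s, dominated by compilation
@time f(x)   # subsequent calls: sub-millisecond

# On Julia 1.8+ the first @time also reports "% compilation time",
# which is a quick sanity check that the 4.1 s is mostly compilation.
```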
Next, in a fresh session I run @snoop_inference on the function as in the tutorial. The root node has exclusive/inclusive times 4.0s/4.1s. According to the reference, the exclusive time for the root “corresponds to the time spent not in julia’s type inference (codegen, llvm_opt, runtime, etc).” Evidently, only 0.1s is being spent on inference, and 4.0s is spent on something else.
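Roughly the workflow I followed from the tutorial, again with `f`/`x` as placeholders:

```julia
# Fresh session; collect inference timings for the first call.
using SnoopCompile

tinf = @snoop_inference f(x)
tinf             # the root node prints the exclusive/inclusive times quoted above

# Flat list of per-method inference timings, for drilling into the 0.1s
# that actually is inference:
flatten(tinf)
```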
Finally, in yet another fresh session I run @snoop_llvm following the tutorial. The resulting “times” add up to only 0.9s. (I presume these numbers are absolute times measured in seconds, not fractions? It doesn’t seem to be documented.) That leaves approximately 3.1s of the 4.1s unaccounted for.
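And the @snoop_llvm workflow, sketched with placeholders (the reader function and its keyword are written as I understand them from the SnoopCompile docs; names may differ between versions):

```julia
# Fresh session; record LLVM optimization times for the first call.
using SnoopCompileCore

@snoop_llvm "func_names.csv" "llvm_timings.yaml" begin
    f(x)   # the slow-to-compile call
end

using SnoopCompile
times, info = SnoopCompile.read_snoop_llvm("func_names.csv", "llvm_timings.yaml";
                                            tmin_secs = 0.0)
# These `times` are what add up to only ~0.9 (seconds, as far as I can tell).
```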
Is my reasoning valid? What can I try next to find where most of the compilation time is being spent?
Well, the easiest thing would be if you could share the code with us.
Which Julia version are you on? Do you use a lot of complicated types, compile-time computations, macros, or @generated functions?
There are also other approaches to working around long compile times. For example, you could put your code into a package and rely on precompilation, so the compilation cost is paid only once, when the package is added to an environment, rather than in every session.
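A minimal sketch of that approach using PrecompileTools (the package and function names are made up for illustration):

```julia
module MyPackage

using PrecompileTools

f(x) = sum(abs2, x)   # stand-in for the real slow-to-compile function

@setup_workload begin
    x = rand(1000)             # representative input, built at precompile time
    @compile_workload begin
        f(x)                   # calls here get compiled into the package image
    end
end

end # module
```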
I’m on Julia 1.10.2. The code does not do anything too fancy, but it does use large tuples for these inputs, which I suspect has something to do with it. What’s odd is that some of the functions involving the same large tuples compile slowly and others do not. Precompilation may indeed be helpful here.
I could (and later might) share the code but I didn’t want the discussion to get bogged down in the particulars. Really I’m just trying to (1) have the community either confirm or correct my reasoning about compilation and what SnoopCompile measures, and (2) educate me on other stages of compilation that might be relevant and how they could be tested. That way I will have better understanding to solve other problems that may arise in the future.
Large tuples might be the problem: the length of a tuple is part of its type, and therefore a compile-time constant, so every function is compiled anew (possibly with unrolled loops) for each tuple length. If the tuple holds more than one element type, it’s probably even worse, because the compiler keeps track of the type in each position.
Is there a particular reason in favor of large tuples instead of a vector? (I think even tuples are heap allocated if they become too large anyway).
It might be a good idea to check whether the problem goes away after switching to a vector, and whether that actually introduces a large runtime overhead.
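To make that concrete, a toy comparison (not your code) of why tuple length forces recompilation while a vector does not:

```julia
g(t::Tuple)  = sum(t)
g(v::Vector) = sum(v)

@time g(ntuple(i -> i, 100));   # compiles a specialization for NTuple{100, Int}
@time g(ntuple(i -> i, 101));   # compiles again for NTuple{101, Int}

@time g(collect(1:100));        # compiles once for Vector{Int}
@time g(collect(1:101));        # reuses the same compiled method
```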
The tuples are an integral part of the code base, so it would take some work to replace them with vectors. Since not all the functions involving these tuples compile slowly, I was hoping there might be some other tool or way of diagnosing exactly which part of compilation is taking the missing 3.1 seconds.