New tools for reducing compiler latency

I’m pleased to announce the release of SnoopCompile 2.2.0, a major enhancement in the suite of tools for reducing “time to first plot.” This release provides considerably better insight into the costs of type inference, allowing you to profile the actions of inference in extensive detail. It also provides a new suite of tools for detecting and analyzing “inference triggers,” cases where a previously-uninferred method is called via runtime dispatch.

As a quick visual for the new release, I’ll paste the following image from SnoopCompile’s new documentation:

This flamegraph may be familiar to those of you who have used ProfileView or similar tools to improve runtime performance; the image above is a flamegraph for type-inference, allowing you to answer the questions “which methods are being inferred, with what types?” and “how long does this take?” In this case, the red bars correspond to methods owned by Julia or other packages that are “non-precompilable” (though there are tricks to change that!).
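
For readers who want to reproduce this kind of graph, here is a minimal sketch; my_workload is a hypothetical stand-in for whatever you want to profile (e.g., a first call to plot):

using SnoopCompile

my_workload() = sum(rand(100, 100))   # hypothetical stand-in for your real workload

tinf = @snoopi_deep my_workload()     # profile type inference while the workload runs
fg = flamegraph(tinf)                 # convert the inference timings to a flame-graph structure

using ProfileView
ProfileView.view(fg)                  # display it just like a runtime-profiling flamegraph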

There is also a blog post to accompany the release that explains most of the fundamentals. (Forthcoming posts will be expanded versions of material that’s already in the SnoopCompile docs, so you don’t have to wait…) I’m also pleased to report that I’ve used this to cut latency to less than half its original value for a half-dozen or so packages that I use, and there are other happy users as well.

This is the latest step in the long-running campaign to reduce Julia’s long latencies. While using this tool involves more work on the part of package developers than other steps taken during the Julia 1.6 release, it also provides you with greater insight into how the design of your packages interacts with Julia’s compiler.

I’d like to thank @NHDaly who kicked this off with some changes to Core.Compiler and SnoopCompile that made this all possible.

Happy latency-stomping!

@NHDaly and I are pleased to announce the second in the series of blog posts on tools for latency reduction: Profiling type-inference

This post introduces the core new tool for profiling type-inference. Just knowing where type inference spends its time can be pretty informative, so this may already be useful for some developers. Future posts will introduce a larger suite of tools that provide additional analysis on the output of @snoopi_deep, so stay tuned!

OK, I gave my package another try to lower those 4 seconds after a precompile.

julia> using SnoopCompile

julia> precompile(plot, (Array{Float64,2},));

julia> tinf = @snoopi_deep plot([0.0 0.; 1.1 1.])
InferenceTimingNode: 3.972431/4.182966 on InferenceFrameInfo for Core.Compiler.Timings.ROOT() with 40 direct children

julia> invalidations = @snoopr plot([0.0 0; 1.1 1])
Any[]

So they don’t come from inference problems, right? The exclusive time is 3.97 out of 4.18.
I’m not sure I did the correct check with @snoopr, but it doesn’t seem to come from invalidations either. So where is that time being spent?

Thanks

That’s not precisely the recommended way to use the tools. Here’s a step-by-step analysis of what was going on behind the scenes of your session:

precompile(plot, (Array{Float64,2},));

This will cause inference to run on plot, with a Matrix{Float64} argument. This finishes inferring everything that can be inferred about that call.

tinf = @snoopi_deep plot([0.0 0.; 1.1 1.])

Now you’re actually executing the plot command while snooping on inference. The difference between plot(data) and precompile(plot, (typeof(data),)) is the following: if inference can’t predict the type of some object from the argument types alone, it stops “exploring the call graph” any further in that direction. Whereas if you have data, Julia just calls whatever operation comes next on it and then asks, “OK, what type of object do I have?” Then it continues executing the call graph from there.

So what that means is that the precompile statement will “explore the call graph” as far as it can but then give up, whereas plot(data) traverses the entire call graph.
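
As a toy illustration (hypothetical functions, not from any real plotting package) of where precompile “gives up” but actual execution keeps going:

consume(x) = x + 1
produce_unknown() = Any[1.0][1]        # inferred return type is Any
run_it() = consume(produce_unknown())

precompile(run_it, ())   # infers run_it and produce_unknown, but stops at the un-inferrable call consume(::Any)
run_it()                 # execution discovers the value is a Float64, so consume(::Float64) gets inferred here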

Now, you’re only snooping on the second part. Consequently, your tinf only includes inference that could not be deduced from the initial statement, i.e., all the inference that happened except what could be inferred from precompile(plot, (Matrix{Float64},)). That’s not really the recommended way to use the tool, but there’s nothing really wrong with it either.

@snoopr plot([0.0 0; 1.1 1])

This will do nothing unless plot defines new methods. All the invalidations already happened when you loaded whatever package plot was defined in.
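
The usual pattern (sketched here with a hypothetical package name) is therefore to snoop on the package load itself, since that is when methods get defined and invalidations happen:

using SnoopCompileCore
invalidations = @snoopr using MyPlottingPackage   # hypothetical package; record invalidations as its methods are defined
using SnoopCompile                                # load the analysis tools only afterwards
trees = invalidation_trees(invalidations)         # organize the raw list into per-method trees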

Invalidations ≢ inference. Invalidations (@snoopr) happen when you define new methods that supersede the priority of previous methods. Inference (@snoopi_deep) is what Julia uses to predict types and generate fast code. But the two are coupled because code with poor inference (1) is at risk for invalidation, (2) tends to run slowly because Julia generates bad code, and (3) can’t be precompiled effectively. Historically we’ve had good tools for (2) with @profile; think of @snoopr as a profiler for (1) and @snoopi_deep partly as a profiler for (3).
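
A toy sketch of coupling (1), loosely following the examples in the SnoopCompile docs (hypothetical functions; assumes SnoopCompile is already loaded): code compiled on top of poor inference gets invalidated by a later method definition:

f(x::Int) = 1
callf(container) = f(container[1])   # container[1] infers as Any, so the call to f is poorly inferred
callf(Any[1])                        # compile callf; its code leans on f having only this one method

invs = @snoopr f(x::String) = 2      # a new method supersedes that assumption...
invalidation_trees(invs)             # ...and the trees show callf’s compiled code being invalidated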

Oh, and to answer the question: the amount of time spent on inference was 4.18-3.97 = 0.21s. Everything else was likely codegen (compilation steps that take place after inference finishes) and any time to actually produce the plot.

What that means is that only 0.21s of the necessary inference could not be deduced from the initial precompile statement. But I bet you’ll see a larger inference time (a larger difference between the exclusive and inclusive times) if you don’t first say precompile(plot, (Matrix{Float64},)).

As of today, though, that 0.21s is an upper bound on the amount of time you could save by solving inference problems or forcibly precompiling the “unreachable” MethodInstances. The new release of SnoopCompile has much better tools for finding the “precompilable nuggets” than the old @snoopi did, but the blog posts haven’t gotten that far yet. (The impatient can read the SnoopCompile docs; there is no need to wait.)
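
As a pointer, the inference-trigger tools look roughly like this (a sketch based on the docs; check your SnoopCompile version for the exact API):

tinf = @snoopi_deep plot([0.0 0.; 1.1 1.])   # in a fresh session, without a prior precompile call
itrigs = inference_triggers(tinf)            # calls that entered inference via runtime dispatch
itrigs[1]                                    # inspect one trigger: the callee, the caller, and where the runtime dispatch occurred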

In the long run, it may be hoped that Julia will start caching more than just inference results, in which case some portion of that 3.97s would also be retrievable.

Thanks, this is crystal clear to me now. And you are right that without any precompilation the difference is way larger.

julia> tinf = @snoopi_deep plot([0.0 0.; 1.1 1.])
InferenceTimingNode: 4.749361/17.162196 on InferenceFrameInfo for Core.Compiler.Timings.ROOT() with 32 direct children

julia> tinf = @snoopi_deep plot([0.0 0.; 1.1 1.])
InferenceTimingNode: 0.014566/0.014664 on InferenceFrameInfo for Core.Compiler.Timings.ROOT() with 1 direct children

I ran the command twice to show that the plot command runs very fast once Julia is satisfied with the compilation.

Am I misinterpreting your words if I conclude that the 0.21 s of inference left over after the precompile are then responsible for most of those ~4 seconds?

That’s right: only 0.21s of inference time, but ~4s of codegen time. However, in the case without precompilation, notice it’s basically flipped: ~12s of inference time, and ~5s for the rest. So type inference is most of the (original) problem. Fortunately you can cache the results.
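
For completeness, a minimal sketch (hypothetical module, not real plotting code) of how inference results get cached: put explicit precompile directives at the top level of your package so they run during package precompilation and their inference results are saved in the .ji cache:

module MyPlots                                   # hypothetical package

plot(data::Matrix{Float64}) = extrema(data)      # stand-in for the real plotting code

# executed while the package is being precompiled; the inference results land in the package’s .ji file
precompile(plot, (Matrix{Float64},))

end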

I’ve spent days trying things out. I’ve added type annotations to almost all function arguments, but for versatility’s sake I cannot avoid having an untyped arg1, and apparently situations like this smear Anys around.

function common_plot_xyz(cmd0::String, arg1, caller::String, first::Bool, is3D::Bool, kwargs...)

For example, from an error that I managed to trigger, the cmd in this call (which can only be a String - see the comment at L201) managed to arrive in the callee function as an Any.
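
One pattern that can help (a sketch with hypothetical names, not taken from your package): pin the type down as early as possible, so the Any stays confined to the entry point instead of smearing through the whole call chain:

do_work(data::Matrix{Float64}) = extrema(data)            # concretely typed kernel: infers and precompiles well

function entry_point(arg1)                                 # arg1 stays untyped for flexibility
    data = convert(Matrix{Float64}, arg1)::Matrix{Float64} # assert/convert right away
    return do_work(data)                                   # everything from here on is concretely inferred
end

entry_point([0.0 0.; 1.1 1.])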

Very hard thing this is.