Building a PC optimized for "time to first plot"

An even bigger issue is that Pluto doesn’t work with custom system images, last time I checked. So yes, Pluto is currently left out of these solutions.

1 Like

I think one problem is that using system images goes right against the principle of using a new Project.toml for each separate project rather than relying on one (or a few) global environments. Also, it doesn’t help at all when developing a package, which is what I do most of the time. For me, what would help the most is native code caching that stays intact as much as possible when the Project.toml changes. I would already be using sysimages if my work were “static” enough for that.

2 Likes

It does. You don’t need to put the package you’re developing into the system image, just its heavy dependencies: Plots, StaticArrays, FillArrays, etc. A few basic pieces chop out a ton of the compile time.
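For instance, a rough sketch with PackageCompiler.jl (the package list and output path are placeholders; adapt them to whatever your project actually pulls in):

using PackageCompiler

# Bake only the heavy, stable dependencies into the sysimage;
# the package you are developing stays out of it and recompiles as usual.
create_sysimage(["Plots", "StaticArrays", "FillArrays"];
                sysimage_path = "deps_sysimage.so")

Then start Julia with julia -J deps_sysimage.so --project and develop on top of that.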

1 Like

Using Pkg.activate, it is possible to use any environment for a Pluto notebook’s dependencies, and hence to benefit from PackageCompiler sysimages.

My usual workflow, when dealing with multiple notebooks sharing the same packages:

  • Create an environment specific to the notebooks’ dependencies (e.g. MyExcitingProjectInPluto), excluding Pluto.jl itself.
  • Build a sysimage including this environment’s packages using PackageCompiler.jl. I may optionally use a precompile script (doc) if heavy compile-time packages are not snooped, as Chris explained (i.e. first function calls still have a significant overhead with the sysimage in practice).
  • Launch the Pluto session (Pluto.run()) with sysimage="path/to/sysimage.so" keyword argument.
  • When creating a new notebook, activate MyExcitingProjectInPluto project before importing any package.

And I’m good to go brrr!
Hope that helps.
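For reference, a rough sketch of that workflow in code (the environment name, package list, and sysimage path are all placeholders):

using Pkg, PackageCompiler

# 1. A dedicated environment holding the notebooks' dependencies (but not Pluto itself)
Pkg.activate("MyExcitingProjectInPluto")
Pkg.add(["DataFrames", "Plots"])   # whatever the notebooks actually need

# 2. Build a sysimage from that environment's packages
#    (optionally passing a precompile script, as mentioned above)
create_sysimage(["DataFrames", "Plots"]; sysimage_path = "pluto_deps.so")

# 3. Start Pluto with the sysimage (Pluto itself lives in the default environment)
using Pluto
Pluto.run(sysimage = "pluto_deps.so")

In each new notebook, the first cell then runs import Pkg; Pkg.activate("MyExcitingProjectInPluto") before importing anything else.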

4 Likes

It’s definitely good to hear that there have been significant improvements in using time and first-run time. Still, trying to run Acausal Component-Based Modeling the RC Circuit · ModelingToolkit.jl takes almost two minutes just to load the packages and only a few seconds to do the solve.

$ time julia --project acausal.jl 
  2.462438 seconds (3.55 M allocations: 248.002 MiB, 3.49% gc time, 100.00% compilation time)

real	1m55,004s
user	1m51,727s
sys	0m4,015s

The improved sysimage integration seems more promising, and definitely something I will play with. It’s been compiling the image for a few minutes already as I write this.

Meanwhile, I think the question of what makes a good computer for loading Julia code as fast as possible is still valid. You can hopefully avoid doing it every time you make a change, but at some point you’ll have to compile stuff, and making that fast will keep me in the flow more.

Someone on Twitter suggested toplev, which can tell you where a piece of code is bottlenecked. It seems like the above acausal example is completely bottlenecked on memory latency and bandwidth, so I would think fast memory and big caches are what you want in this case, right?

$ toplev -l2 julia --project acausal.jl 
Will measure complete system.
  2.156586 seconds (3.55 M allocations: 248.002 MiB, 3.64% gc time, 100.00% compilation time)
# 4.4-full-perf on Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz [cfl/skylake]
C3    FE             Frontend_Bound                  % Slots                      31.1   [14.3%]<==
C3    FE             Frontend_Bound.Fetch_Latency    % Slots                      16.3   [14.3%]
	This metric represents fraction of slots the CPU was stalled
	due to Frontend latency issues...
	Sampling events:  frontend_retired.latency_ge_16:pp frontend_retired.latency_ge_8:pp
C3    FE             Frontend_Bound.Fetch_Bandwidth  % Slots                      14.6   [14.3%]
	This metric represents fraction of slots the CPU was stalled
	due to Frontend bandwidth issues...
	Sampling events:  frontend_retired.latency_ge_2_bubbles_ge_1:pp frontend_retired.latency_ge_1:pp frontend_retired.latency_ge_2:pp
C3-T0 MUX                                            %                            14.28 
	PerfMon Event Multiplexing accuracy indicator
C3-T1 MUX                                            %                            14.28 
Run toplev --describe Frontend_Bound^ to get more information on bottleneck
Add --run-sample to find locations
Add --nodes '!+Frontend_Bound*/2,+MUX' for breakdown.
Idle CPUs 0-2,4-8,10-11 may have been hidden. Override with --idle-threshold 100

I’m not sure if I can further drill down and find the optimal CPU & RAM choice. I think AMD CPUs tend to have bigger caches than Intel ones, but I’ve also heard the Apple CPUs have insane memory bandwidth. If anyone would care to run the example on their machine it’d be interesting, but in the end I’m not sure this will lead to any useful insight beyond “fast CPU is fast”.

p.s. I wonder why it’s bottlenecked on frontend fetch

p.p.s. it took over 10 minutes to make the sysimage, which sped up the using time but still left large delays on the first run, so another 10 minutes later I made a sysimage with the example itself in execution_files, which finally made it not infuriatingly slow to run. So it does require a decent amount of fiddling and waiting to get the good startup times, which is once again why I’m looking into what PC would be fast at that.
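For reference, doing that directly with PackageCompiler.jl would look something like this (the package list and paths are guesses for this particular example):

using PackageCompiler

# Replaying the actual script during the sysimage build caches the
# first-run compilation as well, not just the `using` time.
create_sysimage(["ModelingToolkit", "OrdinaryDiffEq", "Plots"];
                sysimage_path = "acausal_sysimage.so",
                precompile_execution_file = "acausal.jl")

and then julia -J acausal_sysimage.so --project acausal.jl for the fast run.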

3 Likes

MTK doesn’t have a snoop precompile block yet. Someone just needs to add it and it would be automatic. Also, the majority of the MTK issue is that it needs the Unityper form to be merged, so just go bug @shashi and you’ll get something that even an iPhone will handle fine.

2 Likes

Speaking of Plots, using SnoopPrecompile for precompilation instead of regular Base.precompile statements reduced ttfp by about 25%, which is significant considering SnoopPrecompile is a free lunch.

It got introduced a month ago: rework precompilation using `SnoopPrecompile` by t-bltg · Pull Request #4334 · JuliaPlots/Plots.jl · GitHub.

3 Likes

Single-core performance is what currently matters, so probably wait and get the fastest of the Intel 13th gen.

7 Likes

Another “better wait” suggestion, based on your remark about cache, is the upcoming AMD Zen4 refresh with 3D V-Cache, supposedly coming out in January. In benchmarks, 3D V-Cache (a technology for stacking a lot of extra cache onto the CPU) makes Zen3 parts very competitive with newer Zen4 (without V-Cache) and Intel.

2 Likes

It seems that people are getting quite worked up about TTFX problems. I just wonder why we don’t simply introduce static compilation/semantics into the core language, given the intricate and critical nature of this issue.

A powerful computer won’t help at all, given the end of Moore’s law, and not everyone can afford a new computer. What’s more, all these temporary workarounds suffer from the unfortunate fact that Julia’s ecosystem hasn’t reached its expected peak, as developers sometimes constrain themselves so as not to exacerbate the latency problem. The general assumption that Julia will largely eliminate latency in the future through precompilation/sysimages is inaccurate, since these improvements will be counterbalanced by growing codebases, to say nothing of the technical difficulties.

2 Likes

System images are compilation controls in the core language; they are just underused because the tooling used to be awful. The tooling isn’t so bad today, but the perception change has been lagging.

3 Likes

Julia v1.8 was released less than two months ago, and the tooling’s still catching up (especially for development environments other than VSCode), so I’m not sure perception is really ‘lagging’ unless you’ve been living on bleeding-edge builds.

2 Likes

I believe that’s only part of the story. We already know that the use of precompile files is endorsed by the core team. That’s why many Julians unconditionally advocate precompilation instead of sysimages. Unless the sysimage is truly built into the language, it deserves its current reputation, no matter how perfect its current toolchain is.

What? It is “truly built into the language”. The functionality to build system images is part of core Julia, and every Julia binary ships with a system image that includes Base and the standard libraries. The only thing the tooling does is perform the core system image build with a non-standard set of packages.

So the perception is lagging behind an update from about two months ago?

“On track for adoption” is the marketing speak :wink:

1 Like

Your own experience is as a core Julia contributor whose daily work is 100% centered around Julia. Most scientists and engineers do not expect their tools to change so rapidly - I recorded an Ansys Fluent tutorial for a course back in 2015, based on Fluent 2014, and it’s only now diverging enough from current Fluent capabilities to require an update.

Julia’s rate of progress is great! But let’s not imply that users are laggards or ignoramuses for treating Julia like a computational tool which they can pick up and put down as their needs require without having to re-learn the heuristics for an efficient workflow.

14 Likes

Can someone briefly summarize what package maintainers should do these days?
Putting some

using SnoopPrecompile 

somewhere in the code by default? Is that it?

1 Like

Yeah. While scientists from diverse fields may have relatively less programming knowledge than professional CS people, their somewhat awkward experience of the latency problem and insufficient toolchains is still real. That’s why I say the system image is just a temporary solution: a whole system image still takes a long time to build and is rather complicated. What’s more, it’s hard to test and debug. What happens if I encounter a segfault in the middle of the image-building process? Can I circumvent the problematic packages using some readable logs, instead of just reporting issues and waiting for upstream fixes?

Base and the standard libraries are quite special, as they are carefully written to avoid potential compilation-related problems, like bootstrapping, and to support all platforms. But that doesn’t imply that all packages work out of the box. That’s why I said “truly built into the language”: unless it’s officially supported and even encouraged, so that developers carefully cultivate their packages to prevent compilation failures, we users won’t know it’s time to move onto system images.

Yes, standard workflows should be put into SnoopPrecompile blocks. The blog post:

describes why it’s important in some detail.
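In practice, that means adding something like this inside the package’s module (a hedged sketch; the workload and the my_workload name are placeholders for whatever the package actually does):

using SnoopPrecompile

@precompile_setup begin
    # Set up whatever inputs the workload needs; this part is not cached.
    data = rand(100)
    @precompile_all_calls begin
        # Everything called in here gets compiled during precompilation and
        # cached, so users pay the cost at install time rather than at first call.
        my_workload(data)
    end
end

where my_workload stands in for the package’s typical entry points.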

No, absolutely not. That’s just factually wrong. They have low compile times by being in the system image. If you omitted them from the system image, they would also have long compile times. That’s why pieces like LinearAlgebra were not removed from the Julia Base build: they need to be there in order to be in the system image! It was done with SparseArrays and then reversed because of this problem.

2 Likes

I definitely know Base and standard libraries are included in the system image. That’s why using them takes basically zero time.

What you have said just proves my point: Base and the standard libraries are specially maintained and carefully cultivated, so people won’t run into issues related to system images. For example, you won’t have a C pointer or an unfinished Task accidentally cached somewhere. Without the system image you can’t even launch Julia successfully. The core team pays attention to it, while users of custom system images are not that lucky.