RFC: a "workaround" for the multi-project precompilation cache problem without long-term code debt

tkf · March 23, 2019, 2:58am

Many of us have now realized that using multiple projects is a pain because switching projects triggers precompilation:

github.com/JuliaLang/julia

How precompile files are loaded need to change if using multiple projects are going to be pleasant

opened 11:31AM - 04 Jun 18 UTC

closed 06:08PM - 16 Aug 19 UTC

KristofferC

packages

Precompile files are currently stored only based on the UUID of the package. So if you change your project it is likely that you will have to recompile everything. And then again when you swap back etc. This will be very annoying for people trying to use multiple packages and people will likely just use one mega project like before. https://github.com/JuliaLang/julia/pull/26165 also removed any possibility for users to change the precompile path so there is no way to workaround this right now. We should be smarter how we save precompile file to reduce the amount of recompilation needed. A very simple system is to just use one precompile directory for each project but that might be a bit wasteful since it is theoretically possible to share compilation files between projects.

I suggested a “workaround” a while ago but it never got much attention:

github.com/JuliaLang/julia

Suggestion: Use different precompilation cache path for different system image

JuliaLang:master ← tkf:slug-image

opened 01:08AM - 03 Nov 18 UTC

tkf

+4 -1

With PackageCompiler.jl, it is easy to create custom system images. I think it would be useful if you could create a custom image for some Julia projects you use repeatedly. However, it is hard to use different system images since all the precompilation cache files are stale after switching the system image. This patch makes it possible to use a dedicated set of cache files for each system image. Each system image is identified by its path so that precompilation cache files used for Julia master are updated in-place after you pull and build Julia (as opposed to increasing disk usage for each pull). (As a side-effect, this patch can be used as a workaround of #27418.)

So, allow me to advertise it again here.

I propose to add a single line to the code that determines the precompilation cache path (i.e., the path of the *.ji file) so that it differs for each system image (sys.so) path. This is not the solution to #27418 per se. However, you can create a system image for each frequently used project using PackageCompiler.jl to avoid precompilation when switching projects. Thus, the change I proposed can be used as a workaround for #27418. Note that, although I introduced it as a “workaround”, I think you’d want to have this workflow anyway to use PackageCompiler.jl with multiple projects. If we have this feature in Julia 1.2, we can implement a tool for this workflow on top of PackageCompiler.jl (or maybe include such a tool in it).
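To make the idea concrete, here is a minimal sketch in Python, used only as neutral illustration and not as Julia's actual implementation. The function names `cache_slug` and `cache_path`, the choice of SHA-256, the eight-character slug, and the UUID and paths in the usage lines are all assumptions made for this example. The point it shows is that mixing the system image path into the slug gives each system image its own set of cache files.

```python
import hashlib

def cache_slug(pkg_uuid: str, sysimage_path: str, length: int = 8) -> str:
    """Hypothetical slug that mixes the package UUID with the system image path."""
    h = hashlib.sha256()
    h.update(pkg_uuid.encode())
    h.update(b"\0")  # separator so the two fields cannot run together
    h.update(sysimage_path.encode())
    return h.hexdigest()[:length]

def cache_path(depot: str, julia_version: str, pkg_name: str,
               pkg_uuid: str, sysimage_path: str) -> str:
    """Build a cache file path in the depot/compiled/vX.Y/Name/slug.ji layout."""
    slug = cache_slug(pkg_uuid, sysimage_path)
    return f"{depot}/compiled/v{julia_version}/{pkg_name}/{slug}.ji"

# Made-up UUID and system image paths, for illustration only.
uuid = "00000000-0000-0000-0000-000000000001"
default = cache_path("~/.julia", "1.2", "Example", uuid, "/opt/julia/lib/julia/sys.so")
custom = cache_path("~/.julia", "1.2", "Example", uuid, "/home/me/images/proj-a/sys.so")
print(default)
print(custom)
```

Because the slug depends on the path of the system image rather than on its contents, rebuilding an image at the same location reuses (and overwrites) the same cache files instead of adding new ones.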

As @StefanKarpinski mentioned, this is not the direct solution to #27418 but it is rather a strict subset of the solution. This is why I think there is no “long-term code debt” in this approach. We need to take the system image into account for the path to the precompilation cache anyway because the system image may include the packages imported by the package that is precompiled.

What do you think? Does this approach make sense?

anon67531922 · March 23, 2019, 3:59am

Out of curiosity, how difficult would it be to create a PR implementing this suggestion? I think suggestions tend to get more attention / traction if there is a PR to go along with it. I don’t know if that is feasible, but would probably help if it were.

tkf · March 23, 2019, 4:16am

It’s already implemented. What I linked above was a PR (“Suggestion: Use different precompilation cache path for different system image”, JuliaLang:master ← tkf:slug-image).

anon67531922 · March 23, 2019, 4:23am

Right. Sorry.

By the title, it looked like an issue.

It’s outside my area, so I can’t comment on the merits. It definitely sounds like a problem worth solving, so hopefully progress is made.

tkf · March 23, 2019, 4:42am

No worries! Others might have the same question. It’s nice that it’s now clear that there is already an implementation. All we need to do is decide whether it is a good idea.

jpsamaroo · March 23, 2019, 12:50pm

I’m in favor, although I’d like to also see at least a plan laid out for how to solve this problem of unnecessary precompiles without using PackageCompiler. PackageCompiler is nice, but also has its limitations and doesn’t always work, so making that a requirement for “solving” this issue isn’t going to go well for many situations.

If a plan already exists or you have an idea of how to approach that, I’d appreciate hearing the details!

Edit: Btw, please don’t take this post as me being against your PR’s purpose until a full solution is in place; I’m still 100% for anything which improves upon the current situation.

oxinabox · March 23, 2019, 10:58pm

Broadly speaking, this seems like an impractical workaround.
I have dozens of environments on the go at a time (it is one of the nice things about Pkg3),
and I will often create new ones and discard them rapidly.

Creating a new system image for each is not practical.
Particularly since it would need to be redone whenever I update.

I’m not so much against it, as of the opinion that I have already spent more time writing this response than it will actually save most people in practice.
Certainly more than it will save me.

tkf · March 24, 2019, 4:26am

I think the long-term approach would be to use a hash tree (Merkle tree) to generate the path of the precompilation cache file. That is to say, the hash value of a given package depends on:

- “Content” of the package, which may be represented by one of the following:
  - (Option A) its UUID, version, and package options
  - (Option B) paths of all source files and files specified by include_dependency
- Hash values of all its dependencies.

Given the hash value, the path to the precompilation cache would be ~/.julia/compiled/v$X.$Y/$package_name/$digest_of_the_hash.ji, the same layout as today.
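The two ingredients listed above (package content plus the hashes of its dependencies) can be sketched as a recursive digest. This is a minimal Python illustration following Option A, not Julia's implementation. The `Package` class, the `digest` function, the use of SHA-256, and the package names and UUIDs are all assumptions, and package options are left out for brevity.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List

@dataclass
class Package:
    name: str
    uuid: str
    version: str
    deps: List["Package"] = field(default_factory=list)

def digest(pkg: Package) -> str:
    """Merkle-style digest: own content (Option A) plus digests of all dependencies."""
    h = hashlib.sha256()
    h.update(f"{pkg.uuid}\0{pkg.version}".encode())
    # Sort child digests so the result does not depend on declaration order.
    for child in sorted(digest(dep) for dep in pkg.deps):
        h.update(child.encode())
    return h.hexdigest()

# Made-up packages: Top depends on Mid, which depends on Leaf.
leaf = Package("Leaf", "uuid-leaf", "1.0.0")
mid = Package("Mid", "uuid-mid", "0.3.1", [leaf])
top = Package("Top", "uuid-top", "2.0.0", [mid])
before = digest(top)

leaf.version = "1.0.1"  # upgrading a leaf changes every digest above it
after = digest(top)
print(before != after)
```

A change anywhere in the dependency tree therefore yields a new cache file name for every package that (transitively) depends on the changed one, while untouched subtrees keep their digests.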

I think the hardest part is how to retrieve the hash values of all dependencies. This is difficult because you have to know the dependencies before you start loading the package (kind of a chicken-and-egg problem). It’s especially difficult because the dependency tree in Julia is dynamic; i.e., you can (de)activate projects in the middle of a session. IIUC, this may choose a different set of package versions depending on the order of imports and (de)activations. So, I think you need to build an in-memory dependency tree and propagate it to the subprocesses that precompile the packages.

Alternatively, maybe you can remove Pkg.activate (or highly restrict its usage) so that the whole dependency tree is statically determined by Manifest.toml (provided that it’s not modified by other processes). With this approach you need to add a Manifest.toml loader to Base (which probably requires adding a Base.Toml module). Another way to retrieve dependencies is to cache the dependencies recorded during precompilation in a persistent database file for each project.

tkf · March 24, 2019, 4:38am

If you are happy with your current workflow, my PR will not make it difficult to stick with it. It just creates opportunities for improving other workflows.

tkf · March 24, 2019, 5:22am

I realized I didn’t motivate why a hash tree is a good approach. For example, why not include HOME_PROJECT[] in the data from which the hash is computed [1]? The hash tree approach is better because, when multiple projects share a dependency subtree, its cache gets shared automatically and there is no need for precompilation even for newly created projects. This is a big help for short-lived projects as in the workflow @oxinabox explained.
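The difference between the two keying schemes can be shown with a small Python sketch. Everything here is illustrative: the helper names `sha`, `tree_key`, and `project_key`, the package names, the versions, and the project paths are made up, and SHA-256 is just a convenient hash. Two projects resolve packages A and B to the same versions; the second project merely has one extra package.

```python
import hashlib

def sha(*parts: str) -> str:
    """Short hex digest of the given strings (joined with a NUL separator)."""
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()[:8]

def tree_key(name: str, graph: dict) -> str:
    """Hash-tree key: depends only on the package and its dependency subtree."""
    version, deps = graph[name]
    return sha(name, version, *sorted(tree_key(d, graph) for d in deps))

def project_key(name: str, graph: dict, project_path: str) -> str:
    """Per-project key: the project path is mixed in, so nothing is shared."""
    version, _deps = graph[name]
    return sha(name, version, project_path)

# Made-up dependency graphs: name -> (version, [dependencies]).
long_lived = {"A": ("1.0", ["B"]), "B": ("2.1", [])}
scratch = {"A": ("1.0", ["B"]), "B": ("2.1", []), "C": ("0.4", ["B"])}

# The hash tree gives A the same key in both projects, so its cache is reused.
print(tree_key("A", long_lived) == tree_key("A", scratch))
# Keying on the project path forces a fresh precompilation in the new project.
print(project_key("A", long_lived, "/home/me/proj") ==
      project_key("A", scratch, "/tmp/scratch-env"))
```

With the hash tree, only the extra package C needs precompiling in the scratch project; with the per-project key, A and B would be precompiled again as well.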

[1] ~~Side note: It may be a good enough short-term solution (which is orthogonal to what I am proposing here). But simply doing so didn’t work because it is reset in load_path_setup_code, IIRC.~~

Edit: there is also LOAD_PATH which complicates the problem further

oxinabox · March 24, 2019, 9:51am

Indeed, that is why I said:

I’m not so much against it

It just has no impact on me.
As you said, what we need is something that stores caches based on the particular environment stack of Manifest.toml files.

Balinus · March 24, 2019, 2:04pm

Full disclosure: I’m far from having enough technical skill to understand the whole thread. With that being said, I understand that this PR would ease the communication between PyCall and pyjulia by sharing a system image. I just wanted to chime in to say that this is a rather strategic use case for quietly introducing Julia into a Python project.

For instance, I presented Julia at a big company (actually one of the biggest hydro-electric companies on earth and the biggest in North America). The outcome of the discussion was that the easiest entry point for a new language is to wrap it inside another language that is already accepted. Once this is done, it becomes another accepted language and new projects can then use it from the get-go.

Sorry if my post is not relevant to the discussion, just trying to provide a strategic view on the subject.

tkf · March 25, 2019, 5:59am

@Balinus Thanks for bringing another perspective to this.

As a bit of background, see Idea: use PackageCompiler.jl to avoid the precompilation cache nightmare? · Issue #217 · JuliaPy/pyjulia. The idea is to create a system image dedicated to PyJulia usage. This makes the Julia runtime automatically use a precompilation cache dedicated to PyJulia and hence avoids the major issue in PyJulia.

anon67531922 · March 25, 2019, 7:33am

I’m not following along (got a lot on my plate right now), but making a change to Julia to support PyJulia? And that change has a dependency on PackageCompiler.jl?

Can this be handled in a separate package?

tkf · March 25, 2019, 8:37am

There is nothing specific to PyJulia in my proposal. I’m just describing another nice side-effect from this change.

jpsamaroo · March 25, 2019, 8:43pm

This is pretty much what I was hoping for. I think you’ve got a pretty solid plan here, which is only slightly foiled by activate.

A solution to this problem could be to first make precompilation dependent on the static dependency tree defined in the project, and not worry about further activates until they happen. I see this as being acceptable because activate somewhat breaks the concept of a self-contained project as being atomic, and even allows packages which are already loaded to change versions from the perspective of code loading (even if you can’t actually load them).

Therefore, I think initially we should focus on just precompiling each project which is loaded in isolation, before any further activates occur. That way, the common case of loading and using only the first activated project would work well and be fast (common cases in my mind being running package tests, and starting a REPL with the --project flag). Note that we’ll also want separate precompilation directories for any targets like testing, since tests can contain additional dependencies.

We’d need to potentially get more fancy for multiple activates, which could be handled by creating “merged projects” which are pseudo-projects that have been precompiled with respect to the dependency trees of multiple projects (often a package’s project + one of the v1.x projects when doing development). This might be harder to do, but is also in my mind an edge case in the entire code loading scheme, so I believe it deserves special treatment anyway.

tkf · March 26, 2019, 2:46am

I replied to your comment in the GitHub issue since it’s more directly related to the original issue. I think that helps us focus on the actual topic here (Should the precompilation cache path depend on the system image? Should it be implemented before a full solution?).

jpsamaroo · March 26, 2019, 11:38am

Awesome thanks! And sorry if I wasn’t clear before, but I consider your proposed change for making precompilation path depend on sysimg to be an obvious addition. IMO it’s a good, simple solution to a common case for many people.

Edit: Another thing to consider, is what to do about precompile files accumulating on the user’s system. It might be good to document where those files are found so that the user at least knows how to manually delete them.

tkf · March 26, 2019, 11:23pm

Oh, you were very clear in the first post that you are positive about this proposal.

I think it’s hard for users to know which file corresponds to which environment. The path is currently something like ~/.julia/compiled/v1.2/PyCall/GkzkC.ji where GkzkC means nothing to humans. I think the better approach would be to add a GC (sub)command to Pkg.jl to wipe out old cache files. It also minimizes the API surface; i.e., we can change the path system later without breaking compatibility.
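As a rough illustration of what such a cleanup could do, here is a Python sketch that removes *.ji files under a compiled directory once they are older than a cutoff. It is not the proposed Pkg.jl command: the function name `gc_cache`, the age-based policy, and the 30-day default are all assumptions, and a real implementation might track usage differently.

```python
import time
from pathlib import Path

def gc_cache(compiled_dir, max_age_days=30, now=None):
    """Delete *.ji files whose modification time is older than max_age_days.

    Hypothetical policy for illustration; returns the list of removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    # Collect the matches first so the tree is not modified while scanning.
    for ji in sorted(Path(compiled_dir).rglob("*.ji")):
        if ji.stat().st_mtime < cutoff:
            ji.unlink()
            removed.append(ji)
    return removed
```

Calling something like `gc_cache(Path.home() / ".julia" / "compiled")` would then clear stale files without the user having to know what a name such as GkzkC.ji refers to, which is the reason to hide the path scheme behind a command.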
