RFC: a "workaround" for the multi-project precompilation cache problem without long-term code debt

#1

Many of us have now realized that using multiple projects is a pain, because switching projects invokes precompilation:

I suggested a “workaround” a while ago but it never got much attention:

So, allow me to advertise it again here.

I propose adding a single line to the code that determines the path of the precompilation cache (i.e., the *.ji file) so that the path differs for each system image (sys.so) path. This is not a solution to #27418 per se. However, you can create a system image for each frequently used project with PackageCompiler.jl and thereby avoid precompilation when switching projects. Thus, the change I propose can serve as a workaround for #27418. Note that, although I introduced it as a “workaround”, I think you’d want this workflow anyway to use PackageCompiler.jl with multiple projects. If we have this feature in Julia 1.2, we can implement a tool for this workflow on top of PackageCompiler.jl (or maybe include such a tool in it).
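As a rough illustration of the idea, here is a minimal sketch written in Python rather than Julia, so it is not the code from the PR. The function name `cache_path`, the hashing scheme, the UUID, and the example paths are all assumptions made up for this example. The only point is that the system image path becomes one of the inputs that determine the *.ji file name.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(depot, julia_version, pkg_name, pkg_uuid, sysimage_path):
    """Return a hypothetical *.ji path that differs per system image.

    Illustration only: Julia's real slug computation is different.
    """
    # Mix the system image path into the digest, next to the package identity.
    h = hashlib.sha256()
    h.update(pkg_uuid.encode())
    h.update(b"\0")
    h.update(sysimage_path.encode())
    slug = h.hexdigest()[:8]
    return PurePosixPath(depot) / "compiled" / julia_version / pkg_name / f"{slug}.ji"


# Hypothetical UUID and system image paths, chosen only for this example.
uuid = "00000000-0000-0000-0000-000000000001"
default = cache_path("~/.julia", "v1.2", "PyCall", uuid, "/usr/lib/julia/sys.so")
custom = cache_path("~/.julia", "v1.2", "PyCall", uuid, "/home/me/proj/sys.so")
```

In this sketch, `default` and `custom` land in the same package directory but have different file names, so a session started with a custom system image would not overwrite or invalidate the cache used by the default one.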

As @StefanKarpinski mentioned, this is not a direct solution to #27418; rather, it is a strict subset of the solution. This is why I think there is no “long-term code debt” in this approach. We need to take the system image into account in the path to the precompilation cache anyway, because the system image may include packages imported by the package being precompiled.

What do you think? Does this approach make sense?

#2

Out of curiosity, how difficult would it be to create a PR implementing this suggestion? I think suggestions tend to get more attention / traction if there is a PR to go along with them. I don’t know if that is feasible, but it would probably help if it were.

#3

It’s already implemented. What I linked above was a PR:

#4

Right. Sorry :slight_smile:

By the title, it looked like an issue :sweat_smile:

It’s outside my area, so I can’t comment on the merits. Definitely sounds like a problem worth solving so hopefully progress is made :pray:

#5

No worries! Others might have the same question. It’s nice that it’s now clear that there is already an implementation :slight_smile: All we need to do is to decide if it is a good idea.

#6

I’m in favor, although I’d like to also see at least a plan laid out for how to solve this problem of unnecessary precompiles without using PackageCompiler. PackageCompiler is nice, but also has its limitations and doesn’t always work, so making that a requirement for “solving” this issue isn’t going to go well for many situations.

If a plan already exists or you have an idea of how to approach that, I’d appreciate hearing the details! :slight_smile:

Edit: Btw, please don’t take this post as me being against your PR’s purpose until a full solution is in place; I’m still 100% for anything which improves upon the current situation.

#7

This seems like a broadly impractical workaround.
I have dozens of environments on the go at a time – it is one of the nice things about Pkg3,
and I will often create new ones and discard them rapidly.

Creating a new system image for each is not practical,
particularly since that would need to be redone whenever I update.

I’m not so much against it, as of the opinion that I have already spent more time writing this response than it will actually save most people in practice.
Certainly more than it will save me.

#8

I think the long-term approach would be to use a hash tree (Merkle tree) to generate the path of the precompilation cache file. That is to say, the hash value of a given package depends on:

  1. “Content” of the package which may be represented by one of the following:

  2. Hash value of all dependencies.

Given the hash value, the path to the compilation cache would be ~/.julia/compiled/v$X.$Y/$package_name/$digest_of_the_hash.ji as done today.

I think the hardest part is retrieving the hash values of all dependencies. This is difficult because you have to know the dependencies before you start loading the package (kind of a chicken-and-egg problem). It’s especially difficult because the dependency tree in Julia is dynamic; i.e., you can (de)activate projects in the middle of a session. IIUC, this may select a different set of package versions depending on the order of imports and (de)activations. So, I think you need to build an in-memory dependency tree and propagate it to the subprocesses that precompile the packages.
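The hash-tree idea can be sketched as follows. This is a Python illustration under stated assumptions, not a proposal for Julia's actual implementation: the function `package_digest`, the content identifiers such as `"a-1.0"`, and the two toy environments are all hypothetical, and the dependency graph is assumed to be known up front and acyclic (which is exactly the hard part described above).

```python
import hashlib


def package_digest(name, content_hashes, deps, _memo=None):
    """Merkle-style digest: own content hash plus digests of all dependencies.

    `content_hashes` maps package name -> content identifier (hypothetical),
    `deps` maps package name -> list of direct dependency names.
    Assumes the dependency graph is acyclic.
    """
    if _memo is None:
        _memo = {}
    if name in _memo:
        return _memo[name]
    h = hashlib.sha256()
    h.update(content_hashes[name].encode())
    # Sort so that the digest does not depend on declaration order.
    for dep in sorted(deps.get(name, [])):
        h.update(package_digest(dep, content_hashes, deps, _memo).encode())
    _memo[name] = h.hexdigest()
    return _memo[name]


# Two made-up environments that differ only in the version of C.
deps = {"A": ["B", "C"], "B": ["C"], "C": [], "D": []}
env1 = {"A": "a-1.0", "B": "b-2.0", "C": "c-0.3", "D": "d-1.0"}
env2 = {"A": "a-1.0", "B": "b-2.0", "C": "c-0.4", "D": "d-1.0"}

same_d = package_digest("D", env1, deps) == package_digest("D", env2, deps)
same_a = package_digest("A", env1, deps) == package_digest("A", env2, deps)
```

Here `same_d` is true because D and everything below it are identical in both environments, so its cache file could be reused. `same_a` is false because A transitively depends on C, whose content differs, so A would get a separate cache file in each environment.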

Alternatively, maybe you can remove Pkg.activate (or highly restrict its usage) so that the whole dependency tree is statically determined by Manifest.toml (provided that it’s not modified by other processes). With this approach you need to add a Manifest.toml loader to Base (which probably requires adding a Base.Toml module). Another way to retrieve dependencies is to cache the dependencies recorded during precompilation in a persistent database file for each project.

#9

If you are happy with your current workflow, my PR will not make it difficult to stick with it. It just creates opportunities for improving other workflows.

#10

I realized I didn’t motivate why a hash tree is a good approach. For example, why not include HOME_PROJECT[] in the data from which the hash is computed [1]? The hash tree approach is better because, when multiple projects share a dependency subtree, the cache for that subtree gets shared automatically and there is no need for precompilation even for newly created projects. This is a big help for short-lived projects as in the workflow @oxinabox explained.

[1] Side note: It may be a good enough short-term solution (which is orthogonal to what I am proposing here). But simply doing so didn’t work because it is reset in load_path_setup_code, IIRC.

Edit: there is also LOAD_PATH which complicates the problem further

#11

Indeed, that is why I said:

I’m not so much against it

It is just without impact to me.
As you said, what we need is something that stores the cache based on the particular environment stack of Manifest.toml files.

#12

Full disclosure: I’m far from having enough tech skills to understand the whole thread. With that being said, I understand that this PR would ease the communication between PyCall and pyjulia by sharing a system image. I just wanted to chime in by saying that this is a rather strategic use case for quietly introducing Julia into Python projects.

For instance, I presented Julia at a big company (this is actually one of the biggest hydro-electric companies on earth and the biggest in North America). The outcome of the discussion was: the easiest entry point for a new language is to wrap it up inside another language that is already accepted. Once this is done, it becomes another accepted language and new projects can then use it from the get-go.

Sorry if my post is not relevant to the discussion, just trying to provide a strategic view on the subject.

#13

@Balinus Thanks for bringing another perspective to this.

As a bit of background, see Idea: use PackageCompiler.jl to avoid the precompilation cache nightmare? · Issue #217 · JuliaPy/pyjulia. The idea is to create a system image dedicated to PyJulia usage. This makes the Julia runtime automatically use a precompilation cache dedicated to PyJulia and hence avoids the major issue in PyJulia.

#14

I’m not following along (got a lot on my plate right now), but making a change to Julia to support PyJulia? :thinking: And that change has a dependency on PackageCompiler.jl? :thinking:

Can this be handled in a separate package?

#15

There is nothing specific to PyJulia in my proposal. I’m just describing another nice side-effect from this change.

#16

This is pretty much what I was hoping for. :smile: I think you’ve got a pretty solid plan here, which is only slightly foiled by activate.

A solution to this problem could be to first make precompilation dependent on the static dependency tree defined in the project, and not to worry about further activates until they happen. I see this as acceptable because activate somewhat breaks the concept of a self-contained project as being atomic, and even allows packages which are already loaded to change versions from the perspective of code loading (even if you can’t actually load them).

Therefore, I think initially we should focus on just precompiling each project which is loaded in isolation, before any further activates occur. That way, the common case of loading and using only the first activated project would work well and be fast (common cases in my mind being running package tests, and starting a REPL with the --project flag). Note that we’ll also want separate precompilation directories for any targets like testing, since tests can contain additional dependencies.

We’d need to potentially get more fancy for multiple activates, which could be handled by creating “merged projects” which are pseudo-projects that have been precompiled with respect to the dependency trees of multiple projects (often a package’s project + one of the v1.x projects when doing development). This might be harder to do, but is also in my mind an edge case in the entire code loading scheme, so I believe it deserves special treatment anyway.

#17

I replied to your comment in the github issue since it’s more directly related to the original issue. I think it helps us focus on the actual topic here (Should precompilation cache path depend on system image? Should it be implemented before full solution?).

#18

Awesome thanks! And sorry if I wasn’t clear before, but I consider your proposed change for making precompilation path depend on sysimg to be an obvious addition. IMO it’s a good, simple solution to a common case for many people.

Edit: Another thing to consider is what to do about precompile files accumulating on the user’s system. It might be good to document where those files are found so that the user at least knows how to manually delete them.

#19

Oh, you were very clear in your first post that you are positive about this proposal.

I think it’s hard for users to know which file corresponds to which environment. The path is currently something like ~/.julia/compiled/v1.2/PyCall/GkzkC.ji, where GkzkC means nothing to humans. I think the better approach would be to add a GC (sub)command to Pkg.jl to wipe out old cache files. It also minimizes the API surface; i.e., we can change the path scheme later without breaking compatibility.
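To show what such a GC step might do, here is a small Python sketch. It is not Pkg.jl's API: the function names `stale_cache_files` and `gc_cache`, the age-based policy, and the second file name `AbCdE.ji` are assumptions made up for this illustration, and the demo runs in a temporary directory rather than a real depot.

```python
import os
import tempfile
import time
from pathlib import Path


def stale_cache_files(compiled_dir, max_age_days, now=None):
    """Return *.ji files under `compiled_dir` not modified for `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in Path(compiled_dir).rglob("*.ji")
                  if p.stat().st_mtime < cutoff)


def gc_cache(compiled_dir, max_age_days, dry_run=True, now=None):
    """Delete stale cache files; with dry_run=True only report them."""
    stale = stale_cache_files(compiled_dir, max_age_days, now)
    if not dry_run:
        for p in stale:
            p.unlink()
    return stale


# Demo in a throwaway directory that mimics the layout compiled/v1.2/PyCall/.
with tempfile.TemporaryDirectory() as d:
    pkg = Path(d) / "v1.2" / "PyCall"
    pkg.mkdir(parents=True)
    old, new = pkg / "GkzkC.ji", pkg / "AbCdE.ji"
    old.touch()
    new.touch()
    # Pretend the first file was last written 90 days ago.
    ninety_days_ago = time.time() - 90 * 86400
    os.utime(old, (ninety_days_ago, ninety_days_ago))
    removed = [p.name for p in gc_cache(d, max_age_days=30, dry_run=False)]
    remaining = [p.name for p in pkg.iterdir()]
```

Because the user only ever calls the command and never touches the opaque file names, the naming scheme behind the *.ji paths stays an implementation detail.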
