Proposed Julia Docker workflow to use a "persistent" depot

When developing a Julia library or application I typically find myself incrementally adding or modifying my package dependencies. If the particular Julia application also needs to be included in a Docker image, this iterative workflow can be quite annoying, as any change to the dependencies results in all of them having to be installed and precompiled from scratch. It seemed like a better approach was possible, so after some research, and after discovering the RUN --mount=type=cache feature in Docker, I came up with the following Dockerfile:

# syntax=docker/dockerfile:1
ARG JULIA_VERSION=1.8.5
FROM julia:${JULIA_VERSION}

# Switch the Julia depot to use the shared cache storage. As `.ji` files reference
# absolute paths to their included source files care needs to be taken to ensure the depot
# path used during package precompilation matches the final depot path used in the image.
# If a source file no longer resides at the expected location the `.ji` is deemed stale and
# will be recreated.
RUN ln -s /tmp/julia-cache ~/.julia

# Install Julia package registries.
RUN --mount=type=cache,sharing=locked,target=/tmp/julia-cache \
    julia -e 'using Pkg; Pkg.Registry.add("General")'

# Disable automatic package precompilation. We'll control when packages are precompiled.
ENV JULIA_PKG_PRECOMPILE_AUTO "0"

# Instantiate the Julia project environment and precompile dependencies.
ENV JULIA_PROJECT /project
COPY Project.toml Manifest.toml ${JULIA_PROJECT}/
RUN --mount=type=cache,sharing=locked,target=/tmp/julia-cache \
    julia -e 'using Pkg; Pkg.instantiate(); Pkg.precompile(strict=true)'

# Copy the shared ephemeral Julia depot into the image and remove any installed packages
# not used by our Manifest.toml.
RUN --mount=type=cache,target=/tmp/julia-cache \
    rm ~/.julia && \
    mkdir ~/.julia && \
    cp -rp /tmp/julia-cache/* ~/.julia && \
    julia -e 'using Pkg, Dates; Pkg.gc(collect_delay=Day(0))'

Using this approach, the Julia registries, packages, and precompilation files are stored in a Docker cache which persists between image builds. The end result is that iterative package development produces much faster image builds, as only the missing packages need to be added and precompiled, just like in local development.

Additionally, this cache is shared between all Docker image builds, so it can also help accelerate workflows where multiple Dockerfiles and Julia Docker images need to be built. That said, through some experimentation I found that concurrent Docker image builds can result in file access collisions, so I decided to use sharing=locked to avoid running into these problems even though they seem rare in practice. The downside of sharing=locked is that concurrent builds will be slower than with sharing=shared, but they should still be faster than building all dependencies from scratch.
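For reference, the difference between the two modes is just the sharing option on the cache mount. A minimal sketch (using the same /tmp/julia-cache target as the Dockerfile above):

```dockerfile
# sharing=locked serializes access: concurrent builds wait for the lock,
# avoiding file access collisions at the cost of build parallelism.
RUN --mount=type=cache,sharing=locked,target=/tmp/julia-cache \
    julia -e 'using Pkg; Pkg.instantiate()'

# sharing=shared (the default) lets concurrent builds use the cache
# simultaneously, which is faster but risks the collisions noted above.
RUN --mount=type=cache,sharing=shared,target=/tmp/julia-cache \
    julia -e 'using Pkg; Pkg.instantiate()'
```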

Let me know if this approach to building Docker applications is useful for your workflow. Maybe I’ll try to add this as documentation in docker-library/julia if this is useful.


Hi, thanks for the Dockerfile. It seems to work well, and I struggled for a while with this. I have a case where I instantiate the project but then add a line to start Julia with a sysimage generated with PackageCompiler. The sysimage is around 800 MB, but the Docker image turns out to be 4 GB; that seems too big in my opinion. The sysimage needs the artifacts in .julia/, but still, I have the impression that I can reduce the size. Did you observe similar blow-ups, or do you have another Dockerfile setup for deploying a Julia image with pre-compiled sysimages?

I have a similar use case. I’d like the compilation cache to be preserved across different docker run calls, so I am persisting the ~/.julia folder in the container with the following argument to docker run:

--mount type=bind,source=~/.julia_docker,target=/home/user_in_container/.julia

This has the added benefit of also preserving the Julia REPL history across docker runs.

However, it breaks a lot of the hermetic and reproducibility benefits of Docker, because it also persists other state, such as package installs.

What is the recommended way to preserve the REPL history and compilation cache without any other state? Could I persist only DEPOT/logs and DEPOT/compiled?


My experience with PackageCompiler and making Julia sysimages is similar in that they really increase the image size. I haven’t been using sysimages as much recently but I tended only to use them when making a final image. It may be worth doing some performance testing to ensure that your image is seeing an actual benefit with the sysimage as package precompilation has gotten much better.

I also looked into this and ultimately didn’t go with this approach as reproducible image builds were important to me.

Could I persist only DEPOT/logs and DEPOT/compiled?

You’d also want to persist DEPOT/artifacts. Note that logs contains a “manifest_usage.toml” file which can result in Pkg.gc not cleaning up packages, which is probably important, as this single Julia depot could be shared across multiple image builds.
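To persist only those pieces, a sketch of the docker run flags (the host paths, image name, and the container depot location /root/.julia are illustrative assumptions):

```shell
# Persist REPL history/log data plus the compilation and artifact caches,
# while leaving package installs and environment manifests ephemeral.
docker run \
  --mount type=bind,source="$HOME/.julia_docker/logs",target=/root/.julia/logs \
  --mount type=bind,source="$HOME/.julia_docker/compiled",target=/root/.julia/compiled \
  --mount type=bind,source="$HOME/.julia_docker/artifacts",target=/root/.julia/artifacts \
  my-julia-image
```

Keep in mind the caveat about logs/manifest_usage.toml affecting Pkg.gc when the same host directory is reused across images.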

I’ve iterated on my original design here to utilize separate Docker caches for each Dockerfile. Doing this allows Julia images to use stacked depots and utilize COPY --from to build off of parent images. I can post an update here if there is interest.

Additionally, I have yet to experiment with Julia’s 1.11 change which addresses precompile file relocatability. That change should simplify the Dockerfile considerably.
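For anyone who wants to experiment, here is a rough, untested sketch of what that simplification might look like, assuming Julia 1.11’s relocatable precompile caches remove the need for the symlink trick in the original Dockerfile:

```dockerfile
# syntax=docker/dockerfile:1
FROM julia:1.11
ENV JULIA_PKG_PRECOMPILE_AUTO=0
ENV JULIA_PROJECT=/project
COPY Project.toml Manifest.toml ${JULIA_PROJECT}/
# Precompile into the cache mount, then copy the depot to its final location.
# With relocatable caches the depot path no longer needs to match between
# precompilation time and runtime.
RUN --mount=type=cache,sharing=locked,target=/tmp/julia-cache \
    JULIA_DEPOT_PATH=/tmp/julia-cache \
    julia -e 'using Pkg; Pkg.Registry.add("General"); Pkg.instantiate(); Pkg.precompile(strict=true)' && \
    cp -rp /tmp/julia-cache /root/.julia
```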


The approach I’m taking now is with these docker run flags

    julia_volumes = (
        # We want to persist some but not all of the julia depot across `docker run`.
        # We do not want to persist new package installations.
        # We do want to persist compilation cache, initializing it with the contents in the docker container
        # to take advantage of the compilation work done at `docker build` time.
        "--mount type=volume,source=foocontainer_julia_cache_artifacts,target=/opt/.julia/artifacts "
        "--mount type=volume,source=foocontainer_julia_cache_compiled,target=/opt/.julia/compiled "
        "--mount type=volume,source=foocontainer_julia_cache_packages,target=/opt/.julia/packages "
        "--mount type=volume,source=foocontainer_julia_cache_registries,target=/opt/.julia/registries "
        # Persist the Julia REPL history
        "--mount type=bind,source=~/dev_docker_persistence/.julia/logs,target=/opt/.julia/logs "
    )

(I also have the depot directory not be in a user directory, because the docker user matches the host user.)

I haven’t run into issues yet, but I realize this is probably not orthodox.

@omus, could you please elaborate on

I also looked into this and ultimately didn’t go with this approach as reproducible image builds were important to me.

Use the depot stack? See the DEPOT_PATH variable in Julia or use the JULIA_DEPOT_PATH environment variable.

https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_DEPOT_PATH

https://docs.julialang.org/en/v1/base/constants/#Base.DEPOT_PATH

Only the first depot is writable. The others should be read only.
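As a concrete sketch of stacking depots (paths are illustrative): the first entry is the writable user depot, and later entries are read-only shared depots, e.g. one baked into a parent image:

```shell
# First entry: writable per-user depot; second entry: read-only depot
# baked into the image. Julia searches the entries in order.
export JULIA_DEPOT_PATH="$HOME/.julia:/opt/shared_depot"
echo "$JULIA_DEPOT_PATH"
```

Inside Julia, Base.DEPOT_PATH will then list both entries in the same order.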

@omus, could you please elaborate on

I also looked into this and ultimately didn’t go with this approach as reproducible image builds were important to me.

Using a bind mount to re-use the Julia depot between the host and Docker containers can be problematic, as the depot’s environment manifests (e.g. $DEPOT/environments/v1.9/Manifest.toml) can result in unneeded packages being left in the Docker image even after running Pkg.gc(). If your Docker image builds target multiple architectures, Julia will end up removing .ji files for other platforms, which can result in more pre-compilation churn. Finally, there is a time cost to transferring the Docker context, and a large Julia depot shared across multiple build containers can be quite slow.

Utilizing a volume mount like yours can work for sharing a Julia depot across running containers. However, I believe volume mounts aren’t a supported option when building a container. Additionally, the volume approach requires a pre-build step, which I wanted to avoid; that is why I ended up utilizing cache mounts.

My specific use case was focused on baking a Julia depot into a built image while keeping the image size and build time to a minimum.
