Distributing scripts with self-managed dependencies

I am looking for a best practice solution to the following situation.

I am developing a data processing pipeline for my team. I have wrapped all of the core functionality in a standard Julia module (file I/O with our custom binaries, type definitions, functions for each step of processing, methods for downloading external resources, etc.). The “pipeline” itself is a number of high-level scripts that collect command-line arguments and internally call the library functions. Some of these scripts also handle diagnostic tasks (GUI planning tools, data-quality plotting, etc.). All of these scripts live in a bin/ directory at my module’s top level (the same level as src/, deps/, test/, and docs/).
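
For concreteness, the layout is roughly the following (MyPipeline is a placeholder name; the comments just summarize the above):

MyPipeline/
├── bin/          # command-line entry points: pipeline steps, diagnostics, plotting
├── deps/
├── docs/
├── src/          # the module itself: custom file I/O, types, processing functions
├── test/
└── Project.toml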

Most of my colleagues don’t have any experience with Julia. I would like to minimize the number of steps my team members need to get the pipeline up and running on their own machines (and the amount of support needed on my end). However, since this pipeline is still in active development, I do not want to have to rebuild and redistribute binaries each time there is a change.

These are the current instructions I have listed for my team:

  1. Download and install Julia (1.8 or greater). Add it to the user PATH.
  2. Git clone our (private) repository.
  3. Launch the Julia REPL, enter package manager mode, and dev path/to/cloned/repo. (I am using dev instead of add since I could not figure out an easy way to get Pkg to remember our GitLab ssh key, and having to enter it each time to update the module is too much of a pain.)
  4. Manually install (add) a list of all the scripts’ Julia dependencies (see the sketch after this list).
  5. Add the bin/ folder to the user PATH so that all of the scripts are immediately available for use.
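
In REPL terms, steps 3 and 4 currently look something like this (the packages in step 4 are placeholders for whatever the scripts actually use):

# press ] at the julia> prompt to enter package manager mode, then:
pkg> dev path/to/cloned/repo

# step 4, one add per script dependency, e.g.:
pkg> add ArgParse
pkg> add Plots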

I then let my team know when to git pull once new features of interest to them are available. The biggest issue with this approach is that occasionally I add new dependencies to the scripts, and then invariably have to deal with individual emails fixing people’s environments. I would really like to instead maintain a Project.toml file (or something equivalent) within the bin/ folder which the scripts could automatically activate on their own. However, I cannot seem to figure out a clean way to do this. I have been trying to use the shebang in the script, with something along the lines of

#! /usr/bin/env -S julia --project=@. -e "using Pkg; Pkg.instantiate()"

but this has the twofold problem that (1) the user needs to have the module’s bin/ directory as their active working directory, which defeats the purpose of adding the pipeline to their path so it can be run wherever the actual data files for processing live (and I can’t hard-code a path instead, because I have no way of knowing a priori where a user will git clone the library), and (2) since my module is not registered in the Julia registry, this invocation leads to an error anyway.

Is there another recommended approach I should be taking instead? I see another seemingly related thread here, but I do not want my teammates to have to activate each time they run the scripts. Ideally, the scripts should “just work”.

Thanks!

I add a script like setup.jl to the repo and document that it should be run after cloning or pulling.

#!/usr/bin/env julia
using LibGit2, Pkg

if !isdir("dev/UnregisteredOnGithub.jl")
    LibGit2.clone(
        "https://github.com/someone/UnregisteredOnGithub.jl.git",
        "dev/UnregisteredOnGithub";
        branch = "working_branch",
    )
end

# or
if !isdir("dev/UnregisteredLocal.jl")
    cmd = `git clone -b working_branch git@gitlab.mycompany.com:someone/UnregisteredLocal.jl dev/UnregisteredLocal`
    run(cmd)
end

# Use the current directory for Manifest and Project.toml
Pkg.activate(".")

# a helper to dev a bunch of packages at once
dev(pkgs) = Pkg.develop((p -> PackageSpec(path = p)).(pkgs))

dev([
    "dev/UnregisteredOnGithub",
    "dev/UnregisteredLocal",
    "SubPackageInThisRepo",
])

# And finally, pull all the packages from the registry as specified in Project.toml
Pkg.resolve()
Pkg.instantiate()
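
The documented usage is then simply julia setup.jl, run from the repository root after cloning or pulling.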

This is fine but consider using GitHub - JuliaLang/juliaup: Julia installer and version multiplexer. If nothing else it gives you an easier upgrade path which you can script for your colleagues.

Don’t do this.

Yes, this is the way to go. Don’t try to activate and instantiate in the shebang; instead, start your scripts with

using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()

This is an achievable goal. Feel free to ask if you have more stumbling blocks.
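
As a concrete sketch, assuming the Project.toml sits next to the scripts in bin/ and already lists the pipeline package as a dependency (MyPipeline and run_pipeline are placeholder names):

#!/usr/bin/env -S julia --startup-file=no

# Activate the environment that lives next to this script,
# independent of the user's current working directory.
using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()   # installs anything missing on first run or after a pull

using MyPipeline

# Hand the command-line arguments over to the library code.
MyPipeline.run_pipeline(ARGS)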

I don’t know what the issue is with that ssh key, but if you make your code a proper package, you can just tell the users to “add” the package and that will take care of all dependencies, with proper versions, etc.
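
For the users that would be a one-time step along the lines of the following (the URL is a placeholder, and this assumes their ssh access to the repository works):

using Pkg
Pkg.add(url = "git@gitlab.mycompany.com:someone/MyPipeline.jl.git")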

(Absolutely tell them to install Julia with juliaup.)

This is one way to have scripts working with minimal startup time: Creating a standalone app on different platforms - #3 by lmiq

Highly recommend this plus a local registry. Then it’s just

  1. Install Julia
  2. ] registry add https://GitHub.com/your_user/YourRegistry
  3. ] add YourPackage

It’s a good idea to put the computational parts of the scripts inside a package, but that doesn’t automatically integrate with a script-driven data pipeline environment, which I imagine might also include scripts in languages other than Julia. Unless you have super-strict compat entries, it also doesn’t provide the level of reproducibility you get from a manifest, which you probably want for a data pipeline.

This is how I have set up a data pipeline repository at work, simplified and anonymized:

├── docker
│   ├── build.sh
│   ├── Dockerfile
│   └── run.sh
├── julia
│   ├── julia_script1.jl
│   ├── [more julia scripts...]
│   ├── Manifest.toml
│   ├── Project.toml
│   ├── src
│   │   ├── OurDataPipeline.jl
│   │   └── [julia source files...]
│   └── test
│       └── runtests.jl
└── python
    ├── python_script1.py
    └── [more python scripts...]

The Dockerfile installs Julia and a well-defined set of Python packages. It also sets JULIA_PKG_SERVER to point to a local package server which serves packages from our local registry.

The julia directory is a locally registered package, but the pipeline scripts are intended to be run from a clone of the repository, using the Docker image. All Julia scripts begin with

using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
using OurDataPipeline

This has turned out to be very reliable and does indeed just work. Almost all problems we have run into have been on the Python side.

Comments:

  1. This predates juliaup, and if I were doing it today and didn’t have Python scripts in the mix, I would probably just go with juliaup instead of the Docker image.
  2. Having the scripts in a subdirectory would arguably be cleaner. One reason I didn’t set that up is that the scripts are intended to be run with julia julia/script.jl ARGS... and I didn’t want to add another path component. With a PATH + shebang design there’s not much to hesitate about.

Hi all, thank you for the advice!

Works like a charm! This is exactly what I needed!

~~~

Good call; I had forgotten that juliaup existed.

I am running into this mess if I try to go the add route: SSH auth keys, just very painful outside of ssh-agent. · Issue #911 · JuliaLang/Pkg.jl · GitHub. The current approach of a local git clone + dev, while not ideal, seems to be the path of least resistance at the moment. Eventually we will make the repo public, and then all of these problems will go away.

I will look into this, thanks. Though my package is likely the only one that will ever exist in such a local registry, so unless this somehow circumvents the ssh key issue, I’m not sure whether it is worth the time investment. Perhaps I am misunderstanding.

@GunnarFarneback Follow-up question (mostly a nitpick, not a huge issue): Is there an easy way to suppress the message

  Activating project at `path/name/here`

without necessarily suppressing any follow-up messages (e.g. if an installation needs to happen). For actually running the pipeline it’s not a big deal, because a lot of text is output anyway, but it is slightly annoying if you just want to view the script’s help text or version info.

Yes.

redirect_stderr(devnull) do
    Pkg.activate(@__DIR__)
end

Thanks! Looks like I can also just do

Pkg.activate(@__DIR__; io=devnull)
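
So a quiet script header that still reports installations would be something like:

using Pkg
Pkg.activate(@__DIR__; io = devnull)  # hides the "Activating project at ..." line
Pkg.instantiate()                     # still prints if something needs installing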

Even better. I had forgotten that option (which might not have existed when I learned to work around it).

Consider using the fantastic GitHub - GunnarFarneback/LocalRegistry.jl: Create and maintain local registries for Julia packages. Maybe the comments here give inspiration: Survey on how you use Julia - #13 by aplavin
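
For completeness, the maintainer-side workflow with LocalRegistry.jl is roughly the following (registry name and URL are placeholders; see the package README for the exact options):

using LocalRegistry

# One-time setup: create the registry and host it somewhere the team can reach.
create_registry("YourRegistry", "https://github.com/your_user/YourRegistry")

# Then, for each new version of the package you want to ship:
register("YourPackage"; registry = "YourRegistry")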