How Beacon Packages Julia Code (In A Monorepo)

:wave:

@erichanson recently linked this discussion in Beacon’s Slack, and we were discussing how interesting it is to get a window into other orgs’ approaches to this sort of thing. It made us realize that it might be useful for us to share a bit more about Beacon’s approach, so I wrote this up post-feast yesterday - hope it’s interesting!

Beacon’s Monorepo Background/Motivation

It’s probably important to preface the rest of this content with some background on Beacon’s motivation for a monorepo in the first place, because that motivation pretty directly informs our approach. I make no claims that our practices will necessarily be applicable to folks in different environments :slight_smile:

At Beacon, we actually generally default to a multirepo approach for all independently useful library-style packages, but all of the production services/applications that underpin our core platform are housed in a single monorepo. Not all of the code in this platform monorepo is Julia, but a good chunk of it is.

Paraphrasing from a relevant section of Beacon’s architecture journal:

We’ve chosen to develop our platform in a monorepo fashion to start, but reserve the option to break it out into a multirepo configuration at a later time. Our platform consists of multiple “components”: loosely coupled services/applications all developed in accordance with the same CI/CD configuration, infrastructure, and engineering practices.

We’re building multiple different components with a comparatively small team. By developing in a monorepo from the get-go, it becomes a bit easier for us to…

  • …impose/enforce uniform cross-component practices/structures across the codebase
  • …rapidly prototype new cross-component structures across the codebase
  • …lower potential cross-repository synchronization overhead during a development period in which different cross-component boundaries are still being explored

As our platform grows, each given component’s API boundary matures and dedicated teams may evolve around specific components. When a given component reaches that point, we may choose to split out a matured component into its own repository in order to enable its team to function more independently.

In other words, we went the monorepo route more to incubate greenfield development efforts executed by a small team and to provide a low-overhead ramp to architectural maturation than out of a desire to opt long-term into (and/or optimize for) the tradeoffs traditionally associated with a full-blown monorepo paradigm.

For the most part, our top architectural priority when we started was to design/implement ideal cross-component API boundaries, and we figured a monorepo structure would give us the optimal environment to rapidly prototype (and battle-test) such boundaries without incurring additional overhead associated with “baking down” a given architectural scheme into a given repository structure. By now, these boundaries are pretty much drawn/stable, but we still haven’t broken out the monorepo into a multirepo since a strong enough need hasn’t arisen that would drive us to do so. Probably will eventually do so, though.

I wanted to explicitly call out this background here, because it affords Beacon some leeway that might not be afforded to a team that was aiming for a “much more monorepo-y” approach to their monorepo. For example, our monorepo relies on Beacon’s internal package registry, which lives in its own repo, not in the monorepo (though I suppose it could, if we desired that).

Relevant Beacon Practices

With that background out of the way, here are the actual relevant Beacon practices that I thought it’d be valuable to share, heavily paraphrased from Beacon’s internal documentation. Beaconeers might note that I’ve added/removed details as needed for a general audience, and have translated a few of our language-agnostic practices into Julia-specific manifestations of those practices for this post.

Each Julia package developed at Beacon (including those in our monorepo)…

  • …should be developable/deployable, testable, versioned, released/registered, and documented independently of other packages, for some reasonable interpretation of “independence” (hopefully well-enough characterized by the rest of the points). Importantly, no package should depend on another package’s encapsulated implementation details, only on documented APIs.

  • …whose version is >= v0.1.0 may only depend on Julia packages that are registered in a package registry (either General, or Beacon’s internal registry).

  • …must declare compatibility bounds for all non-Base dependencies in its Project.toml. At a minimum, dependencies must be upper-bounded at their most recently released major version (or minor version, if the dependency is in a v0.x.y release series); see the [compat] sketch just after this list.

  • …should not contain a Manifest.toml checked into version control, if intended to be used as a “library-style” dependency of another package (i.e. its sole purpose is to provide code that should be directly invoked from within other code, and it doesn’t back a standalone service/application). This forces developers (and more importantly, CI) to independently resolve the package’s dependencies, which is more consistent with downstream environments in which dependencies will be independently resolved without regard to your personal Manifest.toml. This practice also prevents a shared Manifest.toml from accidentally masking reproducibility issues with the package’s declared compatibility bounds in freshly-resolved downstream environments.

  • …should not contain “checked-in direct filesystem-level dependencies” on any content that is not owned by the package. All such dependencies should instead be intermediated by explicit APIs and/or proper package management. For example, a package should never directly include a script/file that lives outside the package. Another example, which touches on the previous Manifest.toml requirement: Imagine you’re developing the Julia packages A.jl and B.jl, and A depends on B. A’s dependency on B must be declared against a registered/released version of B, not declared as a filesystem reference via Pkg.dev. Package authors may still utilize dev locally for development purposes, of course, as long as they do not check in a Manifest.toml with a dev’d dependency (see the workflow sketch just after this list).

  • …should maintain its own unit tests, and - if useful, especially for application packages - integration tests. Each package’s unit tests should be runnable/passable via Pkg.test in CI without requiring a checked-in Manifest.toml. Each package’s integration tests (if such tests exist) should stress interaction points with targeted direct upstream components; these tests should not target indirect upstream components. In other words, test against your dependencies, not your dependencies’ dependencies.
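
To make the compat-bounds rule concrete, here’s a minimal sketch of a [compat] section written under these rules (package names and versions are made up for illustration):

```toml
[compat]
julia = "1"
SomeMatureDep = "2"   # latest release is v2.x.y, so upper-bound at the major version
SomeYoungDep = "0.5"  # latest release is v0.5.y, so upper-bound at the minor version
```

In Pkg’s compat notation, "2" means [2.0.0, 3.0.0) and "0.5" means [0.5.0, 0.6.0), so a freshly-resolved downstream environment can never silently pick up a breaking release of either dependency.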
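And here’s a minimal sketch of the local development loop the Pkg.dev point permits, assuming two hypothetical packages A and B checked out side by side:

```julia
using Pkg
Pkg.activate("A")           # work in A's environment
Pkg.develop(path = "../B")  # iterate against the local checkout of B...
# ...hack on A and B together...
Pkg.free("B")               # ...then pin back to the registered B before opening a PR,
                            # never checking in a Manifest.toml with B dev'd
```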

Note that in practice, the composition of these rules can cause cross-package changes to require multiple PRs - something that a more monorepo-centric team may seek to avoid. For example, if I have A.jl and B.jl, and A depends on B, then it takes at least two PRs to land a breaking change in B and propagate it to A:

  1. A PR is opened/merged which implements the B change

  2. Once this PR has landed on main, the B change is tagged/registered with Beacon’s package registry

  3. A second PR is opened/merged which propagates the change to A

  4. Once this PR has landed on main, the A change is tagged/registered with Beacon’s package registry
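
As one concrete way to picture the tag/register steps (2 and 4): with LocalRegistry.jl (which comes up later in this thread, and isn’t necessarily what we use internally), the registration step could look something like this, with a placeholder registry name:

```julia
using LocalRegistry
# After the version-bumping PR for B has landed on main, register the
# new version of B against the internal registry.
register("B"; registry = "BeaconInternalRegistry")
```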

We consider this a desirable feature of our approach, but YMMV.

Conclusion

I can’t speak for how widely applicable it is, since I only have “anecdata” from within Beacon, but I hope this post is at least interesting/helpful to folks who are curious about how Julia code can be internally packaged within an industry setting. For us, at least, I feel that following these rules - which isn’t always easy to do - generally allows a bunch of other things to “just” work.

One last thing I’d like to note: I’m consistently blown away by how Pkg’s thoughtful design supports so many organizational configurations so cleanly, as long as you “go with the flow” of its design. At this point, it’s hard not to feel like other packaging tools/ecosystems are a PITA by comparison. A big shout out to all of Pkg’s authors/contributors for such a great ecosystem-empowering tool.

37 Likes

If you require a separate merge for a package change that requires another package change within the same monorepo, where does the benefit of the monorepo lie for you? There’s an implied use case in the post above where developers can iterate on the package together by using dev locally, thus testing multiple interdependent updates concurrently, is that where the benefit lies? Are there other benefits?

2 Likes

We’re also delivering industrial code as a small team with a multirepo setup, and we follow basically the same guidelines.

On top of that, we have:

  • Automated CI for all packages at midnight, to catch unforeseen integration problems
  • Major versions are for major architectural shifts, and minor versions can include breaking changes. It doesn’t make sense to agonize over breaking changes (and compat bounds) when everybody’s using the latest versions anyway and PRs can be coordinated across affected packages.
  • A GitHub repo for greenfield R&D that is a grab-bag of folders with Manifests, code, and Jupyter notebooks. Research outcomes often end up shared in GitHub issues. This I find somewhat unsatisfying, and would be curious to hear how others manage.

Thank you for sharing!

:100:

2 Likes

Super great to hear about this kind of organization/setup.

Due to historical reasons (i.e. pre-custom-registry support), at RelationalAI we currently have a monorepo setup with internal packages located at relative paths inside the monorepo. These “unregistered” packages are referenced in the project (currently) by manually editing the top-level Manifest.toml file to include the relative paths to the subpackages, so they’re kind of like “devved” packages for the monorepo.

This mostly works when working from the top level, but we currently have a pain point: if you try to work in the environment of a subpackage, we lose the context of the parent Manifest.toml, and then Pkg operations error when they can’t find peer internal dependencies. We could similarly check in the Manifest.toml files for each subpackage, and manually edit those Manifest.toml files to have entries for the relative-path internal dependencies, but phew, for several dozen internal packages that becomes quite a large administrative overhead.

This PR is our proposed solution to making subpackages within a monorepo more first-class. It proposes allowing a [sources] section in the top-level Project.toml that would allow specifying the relative paths to internal subpackages, thus avoiding the need to manually edit our top-level Manifest.toml. The other proposal is allowing subpackages to specify parent = "ParentProjectName" in the subpackage Project.toml, which has the effect of, when activating the subpackage environment, walking up until the parent Project.toml is found and allowing the subpackage to “inherit” the [sources] section of the parent, thus “fixing” the issue of how subpackages can be aware of their internal peer dependencies.
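
Based on that description (so, a sketch of the proposal rather than a final API, with hypothetical names and paths), the two pieces might look something like:

```toml
# Top-level Project.toml
[sources]
SubA = {path = "packages/SubA"}
SubB = {path = "packages/SubB"}
```

```toml
# packages/SubA/Project.toml
name = "SubA"
parent = "ParentProjectName"  # proposed: walk up to the parent Project.toml
                              # and inherit its [sources] entries

[deps]
SubB = "..."  # UUID elided; the path comes from the inherited [sources]
```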

There are still a few awkward points we’re working through to make working in a monorepo seamless, including:

  • When you change the dependency list of a subpackage, you then need to call Pkg.resolve() in the parent environment to ensure the top-level, checked-in Manifest.toml records the dependency change (see the snippet after this list)
  • How exactly to deal with [compat] sections between the parent project and subpackages
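
For the first point, the workaround presumably amounts to something like this (paths hypothetical):

```julia
using Pkg
# after editing packages/SubA/Project.toml's [deps]:
Pkg.activate(".")  # the monorepo's top-level environment
Pkg.resolve()      # re-records the dependency change in the checked-in Manifest.toml
```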

We don’t see these as nearly as painful as the other 2 aforementioned pain points, which is why we’re limiting the PR to just the 2 proposed changes, but it would be nice to improve things even further in the future.

It’d be great to hear if this kind of setup also sounds helpful to others.

5 Likes

Thanks everyone for sharing their experience, I love reading how others tackle this organizational challenge!

At ASML we set up an entire “inner-source” ecosystem across multiple repositories; we currently have 140 packages with a local registry. I finally had time to reflect and write down some of our design, so I turned it into a blog post: Building a Scalable Inner-Source Ecosystem For Collaborative Development

I agree with the statement about Pkg. And kudos to LocalRegistry.jl. Without those I would have struggled tremendously to kickstart our ecosystem. I’ve witnessed the complexity of internal PyPI systems, that’s no fun at all.

14 Likes

We (Zipline) have a similar setup to RelationalAI in that we have a monorepo and we kinda manually build Manifests ourselves.

We were using an internal registry up until recently, but we decided to move away from that for a few reasons. First, it takes a really long time to load code from a private repo via the package manager vs. just manually cloning the monorepo. Second is just that most people in our org just don’t like the multi-PR-and-register workflow that comes along with using a package registry. Personally, I like it because it matches the workflow I’m used to developing packages outside of work, but I see where they’re coming from. Like @iamed2 said, it really takes away the benefits of working in a monorepo.

Our temporary setup and workflow (until something like @quinnj’s PR goes through) looks like this:

  • No checking in Manifests (for packages, at least; end-user analysis projects still check in Manifests). We started by checking them in and then running a script that would resolve them in the right order, but this started breaking in ways that were really tough to debug. So now no Manifests for packages.
  • Run a setup script that calls add on all packages from the General registry and dev on the local path of packages in the monorepo.

It kinda works, but there are a few pain points:

  • Users have to call a setup script and wait for the environments to be built before doing anything. This can be especially painful when changing branches often.
  • As @quinnj noted, changing any dependencies in the chain requires calling resolve on everything upstream of the change. It’s usually easier to just rebuild the Manifests with the setup script again.
  • Test environments can be difficult to work with when you have local dependencies. Honestly, in most of our packages we tend to just skip test environments and add test dependencies directly to the package environment (which makes me sad).

If anyone has any suggestions, I’d love to hear them!

5 Likes

For that, I suggest you have a look at git worktree. That’s its very purpose, namely, having many branches “open” at the same time… so no more branch-switching.

1 Like

Many thanks for these explanations! I find this very interesting.

Could you please share a bit more detail about this? In particular, I’m intrigued by the relationship between such a script and the Project.toml files: does the setup script rely on the info provided in Project.toml? Does it complement it? Or is there some redundancy/repetition between the Project.toml files and the setup script?

Could you please share a bit more detail about this?

Definitely! It’s basically a function that crawls through the monorepo, parses the Project.toml files it finds, and writes that to a Dict that’s basically package_name::String => metadata::NamedTuple{(:dependencies, :path), Tuple{Vector{String}, String}}. It then goes through each of the packages in the list, filters out any dependencies from the General registry, recursively calls itself on all of the remaining (local) dependencies, then passes the list of local dependencies to Pkg.develop. I’m skipping a few details, but that’s how it generally works.
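
For the curious, here’s a self-contained sketch of a script with that shape (helper names and layout assumptions are mine for illustration, not our actual code):

```julia
using Pkg, TOML

# Crawl the monorepo for Project.toml files and build the
# package_name => (dependencies, path) map described above.
function find_local_packages(root::AbstractString)
    packages = Dict{String,NamedTuple{(:dependencies, :path),Tuple{Vector{String},String}}}()
    for (dir, _, files) in walkdir(root)
        "Project.toml" in files || continue
        project = TOML.parsefile(joinpath(dir, "Project.toml"))
        haskey(project, "name") || continue  # skip anonymous environments
        deps = collect(keys(get(project, "deps", Dict{String,Any}())))
        packages[project["name"]] = (dependencies = deps, path = dir)
    end
    return packages
end

# Recursively collect the monorepo-local dependency paths of `name`;
# anything not found in `packages` is assumed to come from a registry.
function local_dep_paths!(paths::Vector{String}, name::String, packages)
    for dep in packages[name].dependencies
        haskey(packages, dep) || continue  # registry dep; plain Pkg.add territory
        p = packages[dep].path
        if p ∉ paths  # avoid revisiting shared dependencies
            push!(paths, p)
            local_dep_paths!(paths, dep, packages)
        end
    end
    return paths
end

# Dev `name` and all of its local dependencies into the active environment.
function setup(name::String; root = pwd())
    packages = find_local_packages(root)
    paths = local_dep_paths!(String[], name, packages)
    push!(paths, packages[name].path)
    Pkg.develop([Pkg.PackageSpec(path = p) for p in paths])
end
```

Calling Pkg.develop once with the whole list lets the resolver see all the local packages together, rather than re-resolving after each individual dev.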

3 Likes