Large vs Small Packages

question

#1

I’m considering splitting a large monolithic package into 3-4 smaller ones. The thought was they might be easier to maintain, migrate, test, etc. But I am a bit concerned that the “admin overhead” will actually increase. E.g. I’ve been watching the Optim packages a bit and it didn’t look all that straightforward.

It would be very helpful (hopefully also for others) to hear people’s experiences with similar situations and the choices they’ve made and why.


#2

I am surprised that you would think that. Having everything in one place has the following advantages:

  • All tests are run together. With split packages, a change in your package might break something in another so you need to worry about compatibility between packages.

  • Easier to release, you release once and that’s it.

  • Less “burden” on the ecosystem, fewer releases being made, less stuff for the resolver to look at etc. Not a big deal but at least something to consider.

It also has the following advantages for users / developers:

  • All the source code in one place, easier to get an overview of how the components of the package fit together.
  • All docs in one place (or alternatively, no need for a separate doc repo).
  • All issues in one place, easier to find similar problems.

The negative for users:

  • Possibly longer load times if they happen to need only one of the split-out packages.

Unless there are strong reasons to split, I would keep the stuff together. There was probably a reason you put the functionality in the same package to begin with.

I have split a package once and that was when I split out https://github.com/KristofferC/Crayons.jl and https://github.com/KristofferC/Tokenize.jl from https://github.com/KristofferC/OhMyREPL.jl. I am happy with these because these two packages were very isolated and are quite useful on their own.


#3

I just split one of my packages into 4. @kristoffer.carlsson was right about the pains he mentioned, but… for me it was more or less the fact that each of the 4 parts depended on a different set of packages. So having it all in one meant a huge list of dependencies, having 4 meant 4 neat separate ecosystems.
I’m still unsure whether I did the right thing.


#4

This may be useful if others depend on the package, but your “final” reexporting package will still depend on all the same packages.

It has been somewhat useful to split functionality out from Optim in some sense, but really it’s a huge burden. All the packages have to fit together anyway. That said, the long term plans will still lead to a state where the refactoring makes sense (LineSearches being its own package is one such advantage). We could merge LsqFit, NLsolve, and Optim I guess, but then the name seems a bit odd. I’m not going through another pkg refactoring anytime soon, that’s for sure.


#5

I would base this decision on whether there is a functional, self-contained collection of code that is potentially useful to others independently of the rest of the original package. If yes, split; if not, or when in doubt, don’t.

The admin overhead will increase, but it should be small compared to the advantages above.


#6

Yes, this. With DiffEq, we got it wrong at first and it increased the admin overhead. Over time, we got it right and there are many advantages. For one, the complete test set on just the ODE solvers takes more than 60 minutes, so it needs to be split in Travis anyway. Coming up with schemes to test everything together was a nightmare.

But secondly, the code got really screwy when trying to cram in ODE solvers with SDE solvers and an FEM toolkit and some parameter estimation and … it was just too much. There was a huge burden of entry which blocked contributors from joining. In DiffEq you’ll see that most of the contributors and GSoC projects work on the “add-on” parts like parameter estimation which just require being able to use DiffEq, and one of the reasons for this is because these are small packages which are really easy to get a hold of. A giant package is intimidating!

The main reason why we split (now more than a year ago) was because of suggestions by many users. Basically, everyone wanted to depend on one thing, and our dependency list kept increasing. Some people wanted only the native Julia ODE solvers, others only the SDE solvers, others only Sundials, others wanted the whole stack. In one repo, there’s no way to make a choice like this which means sooner or later you become a huge dependency. Notice that almost all dependencies on DiffEq are actually on OrdinaryDiffEq.jl and not DifferentialEquations.jl, and that’s on purpose.

In the end, there are a lot of advantages if you get it right. It took us a while to get it right. The main thing is that you should make sure that test dependencies go one way. OrdinaryDiffEq relies on DiffEqBase, so DiffEqBase shouldn’t rely on OrdinaryDiffEq for testing. Such a circular test dependency makes interface-breaking changes impossible, because neither package can pass its tests until the other has been updated. Instead, DiffEqBase now tests the interface with a small testing Euler method set up in the repo. That breaks the circularity and allows DiffEqBase to change, pass tests on its own, release with an upper bound, then propagate the change. If the interface changes at the Base level are also rare, then this is a pretty smooth process.
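As a rough illustration of the one-way test dependency described above, here is a minimal sketch of a base package testing its own interface with a toy Euler solver kept in its test folder. The names `ToyProblem`, `AbstractToyAlgorithm`, `toy_solve`, and `TestEuler` are hypothetical stand-ins, not the actual DiffEqBase API.

```julia
using Test

# Hypothetical interface owned by the base package (not the real DiffEqBase types).
abstract type AbstractToyAlgorithm end

struct ToyProblem{F}
    f::F                              # right-hand side of u' = f(u, t)
    u0::Float64                       # initial value
    tspan::Tuple{Float64,Float64}     # (start time, end time)
end

# Generic entry point; downstream solver packages would add their own methods.
function toy_solve end

# Test-only solver that lives in the base package, so its tests never need
# to load a downstream solver package.
struct TestEuler <: AbstractToyAlgorithm end

function toy_solve(prob::ToyProblem, ::TestEuler; dt::Float64)
    t0, t1 = prob.tspan
    nsteps = round(Int, (t1 - t0) / dt)
    u, t = prob.u0, t0
    for _ in 1:nsteps
        u += dt * prob.f(u, t)        # explicit Euler step
        t += dt
    end
    return u
end

# Interface test: solve u' = -u on [0, 1] and compare with exp(-1).
prob = ToyProblem((u, t) -> -u, 1.0, (0.0, 1.0))
u_end = toy_solve(prob, TestEuler(); dt = 1e-3)
@test isapprox(u_end, exp(-1.0); atol = 1e-3)
```

With a setup like this, the base package can change its interface, update only the toy solver, and pass its own tests before any downstream package has been touched.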


#7

Everybody: thank you for the very helpful responses. I take away from this that some good reasons to split a package are

  • there is a clear case for independent use of a sub-module
  • making life easier for contributors

I’ll probably take some time now to reconsider this, or at least go much more cautiously.

@ChrisRackauckas : a huge test set is one of my issues, how do you split it?


#8

Different repos or use Travis environments. For the latter, see:

Gives three different tests:

https://travis-ci.org/JuliaDiffEq/OrdinaryDiffEq.jl

and they are distinguished by environment variables:
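A minimal sketch of how a single `runtests.jl` might select a test group from an environment variable, so that each CI job runs only part of the suite. The variable name `GROUP` and the group names below are assumptions made for illustration, and the placeholder tests stand in for real `include(...)` calls.

```julia
using Test

# Assumed convention: each CI job sets GROUP; an unset GROUP runs everything.
group = get(ENV, "GROUP", "All")

# True when the named group should run for the given GROUP value.
should_run(name, selected = group) = selected == "All" || selected == name

# Hypothetical group names; each would normally include its own test files.
test_groups = ["Interface", "Integrators", "Regression"]

ran = String[]
for name in test_groups
    if should_run(name)
        @testset "$name" begin
            push!(ran, name)
            @test true    # placeholder for include("<group test file>")
        end
    end
end

println("Ran test groups: ", join(ran, ", "))
```

Each CI job then sets a different value (for example `GROUP=Interface`), and running locally without the variable set executes all groups.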


#9

great - thank you!


#10

I would also favor splitting the packages, but under the following conditions:

  1. Packages are self-contained. For example, X.jl provides one piece of functionality, a suite for one purpose, or one struct.
  2. Users or packages may just need/want a subset of the code.
  3. You could still use a master module that just calls and manages the other ones, or the sub-ecosystem would benefit from having a base package (provides common inference, API, etc.)

In general, I favor smaller packages.

#11

How do you “group” the smaller packages though, a parent folder in ~/.julia/v0.6/, a github organization, other?


#12

If by group you mean for development and maintenance, probably through a Github org. For functionality, pipeline, battery or whatnot, have the main module call them.

module MainModule
    using SmallPackageA, SmallPackageB, SmallPackageC
    # Code
end

#13

Reexport.jl is nice for this. DifferentialEquations.jl is mostly an empty repo that just packages everything together and adds some org-wide default handling:
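To make the idea concrete, here is a plain-Julia sketch (assuming Julia 1.x) of what a reexporting metapackage does. Reexport.jl's `@reexport` macro automates the `export` step; it is written out by hand here so the example runs without extra packages. The module and function names are hypothetical, and this is not the actual DifferentialEquations.jl source.

```julia
# Two small "packages", modelled as local modules so the sketch is self-contained.
module SmallPackageA
export solve_a
solve_a(x) = 2x
end

module SmallPackageB
export solve_b
solve_b(x) = x + 1
end

# The metapackage: loads the pieces and re-exports their public names.
module MainModule
using ..SmallPackageA, ..SmallPackageB
# With Reexport.jl this would be `@reexport using SmallPackageA, SmallPackageB`.
export solve_a, solve_b
end

# A user only loads the metapackage, yet sees the names of the small packages.
using .MainModule
println(solve_a(3))
println(solve_b(3))
```

Users who later want a lighter dependency can switch to `using SmallPackageA` directly without changing any call sites.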


#14

To me, it is quite telling that the first thing people try to figure out how to do, after fracturing their package, is how to put it together again.


#15

That says nothing because you kind of have to do it to keep compatibility unless you plan on deprecating the larger package.

What is more telling is who is using the metapackage. In DiffEq, it seems that sooner rather than later most people “graduate” to using individual pieces (usually a specific solver library). This tends to be quite helpful in chats too since using DifferentialEquations has a high correlation with “is a new user” (I actually only write that for tutorials :smile:). But that doesn’t make the metapackage useless. I think at this point it’s clear that the sense of completeness and branding offered by the metapackage and shared documentation has achieved its goal of giving a solid entry point for new users. Entry points for users and developers don’t have to be (and probably shouldn’t be) the same, since they have vastly different requirements.

I think the issue that many people completely miss is that a package, or any piece of software, is far more than “what it does”. If Julia was only a Github repo, nobody would be using it. “Julia” and “Julialang” reference not only the code, but the websites and forums, the Twitter hashtag, the documentation, the standard library, and the governance. We use Julia because of the combination of these. When JuliaStats presents MultivariateStats.jl, it doesn’t answer the higher-level questions: who’s the community I can engage with, is there a governance structure which will keep this going in the future, and how does this relate to other tools? Putting it in a Github organization helps get rid of “this is just one person’s weekend project from 2 years ago that he/she doesn’t care about anymore”, but only a little, since there are a lot of abandoned projects in organizations. Also, “going to Github” is an action which I would already consider a developer thing rather than a user thing.

To give a sense of maturity and stability, you need to have a single landing page that says “this is what we have, and these are our community channels”, and it needs to be pooled enough to have high activity. Yet at the same time, no developer ever wants an 80 package dependency for a single ODE solver or a dimensional reduction technique, so the modular structure is necessary in conjunction with the storefront.

I hope that other Julia developers start being more conscious about this issue. JuMP, Juno, Plots, and DifferentialEquations.jl aren’t any more amazing than some of the stuff going on in JuliaStats or machine learning, yet the former have this “landing and branding with dev targets” that helped them become more well-known while the latter have a bunch of “oh, I didn’t know about Distributions.jl” (an amazing enough library that if properly showcased can definitely be seen as an R-killer, but instead it looks like an obscure developer’s tool). I see Flux.jl moving in the right direction, while I see other areas like statistics currently being highly resistant to “unionizing”.

Of course this is just my opinion on how the ecosystem has evolved, but I think that when looking at it in this light there’s quite a large amount of evidence that becomes visible. And I’m not going to say it’s perfect or even a good choice for every project, but it has been demonstrated as a good way to scale a project.


#16

I’m sorry, but while these are good comments, I don’t see how they are at all related to the topic. Nothing of this seems to have anything to do with whether one should use git-repos as a substitute for folders (said in jest :slight_smile:).


#17

This topic has been discussed in detail already multiple times before