Fixing Package Fragmentation

These are the sort of decisions I absolutely hate working on. It’s mostly bike-shedding until you reach a reasonable conclusion, then you have to spend the next several months to a year explaining to users over and over why they are seeing deprecation warnings and why you didn’t use the special syntax they would prefer.

4 Likes

I have experience both in choosing to work together and in preferring to not work together. Perhaps sharing my experience can help shed light on some ways forward.

For the former, let’s take ImplicitDifferentiation.jl (gdalle/ImplicitDifferentiation.jl) and ForwardDiffChainRules.jl (ThummeTo/ForwardDiffChainRules.jl) as examples. The core of these 2 packages started as features in my NonconvexUtils.jl (JuliaNonconvex/NonconvexUtils.jl), and they were not very well tested or documented yet. The current owners of the 2 packages above reached out and expressed interest in starting a package for each of those features. I supported that and started contributing to their packages because why not? The 2 projects turned out great and ended up with more users, stars, and PRs. Everyone is happy. In this case, the key was the natural coalescing of interests and the willingness to give up a little bit of control over a package in exchange for more dev time to be put into the package. Today I don’t control any of these packages and I am fine with being just a contributor.

The second example is Nonconvex.jl (JuliaNonconvex/Nonconvex.jl) vs Optimization.jl (SciML/Optimization.jl). Optimization.jl has more pull because it’s a SciML project and it follows the standard SciML API. Many people like that. I don’t and that’s ok. Both Nonconvex.jl and Optimization.jl had similar development timelines and I could have chosen to contribute to Optimization.jl and abandon Nonconvex.jl at any point, but I chose not to. My reason is that I wanted to have the ability to experiment with new APIs, new optimisation algorithms and AD hacks, and I wanted to have full control over the package. In this case, I wasn’t willing to relinquish control, and my interest in trying new things in Nonconvex.jl did not naturally align with the interests of the Optimization.jl devs in following the SciML API. There were more differences between the 2 packages as well, so I am over-simplifying things a little.

I think if we can learn anything from these 2 experiences, it’s that when we have limited resources, getting together may end up being good for everyone. But no one should feel pressured to contribute to another package instead of starting their own. Having more channels where people of similar interests can reach out and collaborate would be a great thing. Perhaps JuliaCon should not be just a yearly thing. If there is a JuliaCon every 2 months online with different themes, I will be happy to listen in. This may be a way to break silos and get more people to reach out and collaborate.

23 Likes

And that’s totally fine! I think the two are moving in different directions and will complement each other.

The reason for not including Plots.jl is actually technical. There’s a redirect on its main page that sends you to the wrong site. It’s from a very early version of Documenter and I just haven’t patched that in the gh-pages of Plots.jl yet.

I think in general that’s fine, but I also think that in practice people don’t really understand the effort and patience required to maintain a package over years. I do see some packages that are nice small personal projects where I’m like “please put this in an org so it’s still up to date 4 years from now”. Simple things like bumping dependencies, reporting upstream regressions, continuing benchmarks, etc. It’s not hard, but we do need to ensure that the satellite projects keep that up. I think some people don’t move something into an org because they think it will look good on a CV or publication to be some personal project, but… you can publish and we can still maintain it :sweat_smile:. It’s this unsexy work, continuing to maintain and improve the documentation, answering on Discourse and keeping an FAQ that grows over time, that’s the stuff that makes a package eventually mature, and we should be a bit more diligent about helping people, especially graduate students, understand this longer-term process.

This is why I’m so adamant that all GSoC projects are in orgs with other maintainers. Some day, maybe 10 years from now, someone may get bored or just busy, and when that day comes, having a structure to train the next maintainer is essential.

But then again, not every great project needs a large org and a lab. A lot of projects are perfectly fine as personal projects with one professor maintaining it for years.

11 Likes

I understand that but an author can also just add more contributors to the project with merge rights. Worst case scenario, an org can fork a project (or copy-paste it and give credit) if a project is abandoned but is still important to an org. So there is more than one solution to this problem. Educating people about maintenance burden is definitely beneficial though. Maybe everyone who wants to register a package can be asked to watch a video or read an article explaining what it takes to maintain a package long-term.

If there are too many abandoned packages, we may consider having a formal process for unregistering packages from the General registry. Another suggestion to “solve” the fragmentation issue for users is for some people to maintain a new registry. If I maintain a “SuperHighQualityRegistry” which has only packages that I deem to be of high quality, and if people trust my opinion enough, this registry may become popular. Then I can have monolithic documentation for all of the packages in my registry. Creating a new registry in Julia today is almost as easy as creating a new crypto. More people should do it!

4 Likes

My response was flippant, because I consider the proposal (especially the part about forcing merges, or making it more difficult to create new packages) absurd.

But I should have explained that in detail, instead of resorting to sarcasm. I will do that now.

Similar packages addressing the same functionality usually exist because even if they provide similar functionality, the trade-offs between code complexity, speed, and generality are addressed differently. E.g., to read tabular data, you have DelimitedFiles.jl, recently uncoupled from Julia, and CSV.jl, and a couple of other packages. The first one is simple and not optimized for large data; the second is quite complex and features lazy reading and a very fast parser. They do the same thing, more or less, but they do it very differently. The first one should be very easy to contribute to, as the code is quite simple, but the second one would require a much larger initial investment in understanding the code.
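
To make the simple end of that trade-off concrete, here is a minimal sketch of reading a small comma-separated file with DelimitedFiles.jl. The file contents and variable names are made up for illustration. The CSV.jl call in the trailing comment is the usual counterpart and assumes CSV.jl and DataFrames.jl are installed.

```julia
using DelimitedFiles

# Write a tiny, made-up table to a temporary file.
path = tempname() * ".csv"
write(path, "a,b,c\n1,2,3\n4,5,6\n")

# readdlm parses the whole file eagerly into a dense matrix.
# With header = true it returns a (data, header) tuple.
data, header = readdlm(path, ',', Float64; header = true)

println(size(data))
println(vec(header))

# Rough CSV.jl counterpart (not run here; needs CSV.jl and DataFrames.jl):
#   using CSV, DataFrames
#   df = CSV.read(path, DataFrame)
```

The DelimitedFiles version hands back a plain matrix with one element type, which is part of why its code can stay small. CSV.jl infers a type per column and returns a table, and that extra generality is where much of its complexity goes.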

The same applies to AD libraries. There have been more than 10 experiments so far with reverse-mode AD. Each led to knowledge that was used by the authors of the next (frequently the same people), yet not all of them were abandoned since we still do not have a robust fire-and-forget reverse-mode AD solution. There is no meaningful way these packages can be “merged”; perhaps at some point one solution will emerge as dominant, and the rest will fade a bit, but since existing code uses them it still makes sense to fix some issues, so they will be around for a while.

Consolidation does happen in the Julia ecosystem, but it is a slow process, and usually involves a lot of work. Simply appointing a “czar” is not going to make this magically happen any faster. When you have packages that are functioning OK and each has a set of users, consolidation requires that their users' code is refactored, imposing a cost on them (unless, of course, they pin versions, but then they get no updates). The alternative is a unified API layer package, or some other similar solution. There is no free lunch here.

At the same time, code that is independently useful is frequently factored out to smaller packages, which is a good thing. This may look like “fragmentation”, but it usually makes code easier to maintain and improve. An example is LogExpFunctions.jl, which was factored out from StatsFuns.jl about two years ago, and received many high-quality PRs since.

People who think that the community should restrict the General registry in any manner (beyond the existing minimal requirements) may be missing the fact that it is very easy to start your own registry. If some users are willing to maintain a registry of curated packages, they can do so today, without any hassle.

32 Likes

That’s really a GREAT idea!

But that relies on luck, which I’m saying a (well run) org can help limit. Organizations like SciML can take stock of what’s going on and try to be proactive rather than reactive. For example, one of the big issues in the last year was that we had lost contact with Kirill, maintainer of NeuralPDE, due to world political events. Because of this and the interest in the library, I gave the library a bit of a refresh, moved a bit of CZI funds to get a new maintainer on there, and focused 4 of the GSoC projects towards this library in hopes of training the next batch of maintainers. This kind of targeted action doesn’t tend to happen without some kind of structure behind it. I guess a professor can specifically look for a new PhD student whose interests align with a library that has been left behind, but it’s much easier to have a larger pool of resources.

7 Likes

There is 0% luck involved in forking a project, it always works :sweat_smile: Of course, I see the value in being affiliated with a well funded org but I am just saying that no one should feel pressured to join an org. A GitHub org is just a group of 1 or more contributors. So it’s not the org that matters, it’s the people and money behind the org that make the difference.

1 Like

And it takes a lot of knowledge. I don’t think it’s even in general possible. This is why I chose to do SciML and I’m pretty adamant about not extending that to “normal ML” or data science. Those are outside my field of expertise. I won’t do as well if I try to do that. I’m not saying Julia shouldn’t have someone do something similar for deep learning or data science tooling, I’m saying that there’s more than enough on my plate already (there’s more than a few libraries in SciML I am not happy with) and someone else should take those domains.

R has CRAN Task Views of this form with domain-specific authors:

I think there’s a lot to learn from that.

3 Likes

We’ve had this discussion (more focused on statistics) already:

3 Likes

Most of this has been touched on before but that doesn’t mean it’s not worth revisiting with renewed energy and direction; but I’m not sure how helpful it is to try to address all of these problems that require different solutions in a single thread.

If it makes anyone feel any better, this did kick me into spending all night digging through my old code so I could fix some of my newbie mistakes made in ArrayInterface.jl and support very simple dimnames and index_labels methods, like what you see in DataAPI.jl, in the future.

8 Likes

Unfortunately, it’s a bit backward. Most of the packages that we’d like in our curated registry are already in the General registry. What would be more useful would be a copy of General with all the packages removed that have no README and are only there to share a repo among members of a lab, none of whom know how to maintain a local registry… OTOH, who knows, maybe maintaining a registry of what one considers relevant packages is worth trying despite the duplication.

Here are a couple of local, low-energy things you can do.

To avoid putting things in the general registry prematurely:
I maintain a public registry (this is super easy with LocalRegistry.jl) for packages that don’t belong in or are not ready for General. I sometimes have packages that depend on these, so the README says you have to install this registry. I haven’t seen anyone else do this (I’m sure someone has). I don’t know how it would work out if it were more common to see smaller registries in use publicly.
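
For anyone who wants to try this, the workflow with LocalRegistry.jl looks roughly like the sketch below. The registry name, the repository URL, and the package path are placeholders, and the remote repository is assumed to exist already and be empty. It needs LocalRegistry.jl installed and network access to the remote.

```julia
using LocalRegistry
using Pkg

# Placeholder values: substitute your own registry name and git remote.
registry_name = "MyRegistry"
registry_repo = "git@github.com:your-user/MyRegistry.git"

# One-time setup by the maintainer: create the registry and push it to the remote.
create_registry(registry_name, registry_repo;
                description = "Packages not ready for General",
                push = true)

# Register (or later update) a package from its local checkout.
# The path is a placeholder; the package must be a git repository
# with a Project.toml that has a name, uuid, and version.
register("path/to/MyPackage"; registry = registry_name)

# What users of the packages run once, before the usual Pkg.add:
Pkg.Registry.add(Pkg.RegistrySpec(url = registry_repo))
```

After the last step, packages from the extra registry install by name just like packages from General, which is why the README only needs one extra line of instructions.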

This is obvious and well known, but I’ll repeat it for this thread. To aid in discoverability and aid in evaluating overlap and consolidation possibilities:
Like in a science paper, you can put a few sentences in the readme putting the package in context in the larger ecosystem. If you don’t want to spend the time, you can at least add a list of related packages. Or if you don’t want to spend even that time, you can do what I did with an enums package that I put in the general registry; add a link to one package and say something like “See EnumsX.jl and packages referenced in its README”. The last one is still kind of negligent, especially for the fourth or fifth enums package, but much better than saying nothing.

1 Like

What’s the benefit of maintaining a personal registry for packages that are already in the General registry?

What are the externalities of registering packages in the general registry? Should I feel bad about the dozens of packages I already registered?

2 Likes

In general, I think it’s nice to avoid polluting the registry with half-baked projects. I’ve definitely done this before when getting over excited about something (add it to the list of things I lie awake at night thinking about). I think there’s a stronger case to be made for avoiding registering packages that are clearly just a minimal implementation of some concept to get an idea out there. Those sorts of projects tend to change over a week when people start giving input.

I’m not trying to make a push for or against anything here. Just sharing my personal experience with this.

4 Likes

My feeling is only against name clashes. Ideally I would like that we could prefix package names by organization, as for example

using SciML/DifferentialEquations

Something like that would allow organizations to gain reputation and give reputation to packages, without introducing barriers for new contributors (they already can do that, but less explicitly)

I am imagining that there could even be name clashes when the package is registered bound to an organization. For instance, someone could register something that would be referred to as, for example,

using MyResearchGroup/DifferentialEquations

In the long run it is probably unavoidable that a lot, if not most, of the packages in the registry will be deprecated. It is sad that the prettier names will be taken.

9 Likes

Because of UUIDs it might be possible to recycle names eventually.

I think organization prefixes might more readily belong in the environment space. I’m not sure how one would differentiate two packages with the same name but different UUIDs.

If a package is officially abandoned, one can use the same UUID and release a breaking version. That has happened already; I remember seeing one case. But it doesn’t feel like that can happen safely except in rare cases.

I meant another case. Imagine a long-deprecated package called VeryCoolPackage.jl that no one uses anymore, so we deregister it ten years after recording the last dependent.

Later someone new creates a new package, also called VeryCoolPackage.jl, with a new UUID. This should not be a problem since older projects could still reference the old package by the original UUID if needed.
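
A small sketch of why this could work: Pkg tracks a package by its UUID, and the name is only a label attached to it. The package name comes from the example above, and both UUIDs are randomly generated stand-ins, not real registered packages.

```julia
using Pkg
using UUIDs

# Stand-in identities for the deregistered package and its later namesake.
old_uuid = uuid4()
new_uuid = uuid4()

# Two package specifications that share a name but not a UUID.
old_pkg = Pkg.PackageSpec(name = "VeryCoolPackage", uuid = string(old_uuid))
new_pkg = Pkg.PackageSpec(name = "VeryCoolPackage", uuid = string(new_uuid))

same_name = old_pkg.name == new_pkg.name
same_identity = old_pkg.uuid == new_pkg.uuid

println("same name: ", same_name)
println("same identity: ", same_identity)
```

An old Manifest.toml records the UUID alongside the name, so in principle it keeps pointing at the original package, provided the source for that package is still reachable somewhere.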

As merely an (irregular) user, I find that package fragmentation can make it confusing to search for packages. My suspicion is that the ecosystem could be healthier overall (=larger bus factor) if there were stronger tendencies to merge efforts instead of creating separate packages.
However, given that Julia and its ecosystem are quite “academic-affiliated” in nature, I suspect the incentives are a little aligned against that – I think it’s easier to get published when creating your own separate package than for “look, I made a series of big PRs against otherpackage”. :person_shrugging:

One example that I recently encountered: I stumbled upon GlobalSensitivity.jl (SciML/GlobalSensitivity.jl), described as “Robust, Fast, and Parallel Global Sensitivity Analysis (GSA) in Julia”. It uses “GSA” in its README, which I assumed to stand for “Global Sensitivity Analysis”.
Getting curious about the package name, and searching, I found GlobalSensitivityAnalysis.jl (lrennels/GlobalSensitivityAnalysis.jl), described as “Julia implementations of global sensitivity analysis methods”. That package seems to have occupied the clearer name already, so from here it looks like the GlobalSensitivity.jl authors were probably aware of it. GlobalSensitivityAnalysis.jl has existed for one year longer, and has a similar number of commits and similar activity. I wondered why a separate package was created, instead of pooling efforts.

From skimming the docs, it seems the SciML one has more methods implemented. No indication why those could not have been added as PRs to the already existing packages. I could not find any indication of an effort to join forces. The SciML one has a published paper, but its “Statement of Need” does not mention the existence of the other package.

At this point I deferred a deeper trade-off analysis until I really needed such a package. I walked away wondering if the maintenance situation (in the always maintainer-strapped OSS world) would not be better if people were joining efforts more (so yeah, this topic). If there were gentle incentives from the ecosystem/culture towards that, I guess this would not be a bad thing.
But then, the real world is complicated and full of humans. From another ecosystem and software area that is littered with small and subsequently abandoned projects, I already know that achieving de-fragmentation is a very hard thing. Maybe orgs reaching critical mass (=>SciML?) and ending up dominating is the most promising avenue?

5 Likes