Fixing Package Fragmentation

With regards to the specific case where there are multiple standards that do the same thing: I think there's been a push to encourage creative development of solutions, but there comes a time when we should start consolidating, and people are too busy (or maybe just uninterested) to do the less interesting, arduous work of going through each line of code and merging everything into a single package.

2 Likes

This seems like a pretty flippant response for a common problem people have when trying to catalog, understand, and use the Julia ecosystem. Obviously having a single mega-package is a bad idea. It would also be a bad idea to make every single line of code into its own “Package.” If you’d like to argue the Julia ecosystem is too cohesive and monolithic, or at exactly the right level, I’d be happy to see you make that argument in another thread.

15 Likes

I'll just copy-paste here my response from the last time you posted on Slack saying that we should consolidate packages because you thought it'd make ChatGPT work better on Julia code:

It's open source. People create things they find useful. Telling people to stop making packages with overlapping functionality because it'll benefit LLMs is just absurd.

The second sentence's reason can be substituted with whatever other reason you have now decided that we should stop making, or start deleting, packages.

4 Likes

These conversations pop up every so often. It always comes down to discoverability and usability. Instead of having a consolidation czar, we could just as easily have a discoverability czar.

10 Likes

Can we split the conversation about whether consolidation is good into a different thread?

I'd bet there are a lot of "import numpy, scipy, matplotlib" types of people out there - I'm one of them. From what I'm seeing, SciML is the Julia equivalent of SciPy, at least from a documentation standpoint, and does a pretty good job of consolidating a bunch of packages into one place where they can be easily discovered.

I'm not aware of an equivalent document for the plotting ecosystem. Matplotlib is the standard in Python, so it doesn't have this issue. Plots seems to be the de facto default plotting package, and Makie seems to be up-and-coming for that position, at least from what can be gauged from the SciML docs; it would be nice to have all the other plotting packages in one document to aid discoverability. I get why the Julia Base manual can't endorse Plots, but it should still be clear enough to a beginner that Plots (or maybe Makie) is the go-to package. Maybe the plotting section in the SciML docs could be expanded for this purpose, but I don't know if it's the ideal location.

4 Likes

I think it's part and parcel of a discussion on how to "fix" fragmentation, honestly. There are many tacks that can be taken to "fix fragmentation" here. For example:

  • The General registry should have a higher bar and be more curated.
  • There should be a Cathedral/Bazaar model where there’s a curated registry alongside the general one.
  • There need to be more documentation efforts to unify ecosystems (a la numpy/scipy/SciML) or other such curated lists (perhaps in the style of awesome-X or the like).

All of them would "fix" fragmentation, but they are all very different approaches and they would all require significant work and buy-in. For example, PyPI has an even lower bar than Julia's General registry, but the Python ecosystem solves this with monolithic packages. SciML doesn't use monolithic packages, but instead has monolithic documentation.

It's worth looking at some previous discussions: How to know which Julia package to trust? and How to know if a package is good?

13 Likes

This kind of feels like two separate issues:

  1. Big packages that are broken up into several small ones, with perhaps a top-level package re-exporting their methods. Most of the time a single contributor or a small team maintains all of these packages.

  2. Many packages trying to achieve similar functionality (plotting, data formats, interfaces, AD), mainly developed by different people with only somewhat overlapping user bases.

I must admit I particularly dislike the first approach, where I have to look through many sub-packages to see where methods are defined. In that sense I prefer the monolithic approach taken by SciPy. A larger issue I have with Julia packages is that it can be much easier for me to quickly read libraries like BLIS and SciPy and understand what is going on and where things come from than it is for some Julia packages. This (for me) creates a lot of friction when contributing to some of these packages. It could also be part of the cause of the second issue: it can be easier and more rewarding to work on your own packages. I don't think that is a bad thing!

Now, the ecosystem split things up this way mainly as a means to reduce latency (pre v1.9), which worked very well. I have considered breaking up Bessels.jl into many small sub-packages for different functions (Airy, Bessel, Gamma), but I have resisted because it is personally much easier for me to develop in a big mono-repo. I also find it easier to maintain a single set of documentation, a single place to discuss issues, and a single CI setup, and to attract users as well as pool contributors. I don't think that would hold if the code were scattered among many different packages. A mono-repo also provides some stamp of implementation quality: if an implementation is in SciPy, I am pretty sure it has some level of quality and testing.

I think the release of v1.9 has opened many different possibilities. I of course realize Bessels.jl is pretty much the perfect example of the advantages of the new caching code. I have shown this before but I’ll give a new example with the recent stable release.

# Version 1.8.2 (2022-09-29)
julia> @time @eval using Bessels
  0.030072 seconds (45.18 k allocations: 4.904 MiB)

julia> @time @eval airyai(1.2 + 1.1im)
  0.528501 seconds (962.46 k allocations: 48.968 MiB, 2.77% gc time, 99.91% compilation time)

# Version 1.9.0 (2023-05-07)
julia> @time @eval using Bessels
  0.030401 seconds (34.75 k allocations: 3.267 MiB)

julia> @time @eval airyai(1.2 + 1.1im)
  0.000290 seconds (62 allocations: 3.125 KiB)

Pre v1.9 I seriously weighed the cost to the user of adding new functions in each release and how it affected package load time and time to first function evaluation. Now I am really just considering how many functions I should explicitly precompile and their effect on cache file size. I believe a lot of the effort in v1.10 and beyond might go into reducing these file sizes.
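For anyone curious what "explicitly precompile" looks like in practice, here is a minimal sketch using PrecompileTools.jl. The module and function below are invented placeholders for illustration, not actual Bessels.jl code:

```julia
# Sketch of an explicit precompile workload with PrecompileTools.jl.
# `MySpecialFunctions` and `airyish` are made-up names for illustration.
module MySpecialFunctions

export airyish

# Stand-in for an expensive special-function implementation.
airyish(z) = exp(-z) * sin(z)

using PrecompileTools

@setup_workload begin
    # Values built here only drive compilation; they are not
    # retained in the package image.
    inputs = (1.2, 1.2 + 1.1im)
    @compile_workload begin
        # Each call compiled here has its native code stored in the
        # package's cache file on Julia v1.9+, so the first runtime
        # call is fast - at the cost of a bigger cache file.
        for z in inputs
            airyish(z)
        end
    end
end

end # module
```

The trade-off described above is exactly that loop: every method instance added to the workload shaves time off the first call but grows the cache file.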

Now, I am also aware that users might only use one function out of hundreds, so they are also paying for all of the other functions and features we are slowly adding. We have moved to a module-based approach (which I guess someone could split into subpackages in the future) that isolates dependencies and functions, more similar to SciPy. I guess we kind of ended up at the first approach, with many submodules that are re-exported in the end :man_shrugging:
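The submodule-plus-re-export layout described above can be sketched like this (all names invented; a real package would put each submodule in its own file via include):

```julia
# Hypothetical mono-repo layout: each function family is isolated in its
# own submodule, and the top-level module re-exports everything.
module FakeBessels

module AiryFunctions
    export airylike
    airylike(x) = inv(1 + x^2)          # placeholder implementation
end

module GammaFunctions
    export gammalike
    # Naive factorial-style placeholder, not a real gamma function.
    gammalike(n::Integer) = n <= 1 ? 1 : n * gammalike(n - 1)
end

# `using` the submodules brings their exports into scope, and exporting
# the names again makes `using FakeBessels` expose them to users.
using .AiryFunctions, .GammaFunctions
export airylike, gammalike

end # module
```

Each submodule can carry its own dependencies and tests while users still see one flat namespace, which is the isolation the post describes.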

4 Likes

I think this is not solely a symptom of monolithic design vs. fragmentation; many common include and using patterns also make source code discoverability challenging, no matter the package size.
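A tiny illustration of the using-pattern problem (the modules here are made up): with a bare `using` of several packages, nothing at the call site says where a name comes from, and exported names can even collide, while an explicit `import` documents the origin.

```julia
# Two made-up "packages" that both export a function named `smooth`.
module PkgA
    export smooth
    smooth(x) = x / 2
end

module PkgB
    export smooth
    smooth(x) = x * 2
end

# With plain `using`, the reader can't tell which module defines `smooth`,
# and since both export it, the unqualified name is not even usable:
using .PkgA, .PkgB
# smooth(1.0)  # error: `smooth` must be qualified

# An explicit import spells out the origin right at the import site:
import .PkgB: smooth
smooth(1.0)  # 2.0
```

The same ambiguity applies to `include`-heavy packages, where a method's defining file is only discoverable by grepping.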

4 Likes

I agree with the sentiment that there are multiple solutions and problems we really want to tackle here, but the hardest one is probably the one that truly does necessitate consolidation.

Personally, the term "consolidation czar" sounds a bit silly, but perhaps its underlying meaning is more pragmatic than it sounds. I don't think a single person should be responsible for this across all of Julia, because I don't believe any single person has expertise across that many areas. On the other hand, discussions on combining efforts often come to a standstill, and having one person who can make a definitive statement can be useful. I'm thinking of a sort of BDFL for each area of expertise who is willing to moderate and make firm decisions at a certain point.

1 Like

55 posts were split to a new topic: Fixing labeled array package fragmentation

Indeed, I actually often find this much easier to dissect and diagnose when it’s multiple separate packages strung together rather than one big package with nested includes.

This is a really great idea. I am somewhat doing that with @dilumaluthge and JuliaHealth this summer at the JuliaHealth BoF we are co-running (shameless plug, come visit!), and one of the major topics will be thinking about interfaces to be commonly used within JuliaHealth while also supporting the development of other packages that could fit under the JuliaHealth umbrella.

5 Likes

Are these going to be broadcast or have some sort of Zoom link?

1 Like

People do discover them… they just discover either all of them or a random draw from them, and most are incomplete and poorly maintained.

Consolidation isn't required, just a signal that something is good enough to rely on - and comes with a commitment to testing. That is what SciML is doing; among the many talents of that organization, hardcore testing and documentation are central.

2 Likes

I wish these two forks could be merged back into one, but I don’t have any ideas for how to make it actually happen.

3 Likes

These are the sorts of decisions I absolutely hate working on. It's mostly bike-shedding until you reach a reasonable conclusion, and then you have to spend the next months to a year explaining over and over to users why they are seeing deprecation warnings and why the special syntax they would prefer wasn't used.

4 Likes

I have experience both in choosing to work together and in preferring to not work together. Perhaps sharing my experience can help shed light on some ways forward.

For the former, let's take ImplicitDifferentiation.jl (gdalle) and ForwardDiffChainRules.jl (ThummeTo) as examples. The core of these two packages started as features in my NonconvexUtils.jl, and they were not yet well tested or documented. The current owners of the two packages above reached out and expressed interest in starting a package for each of those features. I supported that and started contributing to their packages, because why not? The two projects turned out great and ended up with more users, stars, and PRs. Everyone is happy. In this case, the key was the natural coalescing of interests and the willingness to give up a little bit of control over a package in exchange for more dev time being put into it. Today I don't control either of these packages and I am fine with being just a contributor.

The second example is JuliaNonconvex/Nonconvex.jl vs SciML/Optimization.jl. Optimization.jl has more pull because it is a SciML project and follows the standard SciML API. Many people like that; I don't, and that's ok. Both Nonconvex.jl and Optimization.jl had similar development timelines, and I could have chosen to contribute to Optimization.jl and abandon Nonconvex.jl at any point, but I chose not to. My reason is that I wanted the ability to experiment with new APIs, new optimisation algorithms, and AD hacks, and I wanted full control over the package. In this case, I wasn't willing to relinquish control, and my interest in trying new things in Nonconvex.jl did not naturally align with the Optimization.jl devs' interest in following the SciML API. There were more differences between the two packages as well, so I am over-simplifying things a little.

I think if we can learn anything from these two experiences, it's that when we have limited resources, getting together may end up being good for everyone, but no one should feel pressured to contribute to another package instead of starting their own. Having more channels where people with similar interests can reach out and collaborate would be a great thing. Perhaps JuliaCon should not be just a yearly event. If there were a JuliaCon every two months online with different themes, I would be happy to listen in. This may be a way to break silos and get more people to reach out and collaborate.

23 Likes

And that’s totally fine! I think the two are moving in different directions and will complement each other.

The reason for not including Plots.jl is actually technical. There's a redirect on its main page that sends you to the wrong site. It's from a very early version of Documenter, and I just haven't patched it in the gh-pages of Plots.jl yet.

I think in general that's fine, but I also think that in practice people don't really understand the effort and patience required to maintain a package over years. I do see some packages that are nice small personal projects where I'm like "please put this in an org so it's still up to date 4 years from now." Simple things like bumping dependencies, reporting upstream regressions, continuing benchmarks, etc. It's not hard, but we do need to ensure that the satellite projects keep it up. I think some people don't move something into an org because they think it will look better on a CV or publication as a personal project, but… you can publish and we can still maintain it :sweat_smile:. It's this unsexy work - continuing to maintain and improve the documentation, answering on Discourse, and keeping an FAQ that grows over time - that makes a package eventually mature, and we should be a bit more diligent about helping graduate students especially understand this longer-term process.

This is why I’m so adamant that all GSoC projects are in orgs with other maintainers. Some day, maybe 10 years from now, someone may get bored or just busy, and when that day comes, having a structure to train the next maintainer is essential.

But then again, not every great project needs a large org and a lab. A lot of projects are perfectly fine as personal projects with one professor maintaining it for years.

10 Likes

I understand that, but an author can also just add more contributors to the project with merge rights. Worst case, an org can fork a project (or copy-paste it and give credit) if it is abandoned but still important to the org. So there is more than one solution to this problem. Educating people about the maintenance burden is definitely beneficial, though. Maybe everyone who wants to register a package could be asked to watch a video or read an article explaining what it takes to maintain a package long-term.

If there are too many abandoned packages, we may consider having a formal process for unregistering packages from the General registry. Another suggestion to "solve" the fragmentation issue for users is for some people to maintain a new registry. If I maintain a "SuperHighQualityRegistry" containing only packages that I deem to be of high quality, and if people trust my opinion enough, this registry may become popular. Then I can have monolithic documentation for all of the packages in my registry. Creating a new registry in Julia today is almost as easy as creating a new crypto; more people should do it!
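For anyone who wants to try this, LocalRegistry.jl makes the mechanics straightforward. A sketch, with the registry name and git URLs made up:

```julia
# One-time setup of a curated registry with LocalRegistry.jl.
# The registry name and remotes below are hypothetical.
using LocalRegistry

# Creates the registry structure locally and points it at the
# given git remote.
create_registry(
    "SuperHighQualityRegistry",
    "git@github.com:me/SuperHighQualityRegistry.git";
    description = "Only packages I personally vouch for.",
)

# Registering a package you deem high quality (run from an environment
# where that package is available):
# register("SomeVettedPackage"; registry = "SuperHighQualityRegistry")

# Users opt in once, after which `Pkg.add` resolves against it too:
# pkg> registry add https://github.com/me/SuperHighQualityRegistry
```

Pkg happily resolves across multiple registries at once, so a curated registry can sit alongside General rather than replacing it.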

4 Likes