Fixing Package Fragmentation

Taken from a Slack discussion:

I’ve had colleagues comment that fragmentation is currently the only really bad thing about the Julia ecosystem. They miss the numpy/scipy monolithic approach.

Now, for most package ecosystems the bus factor probably follows some kind of power law, but the vibe I get with Julia is that the relationship between bus factor and popularity (and package complexity, cf. NPM micro-packages) is a lot weaker.

I do think one actionable thing might be to more loudly mark certain packages as legacy / unmaintained. Some of these “15 competing standards” for various niches haven’t really been updated in years, but they still add to the cognitive overhead when googling “julia package for X”.

I thought about that, and what tools do we have beyond archiving GitHub repos / adding big warnings to READMEs? Can we add deprecation warnings to the General Registry?

I feel like we almost need a “Consolidation czar” for Julia whose job is just to set standard packages and merge ones that do the same thing

Maybe a good first step towards consolidation would be a slightly higher bar for registering packages. And if one of the maintainers of overlapping packages says “good idea, let’s add it to my package” the registration would be put on hold [temporarily].

Sometimes it’s hard or outright impossible to understand what’s different between similar packages. I’ve opened issues on a few packages asking exactly this; sometimes I got no answer at all (Difference from other packages · Issue #10 · qntwrsm/ProximalMethods.jl · GitHub and Differences from StructArrays · Issue #8 · m-wells/LazyTables.jl · GitHub), and sometimes I got a response but things were still not totally clear (Differences from StructArrays · Issue #106 · JuliaData/TypedTables.jl · GitHub).

Yeah, that’s annoying and wastes effort.

Besides that problem, there’s the related problem of “tiny packages that should be in Base.” Why is InvertedIndices not in Base?
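For context, InvertedIndices.jl is essentially one exported helper, Not, for inverted indexing. A quick illustration of how small the package is (not part of the original post):

```julia
using InvertedIndices  # provides `Not`

v = [10, 20, 30, 40]
v[Not(2)]    # [10, 30, 40]: everything except index 2
v[Not(2:3)]  # [10, 40]: ranges work too
```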

So I thought I’d create a thread for proposals on fixing this problem.

12 Likes

Large packages require better communication and coordination than small packages, which can still be handled by individuals. Some individuals manage to create quite large packages but struggle to maintain them over the long term because they find no other people who are willing or able to help them. Package maintainers would need to spend more time on community building and developer documentation. Maybe the motivation to do this can come from specific “consolidation sessions” at JuliaCon where maintainers of related packages could meet, get to know each other and devise a road map for their particular niche of the ecosystem.

14 Likes

This person should start with car brands, 'cause I can easily recall more than 30 and I am not really into cars. And each brand has multiple models. It is a crazy world out there.

All of these have four wheels etc, apparently serving the same purpose. We should just have a STANDARD CAR. Preferably in STANDARD COLOR.

After this is achieved, we can merge packages. Into, preferably, one giant package (please bikeshed names here).

13 Likes

I think it’s worth looking at the work SciML is doing on this problem — in particular with its monolithic documentation that spans many, many packages:

https://docs.sciml.ai/Overview/stable/

22 Likes

With regards to the specific case where there are multiple standards that do the same thing: I think there’s been a push to encourage creative development of solutions, but there comes a time when we should start consolidating, and people are too busy for (or maybe just uninterested in) the arduous, less interesting work of going through each line of code and merging everything into a single package.

2 Likes

This seems like a pretty flippant response for a common problem people have when trying to catalog, understand, and use the Julia ecosystem. Obviously having a single mega-package is a bad idea. It would also be a bad idea to make every single line of code into its own “Package.” If you’d like to argue the Julia ecosystem is too cohesive and monolithic, or at exactly the right level, I’d be happy to see you make that argument in another thread.

15 Likes

I’ll just copy-paste here my response from the last time you posted on Slack saying that we should consolidate packages because you thought it’d make ChatGPT work better on Julia code:

It’s open source. People create things they find useful. Telling people to stop making packages with overlapping functionality because it’ll benefit LLMs is just absurd.

The second sentence can be substituted with whatever other reason you have now decided that we should stop making, or start deleting, packages.

4 Likes

These conversations pop up every so often. It always comes down to discoverability and usability. Instead of having a consolidation czar, we could just as easily have a discoverability czar.

10 Likes

Can we split the conversation about whether consolidation is good into a different thread?

I’d bet there are a lot of “import numpy, scipy, matplotlib” types of people out there - I’m one of them. From what I’m seeing, SciML is the Julia equivalent of SciPy, at least from a documentation standpoint, and it does a pretty good job of consolidating a bunch of packages into one place where they can be easily discovered.

I’m not aware of an equivalent document for the plotting ecosystem. Matplotlib is the standard in Python, so Python doesn’t have this issue. Plots seems to be the de facto default plotting package, and Makie seems to be up and coming for that position, at least from what can be gauged from the SciML docs; it would be nice to have all the other plotting packages in one document to aid discoverability. I get why Julia Base can’t endorse Plots in its manual, but it should still be clear to a beginner that Plots (or maybe Makie) is the go-to package. Maybe the plotting section in the SciML docs could be expanded for this purpose, but I don’t know if that’s the ideal location for it.

4 Likes

I think it’s part and parcel of a discussion on how to “fix” fragmentation, honestly. There are many tacks that could be taken to “fix fragmentation” here. For example:

  • The General registry should have a higher bar and be more curated.
  • There should be a Cathedral/Bazaar model where there’s a curated registry alongside the general one.
  • There need to be more documentation efforts that unify ecosystems (à la numpy/scipy/SciML), or other curated lists (perhaps in the style of awesome-X or the like).

All of them would “fix” fragmentation, but they are all very different approaches and they would all require significant work and buy-in. For example, PyPI has an even lower bar than Julia’s General registry, but they solve this with monolithic packages. SciML doesn’t use monolithic packages, but instead has monolithic documentation.

It’s worth looking at some previous discussions: How to know which Julia package to trust? and How to know if a package is good?

13 Likes

This kind of feels like two separate issues:

  1. Big packages that are broken up into several small ones, with perhaps a top-level package re-exporting those methods. Most of the time a single contributor or a small team maintains all of these packages. (A minimal sketch of this pattern follows the list.)

  2. Many packages that are trying to achieve similar functionality (plotting, data formats, interfaces, AD) that are mainly developed by different people with only somewhat overlapping user bases.
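To make the first pattern concrete, here is a minimal sketch of an umbrella package built on Reexport.jl. In a real ecosystem the re-exported modules would be its own sub-packages; two stdlibs stand in here so the sketch is self-contained:

```julia
# Hypothetical umbrella package: users only ever write `using MyEcosystem`.
module MyEcosystem

using Reexport  # Reexport.jl provides @reexport

# Stdlibs stand in for the ecosystem's own small sub-packages.
@reexport using Statistics     # e.g. a "core" sub-package
@reexport using LinearAlgebra  # e.g. a "solvers" sub-package

end

using .MyEcosystem
mean([1, 2, 3])  # Statistics' mean, reachable through the umbrella
```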

I must admit I particularly dislike the first approach, where I have to look through many sub-packages to see where methods are defined. In that sense I prefer the monolithic approach taken by SciPy. It sometimes feels like the larger issue I have with Julia packages is that I can read libraries like BLIS and SciPy quickly and understand what is going on and where things come from much more easily than with some Julia packages. In some sense, this (for me) creates a lot of friction when contributing to some of these packages. This could also be part of the reason for the second issue: it can be easier and more rewarding to work on your own packages. I don’t think that is a bad thing!

Now, the ecosystem has split packages up this way mainly as a means to reduce latency (pre v1.9), which has worked very well. I have considered breaking up Bessels.jl into many small sub-packages for different functions (Airy, Bessel, Gamma) but I have been resistant, as it is personally much easier for me to develop in a big mono-repo. I also find it easier to maintain a single set of documentation, a single place to discuss issues, and a single CI setup, and to attract users as well as pool contributors. I don’t think that would hold if all of this were scattered among many different packages. A mono-repo also provides some stamp of implementation quality: if an implementation is in SciPy, I am pretty sure it has some level of quality and testing.

I think the release of v1.9 has opened up many possibilities. I of course realize Bessels.jl is pretty much the perfect example of the advantages of the new native code caching. I have shown this before, but I’ll give a new example with the recent stable release:

```julia
# Version 1.8.2 (2022-09-29)
julia> @time @eval using Bessels
  0.030072 seconds (45.18 k allocations: 4.904 MiB)

julia> @time @eval airyai(1.2 + 1.1im)
  0.528501 seconds (962.46 k allocations: 48.968 MiB, 2.77% gc time, 99.91% compilation time)

# Version 1.9.0 (2023-05-07)
julia> @time @eval using Bessels
  0.030401 seconds (34.75 k allocations: 3.267 MiB)

julia> @time @eval airyai(1.2 + 1.1im)
  0.000290 seconds (62 allocations: 3.125 KiB)
```

Pre v1.9 I was seriously weighing the cost to users of adding new functions in each version, in terms of package load time and time to first function evaluation. Now I am really just considering how many functions I should explicitly precompile and their effect on cache file size. I believe a lot of effort in v1.10 and beyond might go into reducing these file sizes.
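For anyone curious what “explicitly precompile” looks like in practice, here is a minimal sketch using PrecompileTools.jl; the module and function are made up, and a real package would list its actual hot entry points in the workload:

```julia
module TinySpecialFunctions  # hypothetical package

using PrecompileTools

airyish(z) = exp(-z) * z^2  # stand-in for a real special function

@setup_workload begin
    z = 1.2 + 1.1im  # setup code runs at precompile time but is not cached
    @compile_workload begin
        # Calls in here are compiled during precompilation and stored in the
        # package image, so the first call after `using` is fast; each extra
        # call signature grows the cache file.
        airyish(z)
    end
end

end
```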

Now, I am also aware that users might use only one function out of hundreds, so they are also paying for all of the other functions and features we are slowly adding. We have moved to a module-based approach (which I guess someone could split into sub-packages in the future) that isolates dependencies and functions, more like SciPy. I guess we kind of ended up at the first approach anyway, with many submodules that are re-exported in the end :man_shrugging:
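Roughly, that module-based layout looks like the sketch below. The submodules are inlined (and trivial) here so the example is self-contained; in a real package each would live in its own file pulled in via include(), and this is not Bessels.jl’s actual tree:

```julia
module MonoPkg  # hypothetical mono-repo package

# Each submodule isolates one family of functions and its dependencies.
module Airyish
    export airyish
    airyish(z) = exp(-z)  # stand-in for a real implementation
end

module Gammaish
    export gammaish
    gammaish(n) = factorial(n - 1)  # stand-in, integer n only
end

# Pull each submodule's exports into the top-level namespace...
using .Airyish, .Gammaish

# ...and re-export them, so users only ever write `using MonoPkg`.
export airyish, gammaish

end
```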

4 Likes

I think this is not purely a symptom of monolithicness / fragmentation; the fact that many common include and using patterns make source code discoverability challenging (no matter the package size) plays a role too.
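As a small illustration (the modules are hypothetical; only LinearAlgebra is real): an unqualified using hides where a name comes from, while an explicit import records it at the use site.

```julia
module HardToTrace
# include("core.jl")    # in a real package, any of several included
# include("solvers.jl") # files might be the one that defines `solve`
using LinearAlgebra     # unqualified: is `norm` defined here or there?
end

module EasyToTrace
using LinearAlgebra: norm  # explicit: `norm` demonstrably comes from LinearAlgebra
end
```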

4 Likes

I agree with the sentiment that there are multiple solutions and problems we really want to tackle here, but the hardest one is probably the one that truly does necessitate consolidation.

Personally, the term “consolidation czar” sounds a bit silly, but perhaps its underlying meaning is more pragmatic than it sounds. I don’t think a single person should be responsible for this across all of Julia, because I don’t believe any single person has the expertise across that many domains. On the other hand, discussions on combining efforts do come to a standstill, and having one person who can make a definitive statement can be useful. I’m thinking of a sort of BDFL for each area of expertise who is willing to moderate and make firm decisions at a certain point.

1 Like

55 posts were split to a new topic: Fixing labeled array package fragmentation

Indeed, I actually often find this much easier to dissect and diagnose when it’s multiple separate packages strung together rather than one big package with nested includes.

This is a really great idea. I am somewhat doing that with @dilumaluthge and JuliaHealth this summer at the JuliaHealth BoF we are co-running (shameless plug, come visit!), and one of the major topics will be thinking about interfaces to be commonly used within JuliaHealth while also supporting the development of other packages that could fit under the JuliaHealth umbrella.

5 Likes

Are these going to be broadcast or have some sort of Zoom link?

1 Like

People discover them… They just discover either all of them or a random draw from them, and most are incomplete and poorly maintained.

Consolidation isn’t required, just a signal that something is good enough to rely on - and a commitment to testing. That is what SciML provides: among the many talents of that organization, hardcore testing and documentation are central.

2 Likes

I wish these two forks could be merged back into one, but I don’t have any ideas for how to make it actually happen.

3 Likes