Fixing labeled array package fragmentation

I think documenting that Makie+Plots are the most popular plotting packages would be great :smile: The plotting ecosystem is pretty good in this regard though–there’s ~2 major plotting packages.

I’m more worried about autodiff or array labeling, where there’s a dozen packages doing similar things, with half the features I need, and none of them are compatible. I’m not as familiar with the graph ecosystem, but from the thread it sounds like it has a similar problem.

This is also why I don’t think discoverability is a true substitute. I think the discoverability of Julia packages is great–JuliaHub has a comprehensive list that lets you sort or filter by stars!

There’s two separate problems here.

Too many things to look up

Even if each package is easy to find, this doesn’t mean it’s easy to find all the packages I need. Imagine breaking SplitApplyCombine into 3 packages (one for each functionality)–this would be a pretty bad idea! If you’re doing data wrangling, you probably want all 3 functionalities. Even if all 3 packages are easy to find, that’s still 3 times as many things to look up. Sklearn takes the extreme position of trying to bundle literally all of ML and statistics into one package, but the opposite extreme (imagine if GLM.jl had a separate package for each model!) is also a problem.

Redundancy

Package redundancy creates a lot of problems:

  1. User questions/help are split across a bunch of different repos, and I can’t apply most of the advice across packages, so I can’t find help when I need it.
  2. Documentation and tutorial effort is duplicated.
  3. I have no idea what users are going to want to use together with my package, which makes development hard. (Should I make my code interface with DimensionalData, AxisKeys, AxisArrays…)?
  4. If each individual package is rarely-used, there’s fewer people finding bugs. Code stays buggy and we end up with people giving up on the language because the ecosystem is underused.

Unfortunately, there’s always going to be a tradeoff here that we can’t magic away. Some parts of the ecosystem are in the right part of that tradeoff, but others aren’t, and I’m hoping we can fix the latter.

3 Likes

In the past I’ve tried very hard to help move the conversation related to your 3rd point on redundancy forward. I’ve lost a lot of steam on that between some personal stuff I’ve had to deal with and the general lack of buy-in from others on committing to a single standard.

Maybe it would help to have someone firmly make some decisions we all conform to, but realistically you’re describing a problem space that no language has actually solved. All other applications people bring up from other languages make compromises so people can just get work done eventually. Unless someone is hired full time to solve this very niche issue, I’m not sure how to solve it without doing the same in some consolidation effort.

3 Likes

No language has solved it perfectly, but some languages mitigate it better than others. From what I can tell, Lisp actually ended up mostly solving this issue when people moved from Common Lisp to Clojure and Racket.

In the past, the solution I’ve suggested is adding array-labeling functionality directly to Base to break the impasse. A solution could also come in the form of the developers for the 3 biggest packages in this area (AA, DD, and AxisKeys) agreeing on which of these packages is going to be “the” single labeling package, and merging their contributions into one of them.

As the person who initially created (and then unfortunately abandoned, or rather passed along, I’m sorry everyone) AxisArrays, I can tell you with absolute certainty that doing array labeling in Base would be the worst thing for having a canonical solution for array labeling. In my opinion, ND array labeling is a much harder and less-well-defined space than data frames, and it takes a lot of very hard work to get it right. Unlike DataFrames, ND-array labeling has long been siloed into hyper-opinionated domain-specific packages (of which my initial take on AxisArrays was one) in every language ecosystem I’ve touched. Unlike DataFrames, there hasn’t been a “solid” convention/API/standard to build from… at least not until recently. It’s only really in recent history that Python folks have started converging on xarray, for example. Can you imagine if we had demanded that DataFrames be incorporated into Base even just a year ago? Its development would’ve ground to a halt, even though there have long been exceptional reference dataframe-like things from other languages to inspire its development, and even still they continue iterating at a much faster pace than Julia itself.

I think what you’re seeing isn’t so much fragmentation as it is exploration and the very hard work of trying to build something that works — sometimes that’s necessarily fragmented and opinionated, and sometimes it’s just works in progress. And that’s true throughout pretty much all of your examples.

29 Likes

If you’d like to put it somewhere other than Base, that’s fine, but I think the very first sentence of your reply really sums up the entire problem I’m talking about here.

If I have to choose between xarray (well-tested, well-documented, lots of support, stable, but an ugly interface) and DimensionalData (the opposite), I’m going to pick xarray every day of the week. Not just that, I’m frequently forced to pick Python over Julia for this single specific feature (a well-tested array labeling package that “just works” with PyMC).

Sure, go ahead, have works in progress if you’d like. But I can’t use a work in progress. At some point I need to actually use something, and that requires a stable and reliable standard. This is much more important to me than the difference between looking up elements with x[thing=2] and x(thing=2). That’s what Base is for, and if we can’t settle on a single alternative outside of it, just give me a basic implementation in Base.

3 Likes

And that problem is not fragmentation. It’s bandwidth, time, energy, and money. Xarray is a Sponsored Project of NumFOCUS with dedicated grant and development funding since 2020.

12 Likes

It was stable and working well long before 2020; it’s been a standard for longer than that.

1 Like

Would it be better to have a single “array labeling” package everyone uses and likes? Sure!

But is it required for efficiently working with labeled arrays already? Nope!
Both “modern” packages, AxisKeys and DimensionalData, have lots of users successfully utilizing them already. There is even AxisArrayConversion to easily convert between all three (those + older AxisArrays).

How does the current situation actually prevent one from utilizing this existing functionality?

7 Likes

The standard economies of scale–if people are using the same package, that means they’re asking the same questions and getting answers I can use, and I can focus on the same implementation (instead of having to verify my code works with every specific implementation). Probably the most important thing, though, is I can safely assume that most major bugs have been found and fixed. If economies of scale weren’t real, most people would’ve started using Julia a long time ago.

1 Like

I would say cvxpy is more or less the “centralized” convex optimization package for Python, and yet a little while ago I hit a bug in its interface to one of the solvers, which Julia could call just fine.

When I read these takes from you it makes me feel like you haven’t spent that much time in the Python ecosystem, since I promise you will encounter bugs all the time. Even using xarray just a few months ago I hit a bug where open_mfdataset became unusably slow in certain contexts.

1 Like

The problem with development in Julia is that once you aren’t recreating some straightforward numeric algorithm you are in uncharted territory. You could probably spend a long weekend doing an implementation of xarray in Julia, but then why are you using Julia? It’s kind of like when people ask why we don’t just use transpilers to and from Python. The way many packages from other dynamic languages are written just doesn’t translate to code that communicates well with the compiler and the user.

Put another way: these other packages get to a certain point where developers can just agree that they’ve squeezed every bit of performance they can reasonably get out of it without switching to another language. Then they start siloing off parts of the package into something completely static, and you can’t really interact with those parts. The more generalized your application, the more difficult it is to get Julia code to the point where it’s clearly fully optimized and ready for v1.0. For example, table data seems like a fairly well-defined data structure to develop, but we still have separate packages for the situation that involves interactive data manipulation with many type instabilities (DataFrames.jl) and for small, well-defined tables (TypedTables.jl).
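To make that trade-off concrete, here is a rough Base-only sketch. The two containers are toy stand-ins I made up for illustration (a `Dict` of columns and a `NamedTuple` of columns); they are not how DataFrames.jl or TypedTables.jl are actually implemented, and `xcol` is a hypothetical helper.

```julia
# Toy stand-ins only; neither is the real DataFrames.jl or TypedTables.jl layout.

# Flexible table: columns can be added or retyped at runtime, so the
# compiler only knows that a column is "some AbstractVector".
flexible = Dict{Symbol,AbstractVector}(:x => [1, 2, 3], :y => [1.0, 2.0, 3.0])
flexible[:x] = ["a", "b", "c"]          # retyping a column is allowed

# Fixed-schema table: column names and element types are part of the type.
fixed = (x = [1, 2, 3], y = [1.0, 2.0, 3.0])

# Hypothetical accessor, one method per container.
xcol(t::AbstractDict) = t[:x]
xcol(t::NamedTuple) = t.x

# Ask type inference what each accessor returns.
flexible_ret = only(Base.return_types(xcol, (typeof(flexible),)))
fixed_ret = only(Base.return_types(xcol, (typeof(fixed),)))

println(isconcretetype(flexible_ret))   # is the flexible column type concrete?
println(isconcretetype(fixed_ret))      # is the fixed-schema column type concrete?
```

The flexible container is convenient interactively, while the fixed one lets the compiler specialize on the column types. Getting both from a single design is the hard part.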

2 Likes

I fully understand that all ecosystems have bugs, but I run into them less often in Python than in the Julia ecosystem (part of the reason I, and many other Julia devs, find myself having to reach for Python more often than Julia despite liking Julia’s features better).

This is going off-topic, though. If you’re more interested in talking about my Python experience than the topic at hand feel free, but please do so on the Slack or in another thread. This thread is specifically for addressing ways to reduce package redundancy and the number of problematically small packages. If you think that number is exactly 0, that’s a perfectly fine opinion, but for anyone who thinks that number is positive, I’d like for it to be possible to have a useful discussion here.

1 Like

Pray do tell where to find a package that has zero bugs! The example of cvxpy was brought up: it currently has 175 open issues. If you find one with zero, I can tell you it hasn’t been used much.

2 Likes

most major bugs have been found

Actually, tables are a kinda special example. They are included in Base, implicitly.
That is, one doesn’t need any extra types to create and use tables, just arrays-of-namedtuples are enough and work perfectly fine for lots of cases (the majority? aside from very wide tables, I guess). Even loading Tables.jl itself in user code is optional, needed mostly for interoperability with other tabular packages.
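As a small illustration of what I mean, here is a sketch using only Base (the data and variable names are made up):

```julia
# A "table" built only from Base types: a Vector of NamedTuples.
rows = [
    (name = "a", x = 1, y = 2.0),
    (name = "b", x = 2, y = 3.5),
    (name = "c", x = 3, y = 5.0),
]

# Row-wise filtering and column extraction use ordinary array functions.
big = filter(r -> r.x >= 2, rows)
ys = [r.y for r in big]

# Aggregation is just a reduction over the rows.
total = sum(r -> r.x * r.y, rows)
```

No package is loaded here. Tables.jl would only come in when passing `rows` to another tabular package.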

The same isn’t possible for keyed arrays. I admit that even as a regular user of AxisKeys, I don’t really grasp all main differences/decisions/tradeoffs between the three packages — I just chose the modern one with the most lightweight data structure.
Would be nice if their authors got together and decided on a uniform interface. But again, this doesn’t prevent anyone from using any keyed arrays package already, they all are reasonably popular.

3 Likes

I have fatigue from these threads encouraging other people to go and change the ecosystem somehow.

People who care about fragmentation (or Julia being better at LLMs etc… insert cause of the day here), I totally agree, it really is a problem. Here are some ideas for you:

  • go and read the new packages feed, and make helpful comments where you think actual specific packages are similar
  • find two packages you care about that should be one package and do the work of making that happen, or learn why they actually are just different things
  • write up a document about similar packages and their dependencies and how they could integrate - being careful to find where the pain points and work will be. share it with the relevant devs to make their work easier.

The key point to keep in mind is merging packages will only happen if it somehow also helps the people who need to do the work, and they can clearly see how it will help them.

For example, we are migrating YAXArrays.jl to be DimensionalData.jl/Rasters.jl compatible. That’s one less named array tool in the ecosystem, and it’s happening right now.

But this is a bunch of actual work, because of course not all the concepts line up. It’s only happening because devs and users see their lives being easier in the future using DimensionalData.jl. AxisKeys.jl devs might not need so much from DimensionalData.jl, so I’m not sure why they should do the work.

It’s the same for anything else that needs to change. Some person or group needs to do some actual work and be rewarded somehow or other for their time, by enjoying it, by being paid, or by improving their own projects somehow.

27 Likes

Two major examples with fragmentation I’ve seen here are graphs and keyed arrays. Can the success of Tables be replicated there? Ie, not necessarily “a single package defining the type everyone uses”, but a lightweight interface package all relevant types implement.

I wonder if the Tables ecosystem creators could share whether they have ideas on how best to approach that. I don’t know about graphs, but there was an attempt to define a general keyed-array interface in ArrayInterface, which for some reason didn’t really work out.
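To show the kind of thing I have in mind, here is a minimal sketch. Every name in it (`dimnames`, `axiskeys`, `lookup`, `ToyKeyed`) is hypothetical and invented for this post; it is not the ArrayInterface design or the API of any existing package. The idea is that generic code is written against two small accessor functions, and each array type opts in by implementing them.

```julia
# Hypothetical interface: none of these names come from an existing package.

# Fallbacks so plain arrays participate: generated names and integer keys.
dimnames(A::AbstractArray) = ntuple(i -> Symbol(:dim_, i), ndims(A))
axiskeys(A::AbstractArray) = axes(A)

# Generic label lookup written only against the two interface functions.
function lookup(A::AbstractArray; selectors...)
    nms = dimnames(A)
    ks = axiskeys(A)
    idx = ntuple(ndims(A)) do d
        haskey(selectors, nms[d]) || return Colon()   # unselected dims stay whole
        i = findfirst(isequal(selectors[nms[d]]), ks[d])
        i === nothing && throw(KeyError(selectors[nms[d]]))
        i
    end
    return A[idx...]
end

# A toy keyed-array type that opts in by implementing the two accessors.
struct ToyKeyed{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
    names::NTuple{N,Symbol}
    keys::NTuple{N,Vector}
end
Base.size(A::ToyKeyed) = size(A.data)
Base.getindex(A::ToyKeyed{T,N}, I::Vararg{Int,N}) where {T,N} = A.data[I...]
dimnames(A::ToyKeyed) = A.names
axiskeys(A::ToyKeyed) = A.keys

toy = ToyKeyed([1 2 3; 4 5 6], (:row, :col), (["a", "b"], [10, 20, 30]))
cell = lookup(toy; row = "b", col = 20)
first_row = lookup(toy; row = "a")
plain_col = lookup([1 2 3; 4 5 6]; dim_2 = 3)
```

This deliberately ignores the hard questions (sorted vs unsorted keys, ranges and intervals, what selectors should return), which is presumably where the real design disagreements live.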

I don’t want to get into a discussion on the technical merits here (this thread is already long enough), but since keyed arrays keep coming up, here’s a recent discussion we had in JuMP: [Containers] support ArrayInterface.jl traits · Issue #3214 · jump-dev/JuMP.jl · GitHub.

The tl;dr of it is that despite a “keyed array” sounding like a simple extension to normal arrays, there are actually a bunch of orthogonal design choices that need to be made, and different package authors have come down on different sides of those choices. So defragmenting the ecosystem isn’t a matter of time or effort; there are actually a bunch of similar packages with sufficiently different features that make it impossible to unify.

So I’m not sure we need to all decide on a single set of packages to use, or if trying to find a unified abstraction for keyed arrays is a good use of time (we’ve collectively spent a lot of time on this over the years… to little success). Use whatever is best for you, and if you’re a package author, pick an opinionated set of packages and write good documentation and tutorials teaching people how to use the things you’ve chosen. Popularity/success/momentum will win in the end.

11 Likes

I don’t think the second sentence follows from the first there. The question isn’t whether the packages have differences or made different design decisions (of course they do). It’s whether the packages are different enough that this compatibility headache is worth it for users.

For Tables, I think it is. There’s a lot of reasons to have different kinds of tables that do similar, but not equivalent, operations. For example, time series have a lot of different operations specific to them (e.g. interpolation, common models like autoregressive models, differencing); and StructArrays behave quite differently from DataFrames.

On the other hand, imagine if someone tried to create a completely new language that effectively copied every element of Julia’s semantics, then just made it zero-indexed, added semantic whitespace, and avoided (Breaking compatibility with Julia along the way). This would clearly be a pretty terrible idea. In the same way, every incompatibility among named array packages is effectively bikeshedding. Whether I have to index into arrays using x(1), x[At(1)], or x[1] doesn’t matter, except if one of these breaks compatibility with previously existing code. Imagine if someone tried to do the Python equivalent of creating an entirely new package for array labeling, just because they wanted to use x.sel(1) instead of x.isel(1). It’s exactly this kind of “bikeshedding is resolved by creating a new implementation of the same functionality” that’s the problem.
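To illustrate how thin the syntactic difference is, here is a toy sketch. `At`, `Labeled`, and `bykey` are stand-ins defined right here for illustration; they are not the real types or functions from AxisKeys, DimensionalData, or AxisArrays. All the spellings end up in the same lookup routine.

```julia
# Toy stand-ins, defined here for illustration; not the real package types.
struct At{T}
    val::T
end

struct Labeled{T,K} <: AbstractVector{T}
    data::Vector{T}
    keys::Vector{K}
end
Base.size(x::Labeled) = size(x.data)
Base.getindex(x::Labeled, i::Int) = x.data[i]            # positional: x[2]

# One shared lookup routine ...
function bykey(x::Labeled, k)
    i = findfirst(isequal(k), x.keys)
    i === nothing && throw(KeyError(k))
    return x.data[i]
end

# ... exposed through two different surface syntaxes.
Base.getindex(x::Labeled, sel::At) = bykey(x, sel.val)   # x[At(k)]
(x::Labeled)(k) = bykey(x, k)                            # x(k)

x = Labeled([10.0, 20.0, 30.0], [:a, :b, :c])
by_position = x[2]
by_selector = x[At(:b)]
by_call = x(:b)
```

All three variables hold the same element; only the spelling at the call site differs.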

Actually that’s my gripe for timeseries packages (:
All these operations make total sense for arrays/tables without requiring one dimension to be “time”. There are very few really time-specific operations, most of what’s found in “timeseries” packages is more general.

3 Likes