Fixing labeled array package fragmentation

That’s fair! In some cases, code inside these packages should be upstreamed to let them be more useful, e.g. anything that calculates interpolations. My point was more that the syntax for a time series package has a good reason to be different than code for a generic table. Like, if I call interpolate(df), it’s not at all clear what elements I should interpolate between, and this should probably error. Whereas a user calling interpolate(timeseries) almost certainly wants to interpolate between neighboring points in time.

So in this case, several packages with a common interface makes sense, because time series are a . It’s just like multiple dispatch: it makes perfect sense to dispatch on whether a table is a time series or a dataframe (because there’s lots of methods that make sense only for time series, and time series have a natural ordering).
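To make the dispatch argument concrete, here is a minimal sketch in plain Julia. The types `GenericTable` and `TimeSeries` and the function `interpolate` are hypothetical stand-ins written for this illustration, not the API of any existing package; the point is only that the generic method refuses to guess while the time-series method has an obvious meaning.

```julia
# Hypothetical types for illustration; not taken from any real package.
abstract type AbstractTable end

struct GenericTable <: AbstractTable
    columns::Dict{Symbol,Vector{Float64}}
end

struct TimeSeries <: AbstractTable
    times::Vector{Float64}                  # assumed sorted in increasing order
    values::Vector{Union{Missing,Float64}}
end

# A generic table has no natural ordering, so refuse rather than guess.
interpolate(t::AbstractTable) =
    throw(ArgumentError("interpolate is undefined for $(typeof(t)): no natural ordering"))

# For a time series, fill each missing value linearly from its nearest
# non-missing neighbors in time. Missing values at either end are left as-is.
function interpolate(ts::TimeSeries)
    vals = copy(ts.values)
    known = findall(!ismissing, ts.values)
    for (lo, hi) in zip(known[1:end-1], known[2:end])
        for i in lo+1:hi-1
            w = (ts.times[i] - ts.times[lo]) / (ts.times[hi] - ts.times[lo])
            vals[i] = (1 - w) * ts.values[lo] + w * ts.values[hi]
        end
    end
    return TimeSeries(ts.times, vals)
end

ts = TimeSeries([0.0, 1.0, 2.0, 4.0], [0.0, missing, 2.0, 4.0])
filled = interpolate(ts)   # the missing value at t = 1.0 is filled from t = 0.0 and t = 2.0
```

Calling `interpolate(GenericTable(Dict(:a => [1.0, 2.0])))` hits the fallback method and throws an `ArgumentError`, which is the "this should probably error" behavior described above.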

On the other hand, creating several different packages based on whether you want to index into an array using x(), x[At()], or x[] doesn’t really make sense and just causes headaches for new users.

I think this trivializes the differences between those packages.
As a user of a single keyed arrays package, it’s difficult to tell what the differences actually are. But I’m pretty sure they run deeper than surface syntax for indexing.

BTW, @Raf, I’d be more than happy to work on the PRs needed to merge DimData.jl with AxisArrays and AxisKeys, if I got confirmation from the other packages or their developers that this would actually be accepted.

IME, though, trying to bring up specific examples of packages that should be merged just devolves into yet another round of bikeshedding. That’s why I started this thread; I wanted to discuss policies, governance structures, or tools that would help us avoid this kind of bikeshedding, and make it easier to merge redundant packages. Instead, a handful of users hijacked the thread, despite multiple attempts to keep it on the topic of how we can make the things you outlined easier.

Proposals on the Slack (where this discussion was substantially more productive than on the forum) included:

  1. An automatic system that flags possibly-redundant new packages and informs the maintainers of the conflicting packages, including a brief pause on registration to give them a chance to work this out.
  2. Requiring packages to meet basic standards to be registered in the General registry, like requiring a README, requiring that packages have a designated backup maintainer or else belong to a GitHub org, etc.

Proposals I never got to discuss because this thread was hijacked:

  1. Mentioning Git’s submodules system in documentation for package developers, as a way to reduce the amount of “clutter” in GitHub orgs. This should be included alongside a mention of how to create metapackages like StatsKit.jl.
  2. Designating a small arbitration committee whose opinion package developers could request on this subject (thus my joke about a “consolidation czar,” a role Chris has played quite well), although I’d prefer such decisions to be taken by vote of an elected body.

I’m just as tired of this discussion as you are. I’d like to actually do something to merge some of these packages, instead of just discussing this ad nauseam. Consistently, though, the response I hear back from developers, and especially from this Discourse, is denial that a problem could even exist. (The constant complaints about this problem whenever I try teaching people about Julia suggest otherwise, and suggest that the preponderance of redundant packages only stops being a problem if you’ve been in the Julia community since the very earliest days.) Under those circumstances, I’m not inclined to spend all my time making PRs that I expect will just end up rejected over miscolored bike sheds.

I’ve used all three major array-labeling packages extensively (AxisArrays, AxisKeys, and DimensionalData), and I’ve yet to notice anything that would make these packages incompatible/impossible to merge, except for differences in surface-level syntax. DimensionalData has a couple extra features like DimStacks, but none of these interfere with how AxisKeys works. (I believe there’s something in the design of AxisKeys and AxisArrays that makes it impossible to add all of DimData’s features, but not the other way around). Labeling axes really is just as simple as it sounds, which is the problem: it’s so easy that everyone has rolled their own slightly-different version of this.

As far as I can tell, the main reason all these packages are separate is that they have slightly different syntax for handling the ambiguous corner case where the user-assigned indices are integers.
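For anyone who hasn’t run into that corner case, here is a toy sketch of it in plain Julia. `LabeledVector` and `At` are hypothetical types written just for this post (the `At` name only mirrors the `x[At()]` spelling mentioned earlier); this is not how any of the three packages is actually implemented.

```julia
# Hypothetical minimal labeled vector, written only to illustrate the ambiguity.
struct LabeledVector{T,K}
    data::Vector{T}
    keys::Vector{K}
end

# Positional indexing: the i-th stored element.
Base.getindex(v::LabeledVector, i::Integer) = v.data[i]

# Key lookup via call syntax: the element whose label equals `key`.
function (v::LabeledVector)(key)
    i = findfirst(isequal(key), v.keys)
    i === nothing && throw(KeyError(key))
    return v.data[i]
end

# Key lookup via a wrapper type inside square brackets.
struct At{K}
    key::K
end
Base.getindex(v::LabeledVector, sel::At) = v(sel.key)

# Integer labels that do not match the positions.
v = LabeledVector([10.0, 20.0, 30.0], [3, 1, 2])

by_position = v[1]      # first stored element
by_call     = v(1)      # element labeled 1
by_selector = v[At(1)]  # element labeled 1
```

With integer labels, `v[1]` could reasonably mean either "position 1" or "label 1", and the two answers differ here. Each spelling is one way of making the user say which they mean.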

When teaching, shouldn’t you just tell the students which packages to use and be opinionated about it yourself? If a student chooses to step out of the list you provide, the onus is on them to justify their choice. More generally, you may find my 2 comments here relevant. Personally, I can see that the “problem” exists but I also think that the solution from a user’s point-of-view is to have more people like yourself become opinionated about their favourite packages when teaching or using Julia. Then the most popular packages among teachers and package developers will naturally emerge as the de-facto standard.

I think we are discussing 2 related problems here. First is the problem of new user confusion which can be “fixed” by opinionated teachers. Second is the problem of resource scattering which can be “fixed” by more communication among people of similar interests. Whether this is done at package registration time or at conferences and meetups is an “implementation detail”. We just need to talk more to each other one way or another.

Well, the problem is I can’t be opinionated about which package to use, because everything is incompatible. It’s just not possible to use one (and only one) package. I encourage DimensionalData.jl (the most fully-featured and widely-supported) by default, but then you run into problems as soon as you have to use JuMP, which has its own named array type. Other parts of the ecosystem assume you’re using AxisArrays. And none of these named array types are compatible with the PPL ecosystem, which is sometimes incompatible with any kind of named array, sometimes only compatible with AxisArrays (MCMCChains), sometimes only compatible with AxisKeys (I wrote ParetoSmooth.jl before I knew about alternatives), and sometimes they use DimensionalData.jl (ArviZ).

In the end I’m usually forced to be opinionated about using Python instead, despite my preferences for Julia. In Python you have an end-to-end ecosystem that works with xarrays for any kind of stochastic optimization.

but then you run into problems as soon as you have to use JuMP, which has its own named array type

Again, not to derail this thread, but in that issue I linked to, I gave an example of how to use DimensionalData with JuMP:

Note that you don’t have to use JuMP’s built-in containers. There’s nothing special about them, other than that there’s default syntax for constructing them.

I’m just as tired of this discussion as you are. I’d like to actually do something to merge some of these packages, instead of just discuss this ad nauseam. Consistently, though, the response I hear back from developers –and especially from this Discourse –is denying that a problem could even exist

As one of the developers who has pushed back against unifying things, let me say: thank you for continuing to bang on about this. I agree it’s a problem. I just perhaps disagree that everyone switching to package ABC is the solution.

I think the hang-up with JuMP / higher-level packages choosing smaller dependencies like DimensionalData over AxisArrays over XXX has always been that I dislike adding any dependencies, and so it wasn’t obvious where the JuMP<->DimensionalData code should live. One potential solution is the weak dependencies introduced in Julia 1.9. It seems pretty reasonable that we could add a JuMPDimensionalDataExt.jl extension that would simplify a lot of things. Then JuMP could add integration support for DimensionalData without it being a direct dependency. (The integration is literally a two-line function: [Containers] support ArrayInterface.jl traits · Issue #3214 · jump-dev/JuMP.jl · GitHub)

So it may feel like Groundhog Day, but perhaps we’re closer to a resolution than you might expect.

Just those two? Seems easy to do. The only thing missing would be tests?

I think what’s happening here is that you are one of a few people in the Julia ecosystem using so many cool packages together. Many of these packages were not written with compatibility with the other packages in mind, so their APIs are not compatible with each other’s. The types of the inputs and outputs of a package are a part of its API after all. For instance, most JuMP users are probably happy using JuMP on its own and don’t care about integrating it with Turing or MCMCChains in their workflow. So first of all, let me congratulate you because this probably means that you are doing some very cool work spanning all these cool packages!

However, I think this is also why there might be a bit of a disconnect between the pain you are feeling and the pain that the average JuMP or Turing user feels. This is a case of “package A’s output is not compatible with package B’s input in an A → B workflow”. The solution is glue code like Oscar suggested for JuMP. The glue code can live in package A or B (as an extension) or in a separate workflow package C that connects packages A and B and streamlines the process of using them together.

I think we may need more of these concrete examples of pain points, and I am sure developers of packages A and B will be happy to have you on board both as a user and a contributor. If not, roll out your own workflow package C.

Well, it’s not just mixing packages from different domains. The pain points show up just when using Turing.jl, ParetoSmooth (a TuringLang package), and ArviZ.jl, each of which demands users learn how to operate a different array-labeling package. (Hopefully sooner rather than later I’ll have the time to add anything missing from ArviZ, so I can deprecate ParetoSmooth for it.)

I totally understand how this kind of glue code is a good solution for situations where we have different packages because they’re each tailored for a different use case. On the other hand, if two packages do almost exactly the same thing, that’s bad. It means fewer people to catch bugs, make PRs, or open issues on each package. And more importantly, it means the work of supporting multiple packages has to be repeated for every package in the Julia ecosystem. Maybe it’s two lines of code for JuMP, but I wrote the ParetoSmooth code to work with AxisKeys specifically, and any attempt to make it compatible with DimensionalData would take quite a bit more work than that.

PR to integrate DimensionalData and JuMP: add JuMP extension, update package by longemen3000 · Pull Request #487 · rafaqz/DimensionalData.jl · GitHub

I would be very cautious about recommending such an opinionated package with very involved container structs/types as the “default” choice.
Just checked, and some differences from AxisKeys still stand, same as a few years ago when I chose AK.jl for myself:

  • AK.jl keyed arrays are really more lightweight, just compare the types:
# AK.jl
KeyedArray{Float64, 2, NamedDimsArray{(:x, :y), Float64, 2, Matrix{Float64}}, Tuple{UnitRange{Int64}, UnitRange{Int64}}}

# DD.jl
DimArray{Float64, 2, Tuple{X{DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Y{DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Matrix{Float64}, DimensionalData.NoName, DimensionalData.Dimensions.LookupArrays.NoMetadata}

For one, the former directly uses whatever axis labels you provide, making it easier to understand and extend with custom axis label types.

  • Really, export and widely use common one-letter X, Y, Z identifiers? (DD.jl) This surely cannot be serious?

I’m all for using a single type/package everywhere, but in its current state DD.jl really feels like something crafted for a specific niche. It may have more features, but many use cases just need keyed arrays with axis labels and lookup, nothing more (:

I have no insight into the axis array situation but would like to know if there have been efforts towards an interface package in that area. Something whose abstractions would allow other packages to write their algorithms without knowing the exact backing type. An interface package can add opt in traits for any kind of behavior you might want, and then implementers could choose only to support a subset.
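To sketch what I mean, assuming nothing about how ArrayInterface or any of the existing packages does it: the function names `has_axis_keys`, `axis_keys`, and `parent_data` below are hypothetical, and `KeyedA` / `KeyedB` are stand-ins for array types from two different packages. The generic algorithm at the end is written only against the interface.

```julia
# Hypothetical interface functions; an interface package would own these names.
# Implementers opt in by adding methods for their own array types.
has_axis_keys(::Type) = false                    # trait, false unless opted in
has_axis_keys(A) = has_axis_keys(typeof(A))
axis_keys(A, dim::Integer) = axes(A, dim)        # fallback: plain positions
parent_data(A::AbstractArray) = A                # fallback: the array itself

# Two stand-in backing types with different internal layouts.
struct KeyedA{T}
    data::Vector{T}
    keys::Vector{Symbol}
end
struct KeyedB{T}
    values::Vector{T}
    labels::Vector{Symbol}
end

has_axis_keys(::Type{<:KeyedA}) = true
axis_keys(A::KeyedA, dim::Integer) = A.keys
parent_data(A::KeyedA) = A.data

has_axis_keys(::Type{<:KeyedB}) = true
axis_keys(B::KeyedB, dim::Integer) = B.labels
parent_data(B::KeyedB) = B.values

# A generic algorithm that never mentions a concrete backing type:
# return the key (or position) of the largest element.
function key_of_maximum(A)
    _, i = findmax(parent_data(A))
    return axis_keys(A, 1)[i]
end

a = KeyedA([1.0, 5.0, 2.0], [:x, :y, :z])
b = KeyedB([9, 3, 4], [:p, :q, :r])
plain = [1.0, 5.0, 2.0]
```

Here `key_of_maximum(a)` and `key_of_maximum(b)` return keys, while `key_of_maximum(plain)` falls back to the position. A downstream package would depend only on the small interface package, and each array package could choose which of the optional functions to support.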

There was an attempt to do that in ArrayInterface, but it didn’t seem to spark much interest from the AA/AK/DD devs.

That sounds like something the Turing org should address. In their own org they should be consistent on the preferred approach.

There’s a lot that could be said about what happened there. In short, there was a lot of movement related to fixing invalidations and load times so people could comfortably depend on it, which took precedence over new features. Additionally, I was probably the one pushing that solution the most but had some recurring health issues last year and had to reprioritize my time. I’m still interested in that approach to unifying users’ experience interacting with those things. It’s just taking a bit of time to get back into things.

In case others are interested, there is still movement on this:

Maybe this can serve as an example of why merging packages that appear “similar” is actually hard.

DimensionalData.jl is complicated so that it can represent netcdf, geotiff and similar objects in all (actually just most) of their complexity. AxisKeys.jl can’t represent these, and doesn’t need to at all for its use cases.

Personally, I do not push that DimensionalData.jl should be the main axis array package. It’s clearly over-engineered for the simple case, and still contains a lot of experiments (although AxisKeys does too). It’s maybe only slightly over-engineered for the complicated cases.

For example, a netcdf file can have an irregularly spaced lookup index where the bounds of each pixel along each axis are explicitly specified. I have to represent that exactly, and be able to write it back to disk unchanged from the file it came from. This has to work through spatial subsetting, broadcasts, rotation, whatever. I also need to track the spatial bounds of the object, and that is specifically not the first and last value of the lookup most of the time.

None of the other packages can do these things, because their concept of what an axis/dim is is too simple. The flexibility of DimensionalData.jl also means that ArviZ.jl can quite easily wrap a Python xarray stack.

But, and this is a key point, the other packages don’t need to do these things. They are completely usable and equally good for most other tasks, and clearly simpler and better for a bunch of things.

@aplavin mentioned exporting X and Y being weird. In the spatial sciences (not some tiny niche by the way ;), virtually all of our data has X and Y axes, occasionally + Z and time. So it’s worth having them exported, we are typing that all day. I almost never use other custom dimension names in DimensionalData.

And yes, the types are complicated, probably too much. Although you may do a little disservice by including the submodule scoping in your example… those types are not exported by default for a reason:

DimArray{Float64, 2, Tuple{X{Sampled{Int64, UnitRange{Int64}, ForwardOrdered, Regular{Int64}, Points, NoMetadata}}, Y{Sampled{Int64, UnitRange{Int64}, ForwardOrdered, Regular{Int64}, Points, NoMetadata}}}, Tuple{}, Matrix{Float64}, NoMetadata}

There are definitely a few things there I want to remove, but not that many can be.

If you can represent an ordered or unordered categorical axis, or an ordered sampled axis representing regular or irregular points or intervals (which may have explicit interval bounds, or implicit ones centered on, to the left of, or to the right of the axis values), not to mention axes that need CoordinateTransformations.jl to calculate indices from selectors, then you need to be able to distinguish them from each other and dispatch to different algorithms for selector lookups.
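As a rough illustration of why the axis kind has to live in the type, here is a stripped-down sketch in plain Julia. The names `Categorical`, `RegularPoints`, `IrregularIntervals`, and `index_of` are hypothetical and much simpler than what DimensionalData.jl actually does; they only show one lookup algorithm per axis kind, selected by dispatch.

```julia
# Hypothetical lookup types; far simpler than any real package's internals.
abstract type Lookup end

struct Categorical{T} <: Lookup
    labels::Vector{T}
end

struct RegularPoints <: Lookup
    start::Float64
    step::Float64
    len::Int
end

struct IrregularIntervals <: Lookup
    edges::Vector{Float64}   # sorted; cell i spans [edges[i], edges[i+1])
end

# Categorical axis: exact match via linear search.
function index_of(l::Categorical, value)
    i = findfirst(isequal(value), l.labels)
    i === nothing && throw(KeyError(value))
    return i
end

# Regularly spaced points: compute the index arithmetically, no search needed.
function index_of(l::RegularPoints, value::Real)
    i = round(Int, (value - l.start) / l.step) + 1
    (1 <= i <= l.len && l.start + (i - 1) * l.step ≈ value) || throw(KeyError(value))
    return i
end

# Irregular intervals with explicit bounds: binary search for the containing cell.
function index_of(l::IrregularIntervals, value::Real)
    i = searchsortedlast(l.edges, value)
    (1 <= i < length(l.edges)) || throw(KeyError(value))
    return i
end

cat_axis = Categorical([:a, :b, :c])
reg_axis = RegularPoints(0.0, 0.5, 5)               # points 0.0, 0.5, 1.0, 1.5, 2.0
irr_axis = IrregularIntervals([0.0, 1.0, 4.0, 10.0])
```

The same selector value goes through three different algorithms depending only on the lookup type, which is why those extra type parameters end up in the array’s signature.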

If you don’t need those things, why would you want anything more than ranges as lookups, or want to work around the extra code and ridiculous amount of tests required to support them?

But without them I can’t use AxisKeys.jl or AxisArrays.jl for my daily work, without a massive addition of functionality that likely would not be accepted in PRs.

So we have multiple packages.

Yet at least DimensionalData.jl is actively supported and developed.

A lot of these packages fall by the wayside over time.

Abandonware

AxisArrays.jl
AxisSets.jl
AxisIndices.jl

Glacialware
Not abandoned, but fixes and enhancements occur at a glacial pace.

AxisKeys.jl
NamedDims.jl

Active

ArrayInterface.jl
DimensionalData.jl
NamedArrays.jl

Can’t it? Sure, not out of the box, but it’s reasonably extensible, even though this part is barely documented. For example, SkyImages.jl equips AxisKeys with celestial coordinates that can be defined in different ways: FITS WCS or Healpix.
Any reasonable AbstractVector can serve as axis labels there, and that’s lots of potential functionality!

Don’t see why it shouldn’t be possible with AxisKeys.

Interesting, I would think these are better as lat/lon for earth. In astronomy, I most often encounter ra/dec for axis names.
But the issue I raised isn’t in default naming — it’s exporting and widely relying on these names in the global scope.

That’s just how it prints, no editing! (:

Just define whatever special arrays are needed for these cases, and use them as axis labels when necessary! Otherwise, regular ranges or simple vectors are totally fine and enough for lots (arguably, the vast majority) of use cases.
AxisKeys supports any AbstractVector as labels, and doesn’t convert/wrap them in extra internal types.

I imagine an ideal picture is lightweight LabeledArrays + specialized LabeledArraysGeo/Astro/… . Then, everything interoperates and has the same interface, but users in each niche are free to add more on top.
