Fixing labeled array package fragmentation

@aplavin isn’t a Named array dev :wink:

Don’t see why it shouldn’t be possible with AxisKeys

Because not every layer has to have the same dimensions, just a subset of the total. Then every operation has to be able to work over layers of mixed dimensionality, and you need something like broadcast_dims, which AxisKeys.jl doesn’t have.
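To make the mixed-dimension point concrete, here is a rough Base-only sketch of what broadcasting by dimension name involves. The helper `broadcast_by_name` and the dimension names are hypothetical, made up for illustration; this is not how DimensionalData.jl implements broadcast_dims, and it assumes the smaller layer's dimensions appear in the same relative order as in the larger one.

```julia
# Hypothetical helper for illustration only (not DimensionalData.jl's broadcast_dims).
# `a` carries all dimensions; `b` carries a subset of them, named by `bdims`.
function broadcast_by_name(f, a::AbstractArray, adims::Tuple, b::AbstractArray, bdims::Tuple)
    length(adims) == ndims(a) || throw(ArgumentError("adims must name every dimension of a"))
    length(bdims) == ndims(b) || throw(ArgumentError("bdims must name every dimension of b"))
    # Where does each dimension of b sit in a?
    positions = map(d -> findfirst(==(d), adims), bdims)
    any(isnothing, positions) && throw(ArgumentError("b has a dimension that a lacks"))
    issorted(positions) || throw(ArgumentError("dimensions of b must follow the order used in a"))
    # Give b a singleton axis for every dimension of a that it does not have.
    shape = ntuple(length(adims)) do i
        j = findfirst(==(adims[i]), bdims)
        j === nothing ? 1 : size(b, j)
    end
    return f.(a, reshape(b, shape))
end

temperature = [10.0 20.0 30.0; 40.0 50.0 60.0]  # dims (:x, :time)
offset = [1.0, 2.0]                             # dims (:x,)
scale = [1.0, 10.0, 100.0]                      # dims (:time,)

shifted = broadcast_by_name(+, temperature, (:x, :time), offset, (:x,))
scaled = broadcast_by_name(*, temperature, (:x, :time), scale, (:time,))
```

The `:time`-only layer is the interesting case: plain `temperature .* scale` would fail on the shape mismatch, so something has to align axes by name first.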

Interesting, I would think these are better as lat/lon for earth.

You would if you didn’t use spatial data very much. The problem is a lot of the time your units are not lat and lon, so it’s pretty confusing to do that. We use projections. We also use rotations, so…

That’s just how it prints, no editing! (:

If you don’t import the submodules :wink: And you were arguing about type complexity, not printing length.

Just define whatever special arrays are needed for these cases, and use them as axis labels when necessary! Otherwise, regular ranges or simple vectors are totally fine and enough for lots (arguably, the vast majority) of usecases.

Not really… there are some things, like knowing the order of a vector and that it’s actually sorted, that let you use searchsortedfirst appropriately, or fall back to findfirst for unordered. Do you want to check sorting on every lookup? DimensionalData.jl checks that on construction.
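A minimal sketch of the check-once idea, using only Base. The type `OrderedLookup` and the function `index_of` are hypothetical names for illustration, not DimensionalData.jl's actual lookup types or API; the sketch also only handles ascending order.

```julia
# Hypothetical type for illustration; not DimensionalData.jl's actual lookup types.
struct OrderedLookup{V<:AbstractVector}
    values::V
    sorted::Bool  # computed once, at construction
end

# Pay for the issorted scan a single time instead of on every lookup.
OrderedLookup(values::AbstractVector) = OrderedLookup(values, issorted(values))

function index_of(l::OrderedLookup, key)
    if l.sorted
        # Binary search is only valid because sortedness was verified up front.
        i = searchsortedfirst(l.values, key)
        return (i <= lastindex(l.values) && l.values[i] == key) ? i : nothing
    else
        # Unordered keys: fall back to a linear scan.
        return findfirst(==(key), l.values)
    end
end

sorted_keys = OrderedLookup([10, 20, 30])
unsorted_keys = OrderedLookup([30, 10, 20])
```

Both `index_of(sorted_keys, 20)` and `index_of(unsorted_keys, 20)` give a valid index, but only the first one takes the binary-search path.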

I imagine an ideal picture is lightweight LabeledArrays + specialized LabeledArraysGeo/Astro/… . Then, everything interoperates and has the same interface, but users in each niche are free to add more on top.

In your imagination any code can exist, my whole point is someone has to bother to write it and make it work for everyone and that’s actually work - are you doing that?

I’d be very happy with that solution. If there’s extra functionality in DimensionalData.jl that makes the package too heavy, I’d have no objections to a “GeoArrays” package that enhances DimensionalData. Ironically, I think there’s several packages doing something similar. It’s good to have separate packages for separate functionality. The problem is having half-a-dozen implementations that duplicate basic utilities like array labeling.

That being said, I don’t think there’s much difference in how lightweight these packages are anymore. Time to first array is ~0.5 seconds for both of them:

AxisKeys
julia> @time using AxisKeys; x = randn(3); KeyedArray(x; height=[:a, :b, :c])
  0.475163 seconds (577.99 k allocations: 40.207 MiB, 3.88% compilation time)
1-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   height ∈ 3-element Vector{Symbol}
And data, 3-element Vector{Float64}:
 (:a)   0.9580481487161372
 (:b)  -0.3443816889125824
 (:c)  -0.6400398065084122
DimensionalData
julia> @time using DimensionalData; x = randn(3); DimArray(x, (; height=[:a, :b, :c]))
  0.609328 seconds (1.07 M allocations: 55.942 MiB, 3.26% gc time, 3.00% compilation time)
3-element DimArray{Float64,1} with dimensions: 
  Dim{:height} Categorical{Symbol} Symbol[:a, :b, :c] ForwardOrdered
 :a  -0.958091
 :b   0.272903
 :c  -0.255594

Actually, I suggested DimensionalData for ArviZ.jl specifically on the basis that it seemed more lightweight. And it was, at first–but then DimensionalData added all the functionality in AxisKeys, because we needed it for ArviZ and everything else! The use cases for these packages overlap so much that whenever there’s a feature missing from one, it’s quickly added to the others.

Features should be broken up across different packages when there’s very little overlap in users of the two features. In this case, the two packages started off different, but ended up converging to almost-identical feature sets because of the heavy overlap in the required functionality; I think that’s very strong evidence that these two packages are trying to do the same thing. By way of comparison, I don’t think anyone’s proposing to add time series methods to DataFrames.jl.

There’s other differences for sure, like exporting X, Y, Z, and Ti. But given this convergence, I’m doubtful that maintaining two completely parallel ecosystems for array-labeling is a better solution than just exporting slightly fewer names. (If this is common in spatial sciences, maybe GeoStats.jl should export them instead?)

Rasters.jl could totally export X/Y/Z instead of DimensionalData.jl (GeoStats.jl is an unrelated package that doesn’t use any of this)

But the indexing won’t go in Rasters.jl because other spatial packages need NetCDF support too (like the ability to have it, not the dep)

I think that functionality is also in AxisKeys.jl, via AcceleratedArrays.jl.


Yet? (: // hopefully not

That’s exactly the nice usecase for allowing arbitrary arrays without converting/wrapping them!
User passes regular vector? findfirst(==(x), xs) does linear search.
Range? findfirst(==(x), xs) finds the index analytically.
Sorted array (eg UniqueVectors.jl or AcceleratedArrays.jl)? findfirst(==(x), xs) does binary search.
That’s what Julia and its composability gives you, basically for free. No special casing in the labeled arrays package. And that’s what AxisKeys.jl does.
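A small Base-only sketch of that dispatch argument. The wrapper `key_index` is a hypothetical name, not an AxisKeys.jl function. Which inputs actually hit a fast path depends on the Julia version and the array type: Base defines specialized `findfirst` methods for some integer ranges, and the sorted-array case relies on the external packages named above, so it is not shown here.

```julia
# Hypothetical one-line lookup: the labeled-array side never inspects the key type.
key_index(keys::AbstractVector, key) = findfirst(==(key), keys)

vector_keys = [10, 30, 20]  # plain Vector: generic linear scan
range_keys = 10:10:30       # StepRange: Base can compute the index from first/step

from_vector = key_index(vector_keys, 30)
from_range = key_index(range_keys, 30)
missing_key = key_index(range_keys, 25)  # not on the range's grid
```

The call site is identical in every case; the method that runs is chosen by the type of the keys the user passed in.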


By “lightweight” I don’t mean TTFX or even number of dependencies — personally, I rarely restart julia and just don’t care much about TTFX while it’s below a few seconds.
My point was about actual types being lightweight and straightforward, without fancy wrappers on every level.


I’m getting deja vu from my discussions with the Mojo devs :smile: (“Listen, Mojo is totally different from Julia and the two approaches could never be combined, because we need [feature that’s already in Julia]”)

Isn’t AxisKeys a wrapper around a wrapper around an array, with lots of types parametrized at each level? I didn’t find the types in DimensionalData any more complicated than those in AxisKeys when I worked with them–actually, I found DimensionalData’s types a bit easier to work with.

Just compare these:

AK is clearly fewer wrappers and fancy types – none aside from the keyed + named array itself.

If you just look at the length of the type signature in characters, I guess it looks like that. But the only real difference here is that KeyedArray is displaying it a bit more compactly. The information that DimArray is displaying (Regular, ForwardOrdered) is provided by the type signature UnitRange.

The number of layers/conceptual complexity here is the same; DimensionalData just wraps a vector for each dimension, so the type signature shows up multiple times, while AxisKeys wraps one array in another array.

There might be a way to slightly reduce the complexity of DimensionalData here by trying to offload a bit of the work onto other packages, but it’s definitely not a huge deal whether you have an extra type for dimensions or not.

It’s definitely possible to use the passed arrays as they are, without wrapping them. So why do that, complicating their types and the overall data structure?
AK is just more transparent here, doesn’t introduce concepts that aren’t necessary, and facilitates interoperability between packages.

I think I could say the same thing for AK wrapping "KeyedArray"s around a “NamedDims” array :sweat_smile:

But how much does having this one extra type in the type tree matter? Enough to justify a completely separate package that reimplements the same functionality from scratch? Does having 2 packages for this really reduce the amount of complexity relative to just living with having an extra wrapper for dimensions?

It can’t be disastrous. Pandas/NumPy/xarray all follow the same approach as DimensionalData.jl (they have a separate type for indices) and I don’t think anyone’s really complained about it.

Apologies, I don’t have time right now for a lot of back and forth on this, but @Raf’s description was fairly accurate. I’d only add that AxisKeys.jl may superficially seem like the right solution to users, but it has some problems under the hood, because the semantics of how we use these arrays need to translate to the internals used for indexing in a flexible, performant, and intuitive way. These have been discussed over the years, and it would take a lot of time to dig up all the links and fully present them to newcomers. It’s somewhat akin to the conversation of “why did you make Julia instead of just fixing Pythong?”. When you spend enough time trying to solve the problem, you realize there are some things that just can’t be made flexible, performant, and intuitive to users. It’s going to take a lot of dedicated work to find one solution that works. Some of it still requires adding things to Base. Unless someone makes it their full-time job to figure this out over the next year, don’t expect this to be solved by a single package anytime soon.

With that said, we can still have stable interfaces around this. Having a common interface for accessing labeled indices and named dimensions lets package authors guarantee a subset of behaviors without the risk of breaking changes in the future.

Are there any problems with DimensionalData?

Also I can’t stop thinking of a snake in a thong now


I think that it was already expressed by the author quite concisely. It’s overengineered and is not a good fit for a lot of other use cases.

BTW. I’m leaving that typo in just for you :grin:

If DimensionalData.jl is overengineered, that makes me think the bar for “Overengineered” is set low enough that even a Pythong couldn’t fit under it

No, it couldn’t. Those solutions wouldn’t work for a lot of use cases either. Just because people make do with the solution they’re given doesn’t mean it’s the right solution.

Since this thread got split from the main one (thanks @mbauman), I’ll just note that one concrete thing to come out of this discussion is @longemen3000’s PR to JuMP:

We had a call with the JuMP developers a few hours ago, and the conclusion was a tentative “yes” to accepting more extension packages like this.
