Fixing labeled array package fragmentation

What are these “lots of use cases” that DimensionalData.jl doesn’t work for, and does AxisKeys or any other actually-existing alternative work for those use cases?

Each of these two wrappers adds functionality explicitly requested by the user: array should be “named” and “keyed”. Nothing is redundant here, even though these two wrappers could potentially be merged into one.
Meanwhile, stuff in DD like NoMetadata, ForwardOrdered, Points, … is either implicit and not requested in any way, or redundant with something else in the type.

Enough to choose an alternative that does fine with cleaner structures and direct composability.
Same with Python: if Julia didn’t exist, I would use Py and generally be fine. But Julia does exist, and solves certain problems in a more straightforward and composable way.

Didn’t really have time to use xarray, but could you elaborate on numpy: what special index type does it have?
Also, Python is generally a poor guide for composable solutions. Just look at units and uncertainties there! If xarray defines its own label types, it’s just because Python doesn’t allow relying on other packages the way Julia does.

Just curious: can AxisKeys be added to only a single dimension? Can I add them for row indices in DataFrames or any other table?

You can have regular 1:n ranges as axiskeys for all other dimensions, so that selection from keyed array X goes like X(a=1.23, b=4.56, i=123) with 123 being just the regular 1-based index along the i axis.
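A small sketch of what that looks like, assuming AxisKeys is installed (the array contents and key values here are made up for illustration):

```julia
using AxisKeys

# Keys on `a` and `b`; the `i` axis simply uses the plain range 1:3 as its keys.
X = KeyedArray(reshape(1:12, 2, 2, 3); a=[1.23, 2.5], b=[4.56, 7.0], i=1:3)

X(a=1.23, b=4.56, i=2)   # key lookup on every axis; for `i` the key is just the index
X[1, 1, 2]               # the same element via ordinary positional indexing
```

Both calls should pick out the same element, since the key `2` along `i` sits at position 2 of `1:3`.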

What exactly do you mean by “row indices”? I think only IndexedTables.jl has any notion of row indices, but I may be wrong.
I wonder what kind of interface/functionality you look for.
For example:

  • Put a keyed array as a table column – works totally fine with column-based tables (eg StructArrays, DataFrames). But what does it bring?
  • Interpret the whole keyed array as a table?

In the context of tabular operations, keyed arrays are very convenient for multidimensional groupbys/pivot tables:

julia> using AxisKeys, FlexiGroups

# create a simple table:
julia> X = [(a=rand('A':'E'), b=rand(10:15)) for _ in 1:100]
100-element Vector{NamedTuple{(:a, :b), Tuple{Char, Int64}}}:
 (a = 'D', b = 13)
 (a = 'D', b = 15)
 (a = 'A', b = 12)
 (a = 'B', b = 14)
 (a = 'B', b = 15)
 ...

# count all (a, b) pairs
julia> grs = groupmap(x -> (;x.a, x.b), length, X; restype=KeyedArray, default=0)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   a ∈ 5-element Vector{Char}
→   b ∈ 6-element Vector{Int64}
And data, 5×6 Matrix{Int64}:
         (10)  (11)  (12)  (13)  (14)  (15)
  ('A')     3     3     4     7     6     2
  ('B')     3     1     3     3     3     2
  ('C')     9     2     2     3     4     6
  ('D')     2     2     0     4     2     3
  ('E')     5     4     5     2     3     2

# access individual values...
julia> grs(a='A', b=12)
4
# ...and whole subarrays
julia> grs(a='A')
1-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   b ∈ 6-element Vector{Int64}
And data, 6-element view(::Matrix{Int64}, 1, :) with eltype Int64:
 (10)  3
 (11)  3
 (12)  4
 (13)  7
 (14)  6
 (15)  2


# add total margins
julia> addmargins(grs, combine=sum)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   a ∈ 6-element Vector{Union{FlexiGroups.MarginKey, Char}}
→   b ∈ 7-element Vector{Union{FlexiGroups.MarginKey, Int64}}
And data, 6×7 Matrix{Int64}:
           (10)  (11)  (12)  (13)  (14)  (15)   (total)
  ('A')       3     3     4     7     6     2   25
  ('B')       3     1     3     3     3     2   15
  ('C')       9     2     2     3     4     6   26
  ('D')       2     2     0     4     2     3   13
  ('E')       5     4     5     2     3     2   21
  (total)    22    12    14    19    18    15  100

I mean having a table whose first dimension (along rows) has some axis. For example, a timeseries table with heterogeneous column types (not a simple matrix). Knowing that it has a sorted timestamp column, I can use that column as an index for optimal lookups and filters.
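To make the request concrete, here is a rough Base-only sketch of the kind of lookup meant here. The table layout (a NamedTuple of equal-length vectors) and the helper names `row_at` and `rows_between` are hypothetical, chosen only for illustration; the one real assumption is that the `time` column is sorted in ascending order.

```julia
using Dates

# Hypothetical column-oriented table with heterogeneous column types.
tbl = (
    time  = [DateTime(2024, 1, 1, h) for h in 0:5],   # sorted timestamps
    price = [10.0, 10.5, 10.2, 11.0, 10.8, 11.3],
    venue = ["A", "B", "A", "A", "B", "A"],
)

# Exact-match lookup via binary search on the sorted column, O(log n).
function row_at(tbl, t)
    r = searchsorted(tbl.time, t)      # range of positions equal to t
    isempty(r) && throw(KeyError(t))
    i = first(r)
    return map(col -> col[i], tbl)     # NamedTuple with one value per column
end

# Range filter: all rows with lo <= time <= hi, again via binary search.
function rows_between(tbl, lo, hi)
    i = searchsortedfirst(tbl.time, lo)
    j = searchsortedlast(tbl.time, hi)
    return map(col -> col[i:j], tbl)   # NamedTuple of column slices
end

row_at(tbl, DateTime(2024, 1, 1, 2))
rows_between(tbl, DateTime(2024, 1, 1, 1), DateTime(2024, 1, 1, 3))
```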

Is a plain 1-D KeyedArray not enough for this case? You can store whatever elements you like inside, for example individual entries of the timeseries:

T = KeyedArray([(a=1, b=2), (a=3, b=4), ...], t=0.1:0.1:1)
T[1] == (a=1, b=2)  # first element
T(t=0.2) == (a=3, b=4)  # second element

If you need column-major storage, put a StructArray inside a KeyedArray (whole-column access probably won’t be very convenient in this case, but it is still possible).
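A minimal sketch of that combination, assuming both AxisKeys and StructArrays are installed. The data is made up, and keeping a separate reference to the StructArray is just one simple way to reach whole columns; it is not claimed to be the only or the recommended one.

```julia
using AxisKeys, StructArrays

# Column-major storage: one vector per field.
rows = StructArray([(a=1, b=2), (a=3, b=4), (a=5, b=6)])

# Wrap it so that rows can be looked up by timestamp.
T = KeyedArray(rows, t=[0.1, 0.2, 0.3])

T[1]        # first row, by position
T(t=0.2)    # second row, by key
rows.a      # whole column, through the reference to the wrapped StructArray
```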

I’m surprised at how inaccurate this discussion has been given how strong the opinions were.

Like this:

Each of these two wrappers adds functionality explicitly requested by the user: array should be “named” and “keyed”. Nothing is redundant here, even though these two wrappers could potentially be merged into one.

This is an abstract idea of redundancy, and not how it works in practice. When you rewrap an array you have to re-implement a lot of code:

Like all of broadcasting:

And here chain rules:
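As a rough illustration of the broadcasting part, using only Base (the `Labeled` type is hypothetical and not taken from any of the packages discussed): a wrapper that implements just the minimal array interface silently loses its labels under broadcasting.

```julia
# Hypothetical minimal wrapper: an array plus one name per dimension.
struct Labeled{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    names::NTuple{N,Symbol}
end

# The minimal AbstractArray interface: size and scalar indexing.
Base.size(x::Labeled) = size(x.data)
Base.getindex(x::Labeled{T,N}, i::Vararg{Int,N}) where {T,N} = x.data[i...]

x = Labeled([1.0, 2.0], (:t,))
y = x .+ 1    # falls back to the default broadcast machinery

y isa Labeled          # false: the result is a plain Vector{Float64}
```

Keeping the wrapper through `x .+ 1` requires opting in to Julia's broadcast interface (a `Base.BroadcastStyle` method and a matching `similar`), and each additional wrapper layer has to do that again for itself.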

We need a discussion about factual problems so that people can assess options accurately, not attempts to win arguments for a preferred option.

The clearest outcome of this discussion for me is that, in future, it’s best if it happens between developers who know both the codebases and their problems, have a vested interest in reducing their own workload in the long term, and also usually don’t care as much.

If you see something important and inaccurate, don’t hesitate to at least point that out. As for that specific point, I don’t really see any inaccuracies there.

I don’t have to do that, and neither do other users of the library. It’s only done once by the author, and that’s it. Maybe it can even be simplified, but… it works!

From the PoV of the user, nothing is redundant in KeyedArray-of-NamedDimsArray. The first wrapper specifies that each dimension has axiskeys, the second that each dimension is named. These things are completely orthogonal.
An even simpler and more general approach would be

struct KeyedArray{A<:AbstractArray, K}
    data::A
    # Tuple of AbstractVectors for unnamed dims, NamedTuple for named dims;
    # a custom type could allow partially named dims
    axiskeys::K
end

but AFAIK there are no packages like this.
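For concreteness, the single-wrapper layout sketched above could look roughly like this in plain Julia. Everything here is hypothetical: the names `SimpleKeyed`, `lookup` and `lookup_named` are made up for the example and do not come from AxisKeys, NamedDims or DimensionalData.

```julia
# One wrapper: the data plus per-dimension keys.
# `axiskeys` is a Tuple for unnamed dims or a NamedTuple for named dims.
struct SimpleKeyed{T,N,A<:AbstractArray{T,N},K} <: AbstractArray{T,N}
    data::A
    axiskeys::K
end

# Minimal AbstractArray interface: forward to the wrapped array.
Base.size(x::SimpleKeyed) = size(x.data)
Base.getindex(x::SimpleKeyed{T,N}, i::Vararg{Int,N}) where {T,N} = x.data[i...]

# Positional key lookup, one key per dimension (linear scan per axis;
# indexing errors if a key is not found).
function lookup(x::SimpleKeyed, ks...)
    idx = map((axis, k) -> findfirst(isequal(k), axis), Tuple(x.axiskeys), ks)
    return x.data[idx...]
end

# Named key lookup; only meaningful when `axiskeys` is a NamedTuple.
function lookup_named(x::SimpleKeyed; kw...)
    nt = values(kw)
    return lookup(x, map(n -> nt[n], keys(x.axiskeys))...)
end

named   = SimpleKeyed([1 2 3; 4 5 6], (a=['A', 'B'], b=[10, 20, 30]))
unnamed = SimpleKeyed([1, 2, 3], (["x", "y", "z"],))

lookup(named, 'B', 20)
lookup_named(named; b=30, a='A')
lookup(unnamed, "y")
```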

Well… DimensionalData is like this, except it also has a bit of metadata.

Python is bad at composability in some cases and good in others. Both standardization (which makes inputs more predictable and simplifies testing) and language features like multiple dispatch can make composability easier. This thread started because the lack of standardization in Julia packages created a situation where basic features (like array-labeling) were not composable (among other problems).

Both xarray and Pandas implement their own index types because having a bit of extra metadata is useful, for example in plotting.

If you think that’s the case, and the design of DimensionalData.jl could be simplified, feel free to make a PR. Alternatively, you can rewrite AxisKeys.jl to add the features in DimensionalData like NetCDF support and the extra flexibility. However, do note “ForwardOrdered” is definitely not redundant (consider a time series with unevenly-spaced points, which won’t be handled fully efficiently by AxisKeys), and neither is DimPoints.

It’s also worth noting that DimensionalData.jl totals 14k lines of code, which is pretty much equal to the 12k lines of code in AxisKeys+NamedDims+AxisSets (which taken together provide roughly the same features).

And AxisKeys is also like this, except it has two wrappers and not one (: In the end, its structure is much lighter, as I showed before.

What does it mean for array-labeling to be composable?
Can it be solved with the Tables.jl-like approach of defining the same interface for basic operations?
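To show what such a shared interface might mean, here is a Base-only sketch. The function names `dimnames`, `axiskeys` and `select` are defined locally for the example, and `Tagged` is a hypothetical array type; none of this is an existing standard. Generic code is written against two accessor functions only, so any array type that implements them can be used with it.

```julia
# The "interface": two accessor functions, with fallbacks for plain arrays.
dimnames(A::AbstractArray{T,N}) where {T,N} = ntuple(i -> Symbol(:dim_, i), N)
axiskeys(A::AbstractArray) = axes(A)

# Generic selection by name and key, written only against the interface.
function select(A::AbstractArray; kw...)
    nt = values(kw)
    idx = map(dimnames(A), axiskeys(A)) do name, ks
        haskey(nt, name) ? findfirst(isequal(nt[name]), ks) : Colon()
    end
    return A[idx...]
end

# A hypothetical labeled array type opting in to the interface.
struct Tagged{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
    names::NTuple{N,Symbol}
    keys::NTuple{N,AbstractVector}
end
Base.size(x::Tagged) = size(x.data)
Base.getindex(x::Tagged{T,N}, i::Vararg{Int,N}) where {T,N} = x.data[i...]
dimnames(x::Tagged) = x.names
axiskeys(x::Tagged) = x.keys

M = Tagged([1 2 3; 4 5 6], (:a, :b), (['A', 'B'], [10, 20, 30]))

select(M; a='B', b=20)       # single element
select(M; b=30)              # whole column, `a` left unspecified
select([1 2; 3 4]; dim_1=2)  # plain Matrix through the fallbacks
```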

I don’t think it’s possible to make a reasonable PR to a package that I’ve never used aside from a cursory look, and whose internals I’m not familiar with.

And get yet another keyed arrays package? (:
Also, I think AxisKeys is flexible and extensible as it is, at least it is enough for SkyImages to represent a wide variety of images one encounters in astronomy.

Surely it can be handled in AxisKeys, perfectly efficiently! Just use Julia composability powers, there’s no need to cram everything into one monolith package as Python does.

julia> using AxisKeys, UniqueVectors, BenchmarkTools

julia> xs = rand(10^4);
julia> ts = sort(rand(10^4));
julia> t = ts[1234]

# naive full-scan search:
julia> X = KeyedArray(xs, t=ts);
julia> @btime $X(t)
  960.750 ns (2 allocations: 48 bytes)
0.25168764888077866

# optimized hash-based search:
julia> Y = KeyedArray(xs, t=UniqueVector(ts));
julia> @btime $Y(t)
  30.120 ns (2 allocations: 48 bytes)
0.25168764888077866

Same with all other custom wrappers: they don’t have to be there by default, which keeps the basic structure lean and clean. Instead, they can live separately and be used when needed.