Commit to a common syntax for accessing additional index mapping information on axes

In AxisArrays.jl it’s axisvalues(data, dim); in AxisKeys.jl it’s axiskeys(data, dim); in DimensionalData.jl it’s dims(data, dim).val; in NamedArrays.jl it’s names(data, dim). Whatever you call it, it would be great to have a consistent way of extracting it for end users, because we’ve been talking about it and implementing it in different ways for years now (WIP: The Plan · Issue #1 · JuliaCollections/AxisArraysFuture · GitHub). For quite a while I’ve wanted to implement this syntax in ArrayInterface.jl and have recently made active efforts to do so (`index_labels` method and `IndexLabel` type indexing by Tokazama · Pull Request #328 · JuliaArrays/ArrayInterface.jl · GitHub, `axis_keys` · Issue #250 · JuliaArrays/ArrayInterface.jl · GitHub). However, we need to agree on a name for this.

The ambiguity in naming may mislead some newcomers to this conversation, so I want to be very clear about what we want to access with this new syntax (for the sake of unbiased examples, I’ll just call it newsyntax here).

  • The return of newsyntax(data) is a direct mapping to axes(data). Therefore, the following properties should hold:
    • length(newsyntax(data)) == ndims(data)
    • length(newsyntax(data, dim)) == length(axes(data, dim))
    • axes(data, dim)[index] == findfirst(==(newsyntax(data, dim)[index]), newsyntax(data, dim))
    • allunique(newsyntax(data, dim))
  • This is NOT intended for general information that may be mapped to axes (even if the above properties are valid). Something like the description of each column in a data set is not the intended use case.
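
To make the properties above concrete, here is a minimal sketch using only Base. The LabelledWrapper type and the satisfies_properties helper are made up for illustration and are not part of any of the packages mentioned; newsyntax is still just the placeholder name.

```julia
# Hypothetical wrapper type, for illustration only: an array plus one
# collection of labels per dimension.
struct LabelledWrapper{T,N,A<:AbstractArray{T,N},L<:Tuple} <: AbstractArray{T,N}
    data::A
    labels::L
end
Base.size(x::LabelledWrapper) = size(x.data)
Base.getindex(x::LabelledWrapper{T,N}, i::Vararg{Int,N}) where {T,N} = x.data[i...]

# `newsyntax` is the placeholder name used in this post.
newsyntax(x::LabelledWrapper) = x.labels
newsyntax(x::LabelledWrapper, dim::Int) = x.labels[dim]

# Check the listed properties for every dimension of `x`.
function satisfies_properties(x)
    length(newsyntax(x)) == ndims(x) || return false
    for dim in 1:ndims(x)
        lbls = newsyntax(x, dim)
        length(lbls) == length(axes(x, dim)) || return false
        allunique(lbls) || return false
        for index in eachindex(lbls)
            axes(x, dim)[index] == findfirst(==(lbls[index]), lbls) || return false
        end
    end
    return true
end

data = LabelledWrapper([1 2 3; 4 5 6], ([:L, :R], 10:10:30))
satisfies_properties(data)
```

A wrapper whose labels repeat along one dimension, e.g. ([:a, :a], 1:2), would fail the allunique check in this sketch.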

In lieu of fully reviewing previous conversations, here’s a quick overview of what has been discussed previously concerning this:

  • Something with “keys” has a very close meaning to what we’re trying to access with this new syntax. However, keys(::AbstractArray) already exists, and there are fairly compelling reasons for the current behavior (https://github.com/JuliaLang/julia/pull/36073). So if we use “keys” in the name it needs to be clearly differentiated from keys(data) somehow.
  • “names”: already used for something with a completely different meaning in Base and a bit awkward when the return value isn’t clearly naming something (e.g., time-points).
  • “labels” is also close in meaning to what we are interested in here. It’s a bit less clear that it is intended to support an extra mapping that has a key => index relationship, and LabelledArrays.jl already has a bit of a claim on “labels” (so we should get the go-ahead from @ChrisRackauckas that this would gel with the SciML ecosystem first).
  • We’ve yet to explore a lot of other options (tags, tokens, markers, etc.). But the exact thing we want shouldn’t conflict with more appropriate use of the term (GitHub - andyferris/Dictionaries.jl: An alternative interface for dictionaries in Julia, for improved productivity and performance).

It would be extremely helpful to get some feedback on what works best overall, so please vote below. If you choose to vote for one of the following options but would prefer some variant of what’s provided, please comment.

  • axiskeys
  • names
  • labels
  • Other (please comment)

Thank you!

I’ve used these packages before but they’re not fresh in my mind. It would be easier to respond to this given some examples using the function as well as some of the related functions.

The Images docs have some examples where this would return time points for the time axis and spatial positions for the spatial axes. I believe @Raf uses this feature in DimensionalData.jl primarily for referring to geographical data (which could do something like newsyntax(data) = (latitude, longitude)). A number of spatiotemporal features can be derived from this, such as spatial intervals (also see References · JuliaImages). I think AxisKeys.jl is used by some statistical packages to label model parameters along one axis and iterations along another. In the simplest case this would provide some mapping between feature names and data, just as DataFrames does.

I’ll try making a simple example of using keyed arrays and these functions.
Suppose you have measured counts at different times (time=10, 20, 30) and with different polarizations (L and R). This data is naturally represented as a keyed array:

julia> A = KeyedArray([1 2 3; 4 5 6], pol=[:L, :R], time=10:10:30)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   pol ∈ 2-element Vector{Symbol}
→   time ∈ 3-element StepRange{Int64,...}
And data, 2×3 Matrix{Int64}:
        (10)  (20)  (30)
  (:L)     1     2     3
  (:R)     4     5     6

The function AxisKeys.axiskeys accesses the “key” values for each axis:

julia> axiskeys(A, :pol)
2-element Vector{Symbol}:
 :L
 :R

julia> axiskeys(A, :time)
10:10:30

julia> axiskeys(A)
([:L, :R], 10:10:30)

Also, there is a convenient named_axiskeys function, not sure if it’s in scope here:

julia> named_axiskeys(A)
(pol = [:L, :R], time = 10:10:30)

These examples use AxisKeys.jl, and @Zach_Christensen discusses generalizing the function to alternative implementations of keyed arrays.

Among the poll variants, I think both labels and names are too specific. As you say, it would be weird to call fundamentally continuous values – times, distances, or latitudes – “labels” or “names”. Keys/axiskeys/axisvalues sound more abstract and general.

Interestingly, @mcabbott (the author of AxisKeys.jl) doesn’t love axiskeys and suggested “labels”.

Anyway, standardization itself is more important than name differences between several reasonable variants. For example, one could argue that Base.keys should really be indices instead, and this axiskeys should just be keys. But it takes little time to get used to either.

I completely agree. There are some names that sound a bit better to me but I’m mostly interested in getting something standardized that we can build on.

Thanks for taking the initiative. It would indeed be good to standardize this. BTW, it would also be useful to have a standard API to build such arrays and to convert from one type to another. That would allow e.g. using FreqTables.jl to create a frequency table stored in an AxisArray instead of a NamedArray (which is the only supported type currently).

Regarding the function name, axiskeys isn’t distinct enough from keys (as you note). keys already returns keys for each axis, so adding “axis” to the name doesn’t add anything IMO. Maybe something like axisnames/axislabels or dimnames/dimlabels? dimnames is used by R FWIW.

Also note that several packages support giving names to dimensions themselves (i.e. a single name attached to dimension N, as opposed to one name for each slice on that dimension) so it would be good to ensure we can find a name for that too (like dimnames in NamedArrays). One solution is simply to have newsyntax(data) return a named tuple whose names are dimension names.

Can you explain this distinction? I thought I understood what this is about until this paragraph. Is this paragraph trying to make a distinction between indexing the data through the keys/labels/values vs. other uses?

Of the mentioned packages, I’ve only ever used NamedArrays to label rows and columns (as well as block-rows and block-columns when also using BlockArrays, by labeling the rows and columns of blocks(matrix::BlockMatrix)), those labels being later used when viewing the data (but not as keys for accessing it). If I understand correctly, this seems to be a use case that you dismiss here (“description of each column in data set”). In the example by @aplavin below, :L and :R can also be seen as a description of each row in a data set. Are these uses not what this thread is about?

I agree, but I’m trying to solve this in small, agreeable steps so we don’t continue generating new packages for the same thing here.

axislabels makes sense to me since it implies a relationship with axes.

I’m not saying this “key” can’t be descriptive, but each should be unique from all other values in the collection and typically used as a mapping into the space. So a data dictionary wouldn’t really be the intended use case here because no one is going to look up a value with a long sentence. It also wouldn’t make sense to have non-unique values, just as you wouldn’t have multiple columns with the same name in a table.

I said this application was “not intended” for general information. I chose that wording very carefully because:

  1. I don’t want to get too distracted by every possible application loosely related to this. We have been trying to get an AxisArrays type system standardized for years and this is only one small step toward solving this.
  2. “not intended” is more permissive than “forbidden” or “disallowed”. If you want to attach a stanza to each element of your axislabels/axiskeys/etc. that’s fine. Just don’t expect any special effort to be put into optimizing or debugging problems that arise when newsyntax(data) returns something like that.

If you have a related but diverging application you think will benefit the entire community and needs to be discussed more thoroughly, I’d love to discuss it in a new topic or github issue. It might be a good application for the metadata API in development.

DimensionalData.jl actually has a method lookup(A, 2) for this, and the objects are called LookupArrays. This came from trying to avoid key and index, and from label being too generic.

If you think there’s a better name there I’m open to it. Like axislookups

Pandas uses labels and levels.

@Zach_Christensen in addition to naming, there is also a decision to make regarding the behavior of the function under discussion. Should it return a NamedTuple when the dimensions themselves are named, or should the result always be a Tuple?
I personally like the first variant more: getting “axis keys” together with dimension names is definitely a useful piece of functionality, and having two separate functions (Tuple and NamedTuple ones) seems like an unnecessary increase of API surface.

We have a dimnames method already and the convention is to have :_ be an unnamed dimension. Multiple unnamed dims wouldn’t work with a NamedTuple.
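
A small sketch of why that is, using only Base. The named_labels helper is hypothetical and only pairs a tuple of dimension names with a tuple of labels:

```julia
# Hypothetical helper: pair dimension names with per-axis labels.
named_labels(names::Tuple{Vararg{Symbol}}, labels::Tuple) = NamedTuple{names}(labels)

# Distinct dimension names work as expected.
ok = named_labels((:pol, :time), ([:L, :R], 10:10:30))

# Two unnamed dimensions, both marked :_, would need the same field name twice,
# which NamedTuple rejects by throwing an error.
failed = try
    named_labels((:_, :_), ([:L, :R], 10:10:30))
    false
catch err
    true
end
```

A single unnamed dimension is not a problem in this sketch; the error only appears once :_ shows up more than once.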

Maybe we could figure out a workaround for this. I might be able to put together a small package for this if it’s really necessary.

We could probably figure out how to support multiple levels at some point, but I think that implementation has issues. Not enforcing unique keys makes it ambiguous whether one is accessing an element or a collection, not to mention performance issues. But I am interested in that sort of thing once we have this syntax in place.

I’m in favour of returning a NamedTuple to get both the dim names and lookup values from one method.

But yes, unnamed dimensions are a problem for that, unless we adopt a numbering convention.

Extents.jl uses a NamedTuple-like wrapper object for a similar purpose, combining dimension names and extent bounds in one object. Returning a NamedTuple was avoided to make dispatch easier.

We would likely have two methods if we support that: one for a named version and one that returns a plain tuple.
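
Roughly, the two-method idea could look like the sketch below. Everything here is hypothetical: AxisInfo is a stand-in container, named_newsyntax is a made-up name, and the :dim_N fallback is just one possible numbering convention for unnamed (:_) dimensions, not something that has been agreed on.

```julia
# Hypothetical container: dimension names plus per-axis labels.
struct AxisInfo{N}
    dimnames::NTuple{N,Symbol}
    labels::NTuple{N,Any}
end

# Plain-tuple version: always works, dimension names are ignored.
newsyntax(info::AxisInfo) = info.labels

# Named version: an unnamed dimension (:_) gets a positional name such as :dim_2
# (an assumed convention, for illustration only).
function named_newsyntax(info::AxisInfo{N}) where {N}
    names = ntuple(i -> info.dimnames[i] === :_ ? Symbol(:dim_, i) : info.dimnames[i], N)
    return NamedTuple{names}(info.labels)
end

info = AxisInfo((:pol, :_), ([:L, :R], 10:10:30))
plain = newsyntax(info)
named = named_newsyntax(info)
```

With this convention several unnamed dimensions no longer collide, at the cost of inventing names that the user never chose.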