The ambiguity in naming may mislead some newcomers to this conversation, so I want to be very clear about what we want to access with this new syntax. (For the sake of unbiased examples, I’ll just call it newsyntax here.)
The return of newsyntax(data) is a direct mapping to axes(data). Therefore, the following properties should hold:
This is NOT intended for general information that may be mapped to axes (even if the above properties hold). Something like the description of each column in a data set is not the intended use case.
In lieu of fully reviewing previous conversations, here’s a quick overview of what has been discussed concerning this:
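To make “a direct mapping to axes(data)” concrete, here is a minimal sketch in plain Julia. The `Keyed` wrapper and the `newsyntax` function are hypothetical names used only for illustration, and the three checks are my assumptions about what such a mapping implies (one key collection per axis, matching lengths, unique keys), not a settled specification.

```julia
# Hypothetical wrapper: an array plus one collection of keys per dimension.
struct Keyed{A<:AbstractArray,K<:Tuple}
    data::A
    keys::K
end

Base.axes(x::Keyed) = axes(x.data)
Base.ndims(x::Keyed) = ndims(x.data)

# Placeholder name for the function under discussion.
newsyntax(x::Keyed) = x.keys

data = Keyed([1 2 3; 4 5 6], ([:L, :R], 10:10:30))

# Assumed property 1: one key collection per axis.
@assert length(newsyntax(data)) == ndims(data)

# Assumed property 2: each key collection is as long as its axis.
@assert all(map((k, ax) -> length(k) == length(ax), newsyntax(data), axes(data)))

# Assumed property 3: keys along an axis are unique.
@assert all(allunique, newsyntax(data))
```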
Something with “keys” has a very close meaning to what we’re trying to access with this new syntax. However, keys(::AbstractArray) already exists, and there are fairly compelling reasons for the current behavior (https://github.com/JuliaLang/julia/pull/36073). So if we use “keys” in the name it needs to be clearly differentiated from keys(data) somehow.
“names”: already used for something with a completely different meaning in Base and a bit awkward when the return value isn’t clearly naming something (e.g., time-points).
“labels” is also close in meaning to what we are interested in here. It’s a bit less clear that it is intended to support an extra mapping that has a key => index relationship, and LabelledArrays.jl already has a bit of a claim on “labels” (so we should get the go-ahead from @ChrisRackauckas that this would gel with the SciML ecosystem first).
It would be extremely helpful to get some feedback on what works best overall, so please vote below. If you choose to vote for one of the following options but would prefer some variant of what’s provided, please comment.
I’ve used these packages before but they’re not fresh in my mind. It would be easier to respond to this given some examples that use the function as well as some of the related functions.
The Images docs have some examples where this would return time points for the time axis and spatial positions for the spatial axes. I believe @Raf uses this feature in DimensionalData.jl primarily for referring to geographical data (which could do something like newsyntax(data) = (latitude, longitude)). A number of spatiotemporal features can be derived from this, such as spatial intervals (also see References · JuliaImages). I think AxisKeys.jl is used by some statistical packages to label model parameters along one axis and iterations along another. In the simplest case this would provide some mapping between feature names and data, just as DataFrames does.
I’ll try making a simple example of using keyed arrays and these functions.
Suppose you have measured counts at different times (time=10, 20, 30) and with different polarizations (L and R). This data is naturally represented as a keyed array:
julia> A = KeyedArray([1 2 3; 4 5 6], pol=[:L, :R], time=10:10:30)
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓ pol ∈ 2-element Vector{Symbol}
→ time ∈ 3-element StepRange{Int64,...}
And data, 2×3 Matrix{Int64}:
(10) (20) (30)
(:L) 1 2 3
(:R) 4 5 6
The function AxisKeys.axiskeys accesses the “key” values for each axis:
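A short sketch of what that looks like for the array above, assuming AxisKeys.jl is installed. The output comments reflect my understanding of the package's behaviour rather than a captured session.

```julia
using AxisKeys  # external package, assumed to be installed

A = KeyedArray([1 2 3; 4 5 6], pol=[:L, :R], time=10:10:30)

axiskeys(A)         # keys for every axis: ([:L, :R], 10:10:30)
axiskeys(A, 2)      # keys along the second axis only: 10:10:30

# The keys can then be used to look up data (round-bracket indexing in AxisKeys).
A(pol=:R, time=20)  # the count for polarization R at time 20, i.e. 5
```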
Among the poll variants, I think both labels and names are too specific. As you say, it would be weird to call fundamentally continuous values – times, distances, or latitudes – “labels” or “names”. Keys/axiskeys/axisvalues sound more abstract and general.
Anyway, standardization itself is more important than name differences between several reasonable variants. For example, one could argue that Base.keys should really be indices instead, and this axiskeys should just be keys. But it takes little time to get used to either.
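For reference, this is what Base currently returns for plain arrays, which is why a bare `keys` cannot simply be reused for the new meaning. Only Base functions are used here.

```julia
M = [1 2 3; 4 5 6]
v = [10, 20, 30]

# keys of an array are its valid indices, not per-axis key values.
@assert keys(M) == CartesianIndices((2, 3))
@assert collect(keys(v)) == [1, 2, 3]

# axes gives one index range per dimension.
@assert axes(M) == (Base.OneTo(2), Base.OneTo(3))

# Each "key" maps back to an element.
@assert M[last(keys(M))] == 6
```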
I completely agree. There are some names that sound a bit better to me but I’m mostly interested in getting something standardized that we can build on.
Thanks for taking the initiative. It would indeed be good to standardize this. BTW, it would also be useful to have a standard API to build such arrays and to convert from one type to another. That would allow e.g. using FreqTables.jl to create a frequency table stored in an AxisArray instead of a NamedArray (which is the only supported type currently).
Regarding the function name, axiskeys isn’t distinct enough from keys (as you note). keys already returns keys for each axis, so adding “axis” to the name doesn’t add anything IMO. Maybe something like axisnames/axislabels or dimnames/dimlabels? dimnames is used by R FWIW.
Also note that several packages support giving names to dimensions themselves (i.e. a single name attached to dimension N, as opposed to one name for each slice on that dimension) so it would be good to ensure we can find a name for that too (like dimnames in NamedArrays). One solution is simply to have newsyntax(data) return a named tuple whose names are dimension names.
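A rough sketch of that last idea, using only Base. The name `newsyntax` is the thread's placeholder and `NamedKeyed` is a hypothetical type; the point is just that one NamedTuple can carry both the dimension names and the per-axis keys.

```julia
# Hypothetical wrapper whose keys are stored as a NamedTuple:
# the names are dimension names, the values are the keys along each dimension.
struct NamedKeyed{A<:AbstractArray,K<:NamedTuple}
    data::A
    keys::K
end

newsyntax(x::NamedKeyed) = x.keys

data = NamedKeyed([1 2 3; 4 5 6], (pol = [:L, :R], time = 10:10:30))

nt = newsyntax(data)
@assert keys(nt) == (:pol, :time)             # dimension names
@assert values(nt) == ([:L, :R], 10:10:30)    # per-axis keys as a plain Tuple
@assert nt.time == nt[2] == 10:10:30          # access by name or by position
```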
Can you explain this distinction? I thought I understood what this is about until this paragraph. Is this paragraph trying to make a distinction between indexing the data through the keys/labels/values vs. other uses?
Of the mentioned packages, I’ve only ever used NamedArrays to label rows and columns (as well as block-rows and block-columns when also using BlockArrays, by labeling the rows and columns of blocks(matrix::BlockMatrix)), those labels being later used when viewing the data (but not as keys for accessing it). If I understand correctly, this seems to be a use case that you dismiss here (“description of each column in data set”). In the example by @aplavin below, :L and :R can also be seen as a description of each row in a data set. Are these uses not what this thread is about?
I’m not saying this “key” can’t be descriptive, but each should be unique from all other values in the collection and typically used as a mapping into the space. So a data dictionary wouldn’t really be the intended use case here because no one is going to look up a value with a long sentence. It also wouldn’t make sense to have non-unique values, just as you wouldn’t have multiple columns with the same name in a table.
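To illustrate why uniqueness matters for a key => index mapping, here is a small Base-only sketch. `key_to_index` is a hypothetical helper, not an existing API.

```julia
# Hypothetical helper: translate a key along one axis into its integer index.
function key_to_index(axis_keys, key)
    i = findfirst(isequal(key), axis_keys)
    i === nothing && throw(KeyError(key))
    return i
end

times = 10:10:30
@assert key_to_index(times, 20) == 2

# With repeated keys the lookup is ambiguous: is :a one element or two?
dup = [:a, :b, :a]
@assert findall(isequal(:a), dup) == [1, 3]
@assert key_to_index(dup, :a) == 1   # silently picks the first match
```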
I said this application was “not intended” for general information. I chose that wording very carefully because:
I don’t want to get too distracted by every possible application loosely related to this. We have been trying to get an AxisArrays type system standardized for years and this is only one small step toward solving this.
“not intended” is more permissive than “forbidden” or “disallowed”. If you want to attach a stanza to each element of your axislabels/axiskeys/etc., that’s fine. Just don’t expect any special effort to go into optimizing or debugging a newsyntax(data) that returns that.
If you have a related but diverging application you think will benefit the entire community and needs to be discussed more thoroughly, I’d love to discuss it in a new topic or GitHub issue. It might be a good application for the metadata API in development.
DimensionalData.jl actually has a method lookup(A, 2) for this, and the objects are called LookupArrays. This came from trying to avoid key and index, and also label, for being too generic.
@Zach_Christensen in addition to naming, there is also a decision regarding the discussed function behavior. Should it return NamedTuples when the dimensions themselves are named, or should the result always be a Tuple?
I personally like the first variant more: getting “axis keys” together with dimension names is definitely a useful piece of functionality, and having two separate functions (Tuple and NamedTuple ones) seems like an unnecessary increase of API surface.
We could probably figure out how to support multiple levels at some point but I think that implementation has issues. Not enforcing unique keys makes it ambiguous whether one is accessing an element or a collection, not to mention performance issues. But I am interested in that sort of thing once we have this syntax in place.
I’m in favour of returning NamedTuple to get both the dim names and lookup values from one method.
But yes, unnamed dimensions are a problem for that, unless we adopt a numbering convention.
Extents.jl uses a NamedTuple-like wrapper object for a similar purpose, combining dimension names and extent bounds in one object. Returning NamedTuple was avoided to make dispatch easier.
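As a sketch of the dispatch argument (not Extents.jl's actual implementation), wrapping the NamedTuple in a dedicated type lets methods target it without also catching every other NamedTuple. `AxisLookup` and `describe` are hypothetical names.

```julia
# Hypothetical wrapper type around a NamedTuple of per-dimension keys.
struct AxisLookup{K<:NamedTuple}
    keys::K
end

# Methods can now distinguish "axis keys" from arbitrary named tuples.
describe(::NamedTuple) = "some named tuple"
describe(::AxisLookup) = "axis keys with dimension names"

nt = (pol = [:L, :R], time = 10:10:30)
@assert describe(nt) == "some named tuple"
@assert describe(AxisLookup(nt)) == "axis keys with dimension names"

# The wrapped NamedTuple is still available when the names are needed.
@assert keys(AxisLookup(nt).keys) == (:pol, :time)
```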