The fate of DimensionalArrays / AxisArrays in Julia, and which to actually use

I wasn’t sure whether I should post this in “Domains” or “Community”, given how many different areas it touches.

As far as I can tell, the story of using array data coupled with information about its dimensions is not at all settled in Julia. This is evident from the numerous in-use or under-development packages, e.g. AxisArrays, AxisArraysFuture, NamedDims.jl, DimensionalData.jl (and its derivatives), and NamedArrays.jl, to name a few (there are probably more, and I’ll be editing this part of the post as I learn more).
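To make concrete what “array data coupled with information about dimensions” means, here is a minimal sketch in plain Base Julia, with no packages loaded. The names `coords` and `select` are hypothetical and chosen only for illustration; they are not the API of any of the packages listed above.

```julia
# A plain matrix plus a NamedTuple holding the coordinates of each dimension.
# Rows are latitudes, columns are years.
temps = [280.1 281.4 282.0;
         275.3 276.8 277.5]
coords = (lat = [10.0, 20.0], time = [2001, 2002, 2003])

# Hypothetical helper: index `data` by coordinate value instead of position.
# Dimensions that are not mentioned are kept whole (like `:`).
function select(data::AbstractArray, coords::NamedTuple; selectors...)
    idx = map(keys(coords)) do name
        haskey(selectors, name) || return Colon()
        i = findfirst(==(selectors[name]), coords[name])
        i === nothing && throw(KeyError(selectors[name]))
        return i
    end
    return data[idx...]
end

select(temps, coords; lat = 20.0, time = 2002)  # a single value
select(temps, coords; time = 2001)              # all latitudes for one year
```

The packages in question wrap this kind of bookkeeping into an array type, so the dimension information travels with the data through indexing, reductions and broadcasting rather than living in a separate variable.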

To be honest, I was a bit sad about this situation, given that in Python, for example, xarray is an “established standard” with stellar documentation. But mostly I was surprised, given how useful these kinds of array structures are in so many different fields.

So I am wondering what all these developers have to say about this disparity, and whether we can, as a community, arrive at some kind of gold standard for dimensional arrays in Julia by channeling our efforts into one great thing. I am not a developer on this topic (although I’d like to be), but as a user I’d like to see some statements from the core developers on the status quo, the near-future plans and, honestly, which package users should use. (From my perspective this started because of the need for a standardized conversion of .nc files into dimensional arrays in NCDatasets.jl.)


This has been discussed extensively on GitHub here:
https://github.com/JuliaCollections/AxisArraysFuture/issues/1

And here a few months ago:
https://discourse.julialang.org/t/status-of-axisarrays-jl/

People mostly agree that AxisArrays is inadequate, but for a bunch of different reasons. Mostly we decided to keep working separately (but copying good ideas) and check back in later. There are multiple groups with different goals, and reaching a single consensus solution is premature and really not worth the effort until it’s clearer what the best strategy is. The code isn’t that hard to write; knowing exactly how it should behave so that it suits all use cases is the issue. Also, competitive cooperation is often a good thing, not some problem we have to fix. I’ve learned a lot from all of the other efforts.

xarray has stellar documentation because it has been funded. If someone paid me to work on DimensionalData.jl, it would too. But it’s currently just my own side project. And until there is a consensus approach, again, it’s not really worth putting in the effort to reach that kind of documentation standard.

Edit: furthermore, GeoData.jl already does what you want for NCDatasets.jl. I’m actually kind of confused about why you are avoiding it without at least trying it out. The tests go through everything you need to load NetCDF files with dimensional indexing. Your reasoning was exactly mine for writing it in the first place. (Just don’t use GDAL with it lol, lots of changes are happening over there, so it’s currently broken.)


Cool, thanks for the heads up. Can someone also make a statement on the stability of all these different options? The “WIP: The Plan” issue you cite mostly discusses design decisions that could be improved, or similar ideas. I couldn’t deduce the stability status of most of these packages by looking at the READMEs. Sure, at the moment they all seem fine, but I don’t know whether the devs intend to completely drop support for something (which seems to be the case for e.g. AxisArrays due to AxisArraysFuture) or anything like that.

> Edit: furthermore, GeoData.jl already does what you want for NCDatasets.jl. I’m actually kind of confused about why you are avoiding it without at least trying it out. The tests go through everything you need to load NetCDF files with dimensional indexing. Your reasoning was exactly mine for writing it in the first place. (Just don’t use GDAL with it lol, lots of changes are happening over there, so it’s currently broken.)

As I said, I’d be happy to use it! I am a fanatic about not rewriting code that already exists. I read the docs and just didn’t understand much. I’ll try again on Monday with the tests, as you suggested (this is the first time I have been told to learn something through the test suite; I typically go for the docs).


The hot new things don’t have docs yet, but at least in Julia they have tests :slight_smile:

Edit: Well, they actually do have docs, but yeah, I know a list of methods and types doesn’t help that much.

Roughly speaking, it is intentional that we have a plethora of options right now.
It was decided that we would explore all the options in parallel, then regroup in a year or so and draw some conclusions.

A Birds of a Feather session is planned for JuliaCon.
I imagine that we will still have another year of exploring before the dust settles.

Tabular data looked the same for a long time.

Everything basically should just work with everything.

I am quite happy with NamedDims.
It’s fully functioning. I still want to explore some ideas for changing how the names are represented, but for now it works well.
It’s well tested with Flux and some statistics operations, and we run it in production.

I will be happier once we start integrating IndexedDims into our production system, so I know that it works in practice as well.

And then we will see how the two work together.


Sounds good. I hope all involved package authors are able to go to JuliaCon and join the session. To start converging on an approach, comparative benchmarks will also be important. Luckily, I already saw some in AxisRanges.jl/test/speed.jl.

Since Python’s xarray was mentioned, I just want to make the point that naming your dimensions and adding coordinates along them is only part of the appeal of xarray. The CDF data model and, more importantly, the integration with dask for lazy and parallel processing of large datasets are also key elements. In Julia I guess this ability could, for instance, be powered by Dagger or Dispatcher. Hopefully, though, we can park that in the back of our minds and trust in Julia’s strength in composing these elements when we get there.


NamedDims has tests to show that it has no overhead at runtime,
i.e. that the wrapper compiles away so that it is not there at all.
One of them is currently marked @test_broken, I will admit, but most are not, and the goal of the package is to have zero overhead.

(Those tests are thanks to @mcabbott, who also wrote AxisRanges.jl. I am happy to have them, because if a new version of Julia breaks inference, it gets picked up as a regression.)
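As a rough illustration of why a name wrapper can in principle be free at runtime, here is a toy sketch in plain Base Julia. It is not the NamedDims.jl implementation: `ToyNamedArray`, `dimnames`, `dim` and `sumover` are hypothetical names made up for this example. The idea it shows is that the names are stored as a type parameter, so the wrapper holds nothing but the data, and the name lookup depends only on the type.

```julia
# Toy wrapper for illustration only; NOT the NamedDims.jl implementation.
# The dimension names are a type parameter, so the struct stores only the data.
struct ToyNamedArray{Names,T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

# Convenience constructor, used as ToyNamedArray{(:lat, :time)}(matrix).
function ToyNamedArray{Names}(data::AbstractArray{T,N}) where {Names,T,N}
    length(Names) == N || throw(ArgumentError("need one name per dimension"))
    return ToyNamedArray{Names,T,N,typeof(data)}(data)
end

# Minimal AbstractArray interface: forward everything to the wrapped array.
Base.size(a::ToyNamedArray) = size(a.data)
Base.getindex(a::ToyNamedArray{Names,T,N}, I::Vararg{Int,N}) where {Names,T,N} =
    a.data[I...]

# The names are read from the type, not from a stored field.
dimnames(::ToyNamedArray{Names}) where {Names} = Names

# Translate a dimension name into its position.
function dim(a::ToyNamedArray, name::Symbol)
    i = findfirst(==(name), dimnames(a))
    i === nothing && throw(ArgumentError("no dimension named $name"))
    return i
end

# Reduce over a dimension given by name instead of by number.
sumover(a::ToyNamedArray, name::Symbol) = sum(a.data; dims = dim(a, name))

x = ToyNamedArray{(:lat, :time)}([1 3 5; 2 4 6])
sumover(x, :time)  # same as sum(x.data; dims = 2)
```

Whether the lookup really disappears in compiled code depends on the compiler’s constant propagation and inference, which is exactly the kind of thing the tests mentioned above are there to check.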


Hello, is there any update as of 2022? Has some “consensus” been reached on working with arrays with named dimensions and indices?
