Implementing an equivalent of python's xarray (labeled nd-arrays) in julia

One of the python packages I use the most is an excellent library called xarray, which is very convenient for handing multi-dimensional data. This package provides a DataArray object which associates a n-d array with named dimensions and coordinate data, and provides convenient methods for indexing, aligning, and broadcasting. For instance, if I have two data arrays temp(time,lat,lon) and meantemp(lat,lon), xarray allows me perform simple operations like np.sqrt(((temp-meantemp)**2).mean('time')). This expression involves automatic broadcasting, and a reduction across the named dimension. Also, the final sqrt call returns an DataArray object transparently. It’s a pretty magical tool, and very useful to people involved in atmospheric science and oceanography.

I have started a implementation in julia at https://github.com/nbren12/XArray.jl. Currently, DataArray inherits from AbstractArray, but I am not sure if this is a good idea. My main questions are

  1. Should the DataArray type inherit from AbstractArray? Or should I manually redefine all the functions like +,-,*,sqrt?
  2. If I don’t subclass from AbstractArray, how can I ensure that simple function calls like sqrt(temp) work?
  3. If I do subclass from AbstractArray, how do I define Base.similar(x::DataArray, dims) so that named coordinate information is passed along to the new object.

I would love your input, and would welcome anybody who is interested in contributing to this project. If there is already an equivalent package in julia please let me know.

3 Likes

Before you work too much on this, have a look at NamedArrays.jl and AxisArrays.jl.

Regarding your questions, you should definitely inherit from AbstractArray. As for defining similar, you just need to add a method for your new type which creates the array with the names.

BTW, please do not call this DataArray as this is already the name of a well-know package and type.

2 Likes

Thanks for the info. At a quick glance. AxisArray looks like what I want.
Thanks for pointing it out to me.

Just for kicks I might carry on with my project as a learning exercise. I
will change the name so it doesn’t conflict with anything.

It is true that the name AxisArrays.jl is not very descriptive / discoverable (though I don’t have a better description).

xarray is quite a popular library, at least in my field, so it might help
to link to it on the AxisArrays github page. That way a google search of
xarray julia would turn up something.

1 Like

That’s a great idea. I’ve definitely taken some inspiration from xarray. A PR would be most welcome!

I was trying out AxisArrays, and it looks like the output of sqrt(A::AxisArray) is a float array.

julia> sqrt(AxisArray(rand(5), Axis{:i}(collect(1:5)))) |> typeof
Array{Float64,1}

Is this supposed to happen?

Yes, at the moment we’ve not extended many functions to be axis aware. So it’s expected, but not necessarily ideal.

The best path forward here would probably be to specialize broadcast and make it axis-aware. Then any dot-broadcast function (like sqrt.(A)) will automatically be axis-aware.

1 Like

I agree. IMO, the best thing about xarray is it’s ability to broadcast arrays in a very smooth and intelligent fashion. I have started working on this in my little toy package, and it seems like it shouldn’t be too hard. My current solution is to pass the full inds::Tuple{Axis, Axis...} object around in Base.similar instead of turning it into a tuple of axis lengths, if that makes any sense to you. I have also implemented some of the easier broadcast! methods (e.g. addition by scalar).

Since this functionality is pretty useful to me, and is fun learning exercise besides, I would like to add some of it to AxisArrays.jl. Would you be interested in a pull request at some point?

That’d be great! A pull request that added broadcast specializations would be very welcome. I think you could lean on Base.broadcast! pretty heavily, and just specialize the creation of the destination array.

We’ve taken a very similar approach to similar, but to be honest I’ve not found that method to be very useful since it’s really the caller that must be axis-aware… in which case they often just construct the AxisArray directly. Note that if you omit the dimensions entirely from similar, then it can return an AxisArray with the same axes as the original array.

Most of AxisArray’s development dates back to a Julia before the indices() function. I wonder if we couldn’t shoe-horn axis information into the actual indices of the array. It’s a very different situation from OffsetArrays, since the axes are not constrained to be integer unit ranges… but we could have a special integer unit range that also carries an axis vector. It’d describe both the integer indexing as well as the special axis behaviors. I’m not sure how much it’d gain us… but it might be worth investigating.

1 Like

I went ahead a pull request with a first pass at this broacasting stuff https://github.com/JuliaArrays/AxisArrays.jl/pull/54.

I think your comment about indices and similar is relevant since that shows up the julia broadcast.jl file in a few places, but in my pull request, I didn’t try to integrate that deeply with the broadcast.jl code.

GitHub - JuliaDataCubes/YAXArrays.jl: Yet Another XArray-like Julia package seems to have similar goal than AxisArrays and NamedArrays

I have reached this thread in a search by “julia xarrays”, and I think that it can be useful to mention that the most advanced library implementing something like xarrays, seems to be:

Julia is very modular. So, this library is very basic and does not provide, for example, a means to load data from disk into a DimensionalData array.
In order to get the functionality that we expect to find, you can use the library mentioned in the previous post, which is based on DimensionaData:

And I also suggest another option, also based on DimensionalData:

I think that Rasters has more fetatures. I have used it to read a bunch of grib files. YAXArrays can provide a Cube data set, that can be used to optimize calculations in certain scenarios.
I think the two libraries should converge to integrate all their funcionalities into one single library.
There are also other libraries that provide xarray like capabilities up to a certain extent, but I think that the more advanced and completed one is DimensionalData and the libraries based on it.
My experience is based on the use of these libraries in meteorological sciences.
Anyway, there seems not to be any library providing something like dask arrays with off core and cluster computations. I hope this functionality will be provided in a future.
I hope anyone else can provide his critical remarks to what I say, so we can benefit from them.
Many people (like me) looks for the same functionality as XArrays + Dask, and I consder that it is a “killing” software, id est, I will use Python and not Julia, just because I need the functionality of XArrays + Dask, although Julia is more efficient in anything else.

I have to correct myself about the “killing software” and the two reasons geven to consider xarrays+dask as such.

The first reason are the out-of-core computations, id est, the ability of using datastructures (i.e., matrices) greater than ram memory.

This task can be accomplished in Julia (and in any software) by the use of swap space, that is managed by the operating system. And, could this management be more efficient than an out-of-core explicit computation? For some calculations (i.e., PCA), out of core computations need to read the same chunks of data from disk, several times, and this makes me think that it can be possible that the use of swap space can be more efficient.

The second reason are the cluster capabilities of xarrays+dask, but these are also in Julia. I have not used them so far, but it goes without saying that it is one of the strengths of Julia.

Hi @aasdelat actually the YAXArrays allow to perform computations in parallel and to upscale analysis using clusters:

check the section “Distributed Computation”

From my understanding YAXArrays is using DiskArrays.jl package to do the operations and parallelization, but it will be replaced later by GitHub - meggart/DiskArrayEngine.jl

Thanks, @dpabon, I’ll have a look. I find difficult to grasp the use of the indims and outdims. I have also to learn about the Distributed and ClusterManagers libraries.