What would be the best approach to handle large NetCDF sets?

Currently, we have Python scripts able to handle large NetCDF sets (100 GB to 1 TB). Thanks to Dask and multithreading, we were able to load and perform operations on chunks that fit in memory. However, it is too slow for us, and we would like to explore possibilities to increase the execution speed.

A possibility is to use Julia. So I wrote an example matching our code. It appears this code does not work with large datasets.

using Dates
using Rasters
using Statistics
import NCDatasets
using Dagger

using CairoMakie
CairoMakie.activate!(type = "png")

startDate = DateTime(1960, 1, 1)
endDate = DateTime(1961, 1, 1)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect

filenames::Vector{String} = []
for dt in datetimes
    push!(filenames, "./input/data/2m air temperature/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
end
series = RasterSeries(filenames, Ti(datetimes); lazy=true, dropband=false, missingval=NaN, duplicate_first=true)
ds = Rasters.combine(series)[:,:,1,:]

x = range(1, length(lookup(ds, Ti)))
y = mean(ds .* cos.(deg2rad.(lookup(ds, Y)))'; dims=(Y, X))
y = dropdims(y, dims = (Y, X))

fig = scatterlines(x, y)
display(fig)

GC.gc()

print("end")

If the time range is short, it works and displays the figure. However, for a time range of a year, it does not.

What approach would you suggest?
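For context, the reduction applied per time step above is just a cosine-latitude-weighted spatial mean. A plain-Julia sketch of that reduction on one synthetic slice (the X × Y shape and latitude values are made up for illustration; in the script above the latitudes come from lookup(ds, Y)):

```julia
using Statistics

# Hypothetical stand-in for one time slice of the raster (X × Y).
nx, ny = 4, 3
slice = rand(nx, ny)
lats = [40.0, 45.0, 50.0]            # one latitude per Y column (assumed)

# The per-time-step reduction: weight each column by cos(latitude),
# then average over the whole X × Y plane.
weighted_mean(a, lats) = mean(a .* cos.(deg2rad.(lats))')

y1 = weighted_mean(slice, lats)
```

Since each hourly file holds a single time step, this reduction only ever needs one slice in memory at a time, which is the property a lazy backend should be able to exploit.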

You could have a look at YAXArrays.jl.

This package allows you to handle larger-than-RAM files and apply user-defined functions along any dimension.
I will try to post a usage example tomorrow.


Yes, unfortunately combine is not yet lazy, or even very well optimised.

By chance I just made a PR to DiskArrays.jl to allow lazy concatenation, which will fix your problem to some extent. But mean will likely still not use optimal chunking patterns. There are solutions to that on the way in the coming year or so.

YAXArrays may be better with these huge datasets and is pretty well interoperable with Rasters.jl these days anyway.
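Until lazy concatenation lands, one workaround is to skip combine entirely and reduce file by file, since each hourly file fits in memory on its own. A minimal sketch of the streaming pattern, using synthetic in-memory slices as stand-ins (in the real script each slice would come from something like Raster(filename; lazy=true) and the latitudes from lookup(r, Y) - those names are the Rasters.jl API, the shapes and values here are invented):

```julia
using Statistics

# Synthetic stand-ins: one X × Y slice per "file" (assumed shapes).
slices = [rand(4, 3) for _ in 1:24]
lats = [40.0, 45.0, 50.0]
w = cos.(deg2rad.(lats))'            # 1 × Y row of latitude weights

# Reduce one time step at a time, so only a single slice is ever in
# memory; with Rasters.jl each iteration would read one NetCDF file.
series_mean = [mean(s .* w) for s in slices]
```

The resulting vector is exactly the time series the original script plots, but peak memory stays at one slice regardless of how long the time range is.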


Yes, I’m curious to see an example.

I tried to follow your example and set it into YAXArrays.jl:

using Statistics
using Dates
using YAXArrays
using CairoMakie

startDate = DateTime(1960, 1, 1)
endDate = DateTime(1961, 1, 1)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect


filenames::Vector{String} = []
for dt in datetimes
    push!(filenames, "./input/data/2m air temperature/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
end

cubes = Cubes.(filenames)
fullcube = concatenatecubes(cubes, Ti(datetimes))
# I am taking a small subset, because my dataset is 15000x15000 pixels large and this breaks my Julia process
sub = fullcube[1000:1100, 900:1000, :]

# Now we define an inner function that is applied along the axes
function meanweight(xout, xin, y)
    xout .= mean(xin .* cos.(deg2rad.(y))')
end

# I am too lazy to think about a proper deg2rad function but it needs to be defined:
deg2rad(x) = x

# Now we can apply the meanweight function along the dimensions that are specified

meancube = mapCube(meanweight, (sub, YAXArray((sub.Y,), sub.Y.val)), indims=(InDims("X","Y"), InDims("Y")), outdims=OutDims())

This only works when the X * Y dimension is small enough, because it needs to load the data for every time step into memory to compute this.
I am sure that we could make this a bit more efficient by not loading the whole data into memory at the same time.
But this approach should still work for an arbitrary number of time steps because we will load the time steps independently into memory.
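To see what mapCube expects from the inner function, you can also call meanweight directly on plain arrays: xin arrives as one X × Y slice (per InDims("X","Y")), y as the latitude vector along Y, and the scalar result is written into a zero-dimensional xout (per the empty OutDims()). A small self-contained check, with made-up sizes and using the real Base.deg2rad instead of the identity placeholder:

```julia
using Statistics

# Same inner function as above, applied to one slice at a time by mapCube.
function meanweight(xout, xin, y)
    xout .= mean(xin .* cos.(deg2rad.(y))')
end

xin = ones(4, 3)        # one X × Y slice, as passed by InDims("X","Y")
y = [0.0, 0.0, 0.0]     # latitude values along Y
xout = fill(0.0)        # zero-dimensional output, as with OutDims()

meanweight(xout, xin, y)
xout[]                  # → 1.0, since cos(0) = 1 and the slice is all ones
```

Writing into xout with .= (rather than reassigning it) is what lets mapCube collect the result from the preallocated output buffer.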

Which version of YAXArrays.jl do you use? It seems mine does not work:

Status `~/.julia/environments/v1.9/Project.toml`
  [c21b50f5] YAXArrays v0.5.0

The line cubes = Cubes.(filenames) does not work.

Exception has occurred: MethodError
MethodError: objects of type Module are not callable

Oops, sorry, that is a typo: it is supposed to be Cube.(filenames).

Yes, it’s working with this correction!

However, the next line now puzzles me. I get this error ^^’ Internal Error: All cubes must have the same axes, cube number 2 does not match. I checked the dims: they have the same size, and it works with Rasters, so I am not sure the problem comes from the data.

This might be a YAXArray bug. Could you show the result of DD.dims(cubes[1]) and DD.dims(cubes[2])?

Could you upload two example datasets for which the concatenation fails? This would make it much easier to debug.

To me they look similar.

Here is the textual output:

You can download the files here:
https://drive.google.com/drive/folders/1b-745niN1uG7OMdqoQypREYpRNR-Am38?usp=drive_link
The first two do not concatenate with each other.

For some reason the data is opened with a singleton time dimension, so when we try to concatenate the data it doesn’t work, because the time dimension is different for every cube.
Therefore we index into every cube to select the first slice along the time dimension, which enables the concatenation.
The following script works with your data:

using Statistics
using Dates
using YAXArrays
using NetCDF
using DimensionalData
#using CairoMakie

startDate = DateTime(1960, 1, 1)
endDate = DateTime(1960, 1, 1, 6)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect


filenames::Vector{String} = readdir("data_temperature/", join=true)
#for dt in datetimes
#    push!(filenames, "temperature_data/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
#end

cubes = Cube.(filenames)
# Drop the singleton time dimension of each cube before concatenating
fullcube = concatenatecubes(getindex.(cubes, Time=1), Ti(datetimes))

# Now we define an inner function that is applied along the axes
function meanweight(xout, xin, y)
    xout .= mean(xin .* cos.(deg2rad.(y))')
end

# I am too lazy to think about a proper deg2rad function but it needs to be defined:
deg2rad(x) = x

# Now we can apply the meanweight function along the dimensions that are specified
meancube = mapCube(meanweight, (fullcube, YAXArray((fullcube.y,), fullcube.y.val)), indims=(InDims("X","Y"), InDims("Y")), outdims=OutDims())