Currently, we have Python scripts that can handle large NetCDF datasets (100 GB to 1 TB). Thanks to Dask and multithreading, we were able to load the data and perform operations on chunks that fit in memory. However, it is too slow for us, and we would like to explore ways to increase the execution speed.
One possibility is to use Julia, so I wrote an example matching our code. It appears this code cannot handle large datasets.
using Dates
using Rasters
using Statistics
import NCDatasets
using Dagger
using CairoMakie
CairoMakie.activate!(type = "png")
startDate = DateTime(1960, 1, 1)
endDate = DateTime(1961, 1, 1)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect
filenames::Vector{String} = []
for dt in datetimes
push!(filenames, "./input/data/2m air temperature/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
end
# Build a lazy RasterSeries of the hourly files along the time dimension
series = RasterSeries(filenames, Ti(datetimes); lazy=true, dropband=false, missingval=NaN, duplicate_first=true)
# Combine into one cube and drop the singleton band dimension
ds = Rasters.combine(series)[:, :, 1, :]
x = range(1, length(lookup(ds, Ti)))
# Weight by cos(Y) and average over the spatial dimensions, one value per time step
y = mean(ds .* cos.(deg2rad.(lookup(ds, Y)))'; dims=(Y, X))
y = dropdims(y, dims=(Y, X))
fig = scatterlines(x, y)
display(fig)
GC.gc()
print("end")
If the time range is short, it works and displays the figure. However, for a time range of a year, it does not work.
This package allows you to handle larger-than-RAM files and apply user-defined functions along any dimension.
I will try to post a usage example tomorrow.
Yes, unfortunately combine is not yet lazy, or even very well optimised.
By chance I just made a PR to DiskArrays.jl to allow lazy concatenation, which will fix your problem to some extent. But mean will likely still not use optimum chunking patterns. There are solutions to that on the way in the coming year or so.
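In the meantime, a rough sketch of a possible workaround (untested, and assuming each hourly file comfortably fits in memory on its own) is to skip combine and reduce each Raster in the lazy series to a single weighted mean:
# Reduce each element of the series separately, so the full (X, Y, Band, Ti) cube
# is never materialised at once. `series` is the lazy RasterSeries built above.
weights(r) = cos.(deg2rad.(lookup(r, Y)))'
y = [mean(read(r)[:, :, 1] .* weights(r)) for r in series]
fig = scatterlines(1:length(y), y)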
YAXArrays may be better with these huge datasets and is pretty well interoperable with Rasters.jl these days anyway.
I tried to follow your example and translate it into YAXArrays.jl:
using Statistics
using Dates
using YAXArrays
using CairoMakie
startDate = DateTime(1960, 1, 1)
endDate = DateTime(1961, 1, 1)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect
filenames::Vector{String} = []
for dt in datetimes
push!(filenames, "./input/data/2m air temperature/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
end
cubes = Cube.(filenames)
fullcube = concatenatecubes(cubes, Ti(datetimes))
# I am taking a small subset, because my dataset is 15000x15000 pixels and the full size breaks my Julia process
sub = fullcube[1000:1100, 900:1000, :]
# Now we define an inner function that is applied along the axes
function meanweight(xout, xin, y)
xout .= mean(xin .* cos.(deg2rad.(y))')
end
# Define a proper deg2rad (degrees to radians); Base's deg2rad could be used directly as well:
deg2rad(x) = x * π / 180
# Now we can apply the meanweight function along the dimensions that are specified
meancube = mapCube(meanweight, (sub, YAXArray((sub.Y,), sub.Y.val)), indims=(InDims("X","Y"), InDims("Y")), outdims=OutDims())
This only works when the X * Y extent is small enough, because the data for every time step has to be loaded into memory for this computation.
I am sure we could make this a bit more efficient by not loading all of the data into memory at the same time.
But this approach should still work for an arbitrary number of time steps, because the time steps are loaded into memory independently of each other.
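As a small follow-up (my own untested sketch, not code from the thread): the result of mapCube should carry one value per time step, so it can be plotted the same way as in the Rasters version:
# meancube has only the time axis left; materialise it and plot one point per time step
ts = Array(meancube)
fig = scatterlines(1:length(ts), ts)
display(fig)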
However, the next line (the concatenatecubes call) now puzzles me. I get this error: Internal Error: All cubes must have the same axes, cube number 2 does not match. I checked the dims: they have the same size, and it works with Rasters, so I am not sure the problem comes from the data.
For some reason the data is opened with a singleton time dimension, so when we try to concatenate the cubes it doesn't work, because that time dimension is different for every cube.
Therefore we can index into every cube to select the first (and only) entry along the time dimension, which drops it and enables the concatenation.
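Concretely (an illustrative snippet of my own, assuming the dimension in the files is called Time):
# Selecting Time=1 drops the singleton time axis, leaving only the shared X/Y axes
cube2d = cubes[1][Time=1]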
The following script works with your data:
using Statistics
using Dates
using YAXArrays
using NetCDF
using DimensionalData
#using CairoMakie
startDate = DateTime(1960, 1, 1)
endDate = DateTime(1960, 1, 1, 6)
timeStep = Dates.Hour(1)
datetimes = startDate:timeStep:(endDate-timeStep) |> collect
filenames::Vector{String} = readdir("data_temperature/", join=true)
#for dt in datetimes
# push!(filenames, "temperature_data/2m air temperature_0_$(Dates.format(dt, "yyyymmddTHHMMSS")).nc")
#end
cubes = Cube.(filenames)
fullcube = concatenatecubes(getindex.(cubes, Time=1), Ti(datetimes))
# Now we define an inner function that is applied along the axes
function meanweight(xout, xin, y)
xout .= mean(xin .* cos.(deg2rad.(y))')
end
# Define a proper deg2rad (degrees to radians); Base's deg2rad could be used directly as well:
deg2rad(x) = x * π / 180
# Now we can apply the meanweight function along the dimensions that are specified
meancube = mapCube(meanweight, (fullcube, YAXArray((fullcube.y,), fullcube.y.val)), indims=(InDims("X","Y"), InDims("Y")), outdims=OutDims())
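Since the original motivation was speed, one more note as a hedged addition of mine (based on how YAXArrays documents distributed computation, so please verify against the current docs): if worker processes are added before calling mapCube, the per-time-step work can be spread across them:
using Distributed
addprocs(4)                       # the number of workers is only an example
@everywhere using YAXArrays, NetCDF, Statistics
# The inner function has to be known on all workers as well
@everywhere function meanweight(xout, xin, y)
    xout .= mean(xin .* cos.(deg2rad.(y))')
end
meancube = mapCube(meanweight, (fullcube, YAXArray((fullcube.y,), fullcube.y.val)), indims=(InDims("X","Y"), InDims("Y")), outdims=OutDims())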