Rasters: can I select certain time steps of a lazy rasterseries before or while reading it into memory?

aasdelat · June 27, 2024, 10:33pm

I have a bunch of grib files, each covering a different time period, but the same spatial extent. I have opened them as a RasterSeries (library Rasters.jl) with lazy=true, so that the data are not loaded into memory.

dataset = RasterSeries(input_files, :valid_time, child=RasterStack, lazy=true)

If I try to load them into memory by means of the read function, I get an error stating that I do not have enough memory.

My first question is why does it not use the swap file memory. When I use a Matrix object that does not fit into memory and I set a big enough swap file, I can see that the used swap space grows and my program works with that matrix. But it seems that a RasterSeries object does not use (cannnot use?) swap space. Is this a limitation that can/cannot be overcame?

My second question is if there is a way to select some time steps before to read the data into memory. The time dimension of the RasterSeries seems to be an integer index with a value per file. Id est, it refers to a specific file, and not to a time step. So, how could I select specific datetimes before to read the data (or while reading them) into memory?

Thanks in advance.

Raf · June 28, 2024, 7:44am

Why do you need to load them to swap?

But yes we need a keyword to ignore the memory check or we need to check swap space too. The memory check is a new thing and may need tweaking.

I’m not totally sure what you mean in the second question. series[1:4] is the most obvious way, does that not work? For selecting with DateTime this is specific to you files and code, so we need a full MWE to know.

aasdelat · July 19, 2024, 3:20pm

Thank you, @Raf , I will attempt to clarify.

Because I do not have enough RAM. I have a SSD disk, so the time penalty of using swap space can be, in some occasions, affordable. And I need to see if it is affordable in my case (with the data that I have in hands).

Should I open an issue at Github asking for this?, could it take a long to be implemented?

I think that perhaps I am not understanding what a RasterSeries is. Quoting myself:

I thought that the result of this would be a data structure with the spatial dimensions plus the time dimension (valid_time), but it is not. Instead, the RasterSeries that I get is something line a tuple, each element of the tuple corresponding to each of the files. So, in the RasterSeries, I have to select data by providing, first, the index of the fiile in the list of input files, and then, any other values for the spatial and temporal dimensions inside the file.

So, my second question can be modified to the following one: How can I read my files so that I get the time dimension as a result of the concatenation of the time dimension of the input files?

Thank you in advance, @Raf

Raf · July 19, 2024, 3:56pm

Yes julia memory checks may not include swap. I’m not sure if that’s a bug. Others had problems with this too so I put some work into it and you can now (on the latest Rasters.jl version) set checkmem=false in some functions and constructors. Or maybe in your case just set it overall with Rasters.checkmem!(false) so it works over lazy RasterSeries too.

I thought that the result of this would be a data structure with the spatial dimensions plus the time dimension (valid_time), but it is not

Yes that’s what a RasterSeries is. Its like a Raster of Rasters/RasterStacks. There is a combine function that’s meant to combine it into a single Raster/RasterStack but its currently slow and not lazy.

I’ve done the work in DiskArrays.jl for it to be lazy (DiskArrays.ConcatDiskArray), but haven’t yet upstreamed it to Rasters.jl. The idea from the start was always to be able to convert a RasterSeries into a single Raster or RastersStack lazily, and the reverse, using slice and combine.

Finishing slice/combine to be both fast and lazy has been on my list for a long time but I haven’t gotten to it yet.

aasdelat · July 19, 2024, 4:06pm

Thank you, @Raf . I will try the checkmem option.

Regarding the RasterSeries, my provisional solution will be to read every file in a loop while concatenating the time dimension in a new and large dimension object, and the data in a large Matrix{Float64,2} object (and I think it will make use of swap space).

Thank you , @Raf !.

Raf · July 19, 2024, 4:16pm

You should be able to do cat(rasters...; dims=Ti(thenewlookup)) too?

aasdelat · July 20, 2024, 8:38pm

Ok, thanks for the tip!
I was about to make a complicated loop, but thanks to your suggestiion, it will be much more simple.
I have made the following:

# Concatenate all the rasterstacks int he rasterseries, into a new rasterstack
# First, create the time dimension with all the time steps
new_time_dim = cat(
    [ lookup(dataset[i], :Ti) for i in 1:length(dataset)]...,
    dims=1
    )#cat
# Now, concatenate the variable var_name
new_ds_var = cat(
    [ dataset[i][var_name] for i in 1:length(dataset)]...,
    dims=Ti(new_time_dim)
    )#cat
dataset = RasterStack(
    ( dataset[1][lon_name], dataset[1][lat_name], new_ds_var ),
    name = (lon_name, lat_name, var_name)
    )#RasterStack

Topic		Replies	Views
What would be the best approach to handle large NetCDF sets? Geo multithreading , rasters	10	709	August 2, 2023
Memory Error Performance	9	1200	October 24, 2019
ANN: LazyStack, SliceMap, StarSlice Package Announcements	1	635	March 17, 2020
Larger than memory table format with lazy reads? General Usage array , dataframes , juliadb	5	1200	August 19, 2020
ANN: Feather.jl v0.4.0 (lazy edition) Data	2	1024	August 29, 2018

Rasters: can I select certain time steps of a lazy rasterseries before or while reading it into memory?

Related topics