NCDatasets: performance issues deriving fast Minimum Maximum value as too slow

:turtle: :turtle: :turtle: :turtle:

Finding minimum and maximum of a NetCDF files due to missing values is painfully slow. It takes me 2 minutes to perform this simple task:

"NameOutput="SoilWater"
Output_NCDatasets = NCDatasets.NCDataset(Path)
		
Data = Output_NCDatasets[NameOutput]
Pmin, Pmax = extrema(x for x ∈ skipmissing(Data) if !isnan(x))
println(Pmin,"," , Pmax)

Where the OUTPUT is:

OUTPUT (291 × 313 × 31)
    Datatype:    Union{Missing, Float32} (Float32)
    Dimensions:  lon × lat × time
    Attributes:
     _FillValue           = NaN

Many thanks for any help you could provide to speedup the process? :smiley:

That’s not a very large array, so fully load it into memory first by doing Array(Data), and then do skipmissing on the result.

I can’t know exactly because you don’t provide the file.

But the problem is likely actually that NCDatasets.jl doesn’t properly implement the DiskArrays.jl implementation for CFVariable so it is calling the netcdf c library for literally every single pixel, rather than loading larger chunks. (@Alexander-Barth this is an example where making AbstractVariable a DiskArray has direct benefits to users - iteration and generators are chunked)

Either way for data this size, just load it into memory before use, it will always be faster.

One other way to speed it up immediately is to use Rasters.jl, which should be fast because it just loads to memory - but also will be faster from disk (via lazy=true) because it implements DiskArrays.

rast = Raster(Path; name=:SoilWater)
Pmin, Pmax = extrema(x for x ∈ skipmissing(rast) if !isnan(x))

(In a few weeks with the CF standars PR it will be even faster when you can intentionally not convert to missing and use the native missing value in skipmissing, which turned out to be much faster than what Base does with missing)

3 Likes

Thanks for providing a solution. Your solution solves the issue and the code is now running very fast thanks to converting Data into Array.

Data = Array(Data)

I wrote a code to animate the output of my hydrological model by moving a slider. When I run in Jupyter notebook in Visual Studio Code the plots open in GLMakie but when I run it in REPL it opens as a png. The question how can I force it to open in Makie when I run it in Visual Studio Code when running it in the REPL?


	function HEATMAP_TIME(;Path=Path, NameOutput="q_land", Layer=1)
		GLMakie.activate!
		Makie.inline!(false)  # Make sure to inline plots into Documenter output!
		
		Output_NCDatasets = NCDatasets.NCDataset(Path)
		
		Data = Output_NCDatasets[NameOutput]

		Data = Array(Data)
	
		Dimensions = length(size(Data))
	
		if Dimensions == 3
         N_Lon  = size(Data)[1]
         N_Lat  = size(Data)[2]
         N_Time = size(Data)[3]
	
		elseif Dimensions == 4
         N_Lon  = size(Data)[1]
         N_Lat  = size(Data)[2]
         N_Time = size(Data)[4]
		end

		Pmin, Pmax = extrema(x for x ∈ skipmissing(Data) if !isnan(x))
		@show Pmin, Pmax

	
		function DATA_3D_2_2D(Data; iTime=iTime, Dimensions=Dimensions, Layer=Layer)
			if Dimensions == 4
				return Data[:,:, Layer, iTime]
			elseif Dimensions == 3
				return Data[:,:, iTime]
			end
		end
		
		
		Fig = Figure(size=(Width, Height * 4.0))

		Ax_1 = Axis(Fig[1, 1],  title=NameOutput, xlabelsize=xlabelSize, ylabelsize=xlabelSize, xticksize=xticksize, xgridvisible=xgridvisible, ygridvisible=xgridvisible)
		
		sg = SliderGrid(Fig[2, 1],
		(label="iTime", range=1:1:N_Time, startvalue=1),
		width=550, tellheight=true)
		
		iTime = sg.sliders[1].value
		
		Data_Time = lift((iTime) -> DATA_3D_2_2D(Data; iTime=iTime, Dimensions), iTime)
	
		Data_Plot = heatmap!(Ax_1, 1:N_Lon, 1:N_Lat, Data_Time, colorrange=(Pmin, Pmax), colormap =:hawaii50)
	
		Colorbar(Fig[1, 2], Data_Plot; label=NameOutput, width=20, ticks = Pmin:(Pmax-Pmin)/5:Pmax)
	
		Fig
	end

	HEATMAP_TIME(;Path=Path_Output, NameOutput="vwc")

Not sure about Makie, I don’t use VSCode - maybe ask a separate question about that people wont find it here.

1 Like

:turtle: :turtle: :turtle: :turtle:

Thanks for providing a solution. Nevertheless I have the same problem as finding minimum and maximum is painfully slow when i am in jupyter notebook in Visual Studio code. It takes me 2 minutes to perform this simple task:

Pmin, Pmax = extrema(x for x ∈ skipmissing(Output) if !isnan(x))
println(Pmin,"," , Pmax)

Where the OUTPUT is:

OUTPUT (291 × 313 × 31)
    Datatype:    Union{Missing, Float32} (Float32)
    Dimensions:  lon × lat × time
    Attributes:
     _FillValue           = NaN

Many thanks for any help you could provide to speedup the process? :smiley:

@JosephPollacco
That shouldn’t take long at all. If it’s fast when you run it in the REPL (on the same machine) then the problem is in Jupyter notebook somehow, it that’s the case I suggest you open another thread. If it’s also slow in the REPL, then the problem is in the machine, perhaps the memory is full or it’s busy…etc

My money is on it’s a machine problem.

Thanks for your help.

I have a fast machine so the problem is somwhere else. As you recommended I tested the code by running in REPL and the speed of execution is as slow as when I run the code in Jupyter notebook.

I will openup another tread to see if we can get the statistics of the OUTPUT by using directly NCDatasets.

Thanks for your help

It runs almost instantaneously (< 0.1 seconds) for me, even in global scope:

julia> Output = Array{Union{Missing,Float32}}(rand(Float32, 291,313,31));

julia> @time Pmin, Pmax = extrema(x for x ∈ skipmissing(Output) if !isnan(x))
  0.053098 seconds (45.14 k allocations: 3.076 MiB, 57.64% compilation time)
(5.9604645f-8, 0.99999994f0)

Can you give a minimal working example (MWE) of code that reproduces your problem? See also Please read: make it easier to help you

PS. I’ve moved this to a new thread, since it is unrelated to the problem in the original thread. Please don’t “necropost” to ancient threads.

Thanks.
The problem has been solved

This is by transforming
Data = Array(Data)

Thanks,
Joseph

Not sure, whether this would solve the GLMakie issue in the REPL but it seems as if the GLMakie.activate! misses the brackets and is therefore not called.
It might be that in the Jupyter notebook GLMakie is already activated and therefore it doesn’t matter, but for the REPL it does.

1 Like

Have you tried to call

makie_window = display(GLMakie.Screen(), Fig)

I normally use it when I want to open multiple Makie window. If you are running a script and want to block it from finishing until the makie window is closed you can add.

wait(makie_window)
1 Like

Thanks Fliks for solving the issue. when opening with include. But does not work when **Running without debugging **

 include(raw"D:\MAIN\MODELS\ViewingOutput.jl")

a

		GLMakie.activate!()
		Makie.inline!(false)  # Make sure to inline plots into Documenter output!

The solution below works for all CASES

This is the best solution for plots with Sliders and works in Visual Studio Code for all scenarios:

  • Opening with include,
  • Running without debugging

Thanks for this robust solution