Help me beat my pythonist friend's code. Speeding up data reading with simple reduction from NetCDF file

Maybe you could tell your friend to fix their Python implementation too, otherwise it’ll be bad sport :wink:

1 Like

Isn’t maximum(M; dims=(1,2))[1] just the same as maximum(M) when M is a matrix?
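
A quick check in the REPL (with a random matrix) suggests it is:

julia> M = rand(3, 3);

julia> maximum(M; dims=(1,2))[1] == maximum(M)
true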

And the quoted expression looks like it calls for @views and some mapping that avoids intermediate arrays.
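
For example, something like this avoids allocating the intermediate speed array entirely (a sketch; u and v stand for already-loaded velocity arrays):

max_speed_t = mapreduce((ui, vi) -> sqrt(ui^2 + vi^2), max, u, v)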

Actually, ds["USFC"] is a custom array defined in NCDatasets. Only when you index it (e.g. ds["USFC"][:,:,t]), the data gets loaded into memory. By itself, ds[“USFC”] does not have copy of the data in memory.
Thanks for pointing out the issue with sqrt!

              Python-xarray   Original NCDatasets   Gael’s (NCDatasets)   NCDatasets with loops   Fabian’s (NetCDF.jl)
unchunked     8.77 s          8.34 s                6.64 s                6.20 s                  8.13 s
chunked       5.61 s          4.65 s                3.27 s                2.65 s                  3.15 s

I ran the benchmark on my system (i9-10900X CPU, Samsung NVMe SSD), and the original
NCDatasets code by @aramirezreyes was already a bit faster than the Python-xarray code on my machine.

When you use an explicit loop, you can reduce the time even more (in the end we are about 30% and 50% faster than Python-xarray for unchunked and chunked arrays, respectively).

I ran the test with 1 CPU, as I think xarray uses only 1 CPU.

NCDatasets with loops

using NCDatasets

# Note: sqrt should be taken on the maximum
# But keep this choice for consistency with other benchmarks
function get_max_sp_ncdatasets3(fname)
    NCDataset(fname) do ds
        ncu = ds["USFC"];
        ncv = ds["VSFC"];
        sz = size(ncu)
        u = zeros(Float32,sz[1],sz[2]);
        v = zeros(Float32,sz[1],sz[2]);
        max_speed = zeros(Float32,sz[3])
        speed = zeros(Float32,sz[1],sz[2]);

        @inbounds for t = 1:sz[3]
            NCDatasets.load!(ncu.var,u,:,:,t)
            NCDatasets.load!(ncv.var,v,:,:,t)
            @. speed = sqrt(u^2 + v^2)
            # this avoids the array speed but it is not faster
            #max_speed[t] = maximum(uv -> sqrt(uv[1]^2 + uv[2]^2),zip(u,v))
            max_speed[t] = maximum(speed)
        end
        return max_speed
    end
end

file_unchunked = "./speeds.nc"
file_chunked = "./speeds_chunked.nc"

max_speed3 = @time get_max_sp_ncdatasets3(file_unchunked);
max_speed3 = @time get_max_sp_ncdatasets3(file_chunked);

Note to self: In the future, it would be nice to have a non-allocating version of load! for a CFVariable (https://github.com/Alexander-Barth/NCDatasets.jl/issues/161).

Edit: I added Fabian’s results to the table.

3 Likes

Isn’t it still worth moving the sqrt out of the maximum?
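
Since sqrt is monotonically increasing, the maximum commutes with it; in the loop above that would look something like (a sketch):

@. speed = u^2 + v^2                 # squared speed, no per-element sqrt
max_speed[t] = sqrt(maximum(speed))  # a single sqrt at the end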

1 Like

Could you please add my short NetCDF.jl snippet to your benchmark? For me it is approx. 2x faster than your NCDatasets loop for the chunked file and has much simpler code.

Edit: And that exact syntax+speed would work for NCDatasets if we manage to sort out https://github.com/Alexander-Barth/NCDatasets.jl/issues/79

Sure, I added it to the table (with NetCDF v0.11.4). In fact, there have been some improvements recently
(https://github.com/CliMA/TurbulenceConvection.jl/issues/723) in NCDatasets, and maybe you are not using the latest version or something else is going on. I ran your code 3 times and took the fastest result.

In any case, the syntax of NetCDF.jl is really nice and it would be great to use DiskArrays in NCDatasets. If I remember correctly, I had these issues:

  • arrays that can be resized, as some dimensions in NetCDF can be “unlimited”
  • arrays with elements of variable size (like NetCDF strings, vlen-arrays and compound types): sizeof(Int) works but sizeof(String) does not (see the snippet below), which makes computing the size of the cache a bit more difficult
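
For example:

julia> sizeof(Int)
8

julia> sizeof(String)
ERROR: Type String does not have a definite size.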

Maybe they are supported by now in DiskArrays, I don’t know. If not, would a PR to DiskArrays be in scope?

This is interesting. Thanks for testing. I am sure the SSD cannot hurt.
The fact that all the approaches are in the same ballpark both in my case and in yours, except for the zarr+xarray one, is interesting though.
I think that is the only one that does not use a netcdf4 library binary itself. I wonder if I will see an improvement if I manage to build NetCDF and NCDatasets against the optimized libraries on the cluster instead of the artifact.

Another thing: when I run the same benchmark but with two different files, the Julia approaches take a good 4-5 minutes on a fresh session while the Python ones take less than one minute (I just used two files and one more variable apart from speed). So it is averaging about 1 minute per variable, even after swapping the sqrt and the maximum. Not doing anything fancy at all. These files are chunked and differ from the ones I shared only in that they have more variables, but each variable is the same size as in the original post.

Something seems to be working very badly with the Julia approaches at NERSC. I will try to do some digging, but probably on the weekend.

2 Likes

I don’t think allocations matter here; most time is spent in netcdf_c, and I am on NCDatasets 0.11.9. Are you sure you did not include compilation times? I have put together a small benchmark where I first run every approach on a small 2x2x2400 NetCDF file to get compilation out of the way, and then time it on the large array. The code is here: https://github.com/meggart/NetCDF_Benchmarks . I ran every approach 10 times from a fresh Julia session (see run_benchmarks.sh) and get the following consistent timings for the chunked NetCDF:

  • NCDatasets loop: 18.2 s
  • NCDatasets Gael’s broadcast: 20.1 s
  • NetCDF.jl backed by DiskArrays: 13.2 s

I did not download the unchunked file yet but will do so and run the benchmark there as well. In general, I am quite unhappy with the version that was marked as the solution, because it simply does not generalise to different chunkings. Suppose someone wants to compute the maximum for each time series on the chunked file; then the NetCDF.jl version is:

@time NetCDF.open("/home/fgans/Downloads/speeds_chunked.nc") do f
           usfc = f["USFC"]
           vsfc = f["VSFC"]
           ws = sqrt.(usfc.^2 .+ vsfc.^2)

           maximum(ws,dims=(3,))
       end

and has the same timing (approx. 13 s on my machine). However, if I apply @gaelforget’s approach:

import NCDatasets as ncd
filename = "/home/fgans/Downloads/speeds_chunked.nc"
ds = ncd.Dataset(filename)
# ds["USFC"][x,y,:] is a vector, so a plain maximum over it gives the max over time
get_max_sp_ncdatasets(x::Int, y::Int) =
    maximum(sqrt.(ds["USFC"][x,y,:].^2 .+ ds["VSFC"][x,y,:].^2))

@time get_max_sp_ncdatasets.(1:512, (1:512)')

this took 1 h and 8 min on my computer because, as I explained above, it implements a mapslices-like data access and simply ignores the internal chunking of the dataset.
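
For reference, a chunk-friendly way to get the same per-pixel maximum with NCDatasets would be to loop over whole time slices and keep a running elementwise maximum, along the lines of the loop version earlier in the thread (a sketch; the function name is mine):

using NCDatasets

function get_max_sp_per_pixel(fname)
    NCDataset(fname) do ds
        ncu = ds["USFC"]
        ncv = ds["VSFC"]
        sz = size(ncu)
        u = zeros(Float32, sz[1], sz[2])
        v = zeros(Float32, sz[1], sz[2])
        max_speed = fill(Float32(-Inf), sz[1], sz[2])
        for t = 1:sz[3]
            # read one full time slice at a time (matches the chunk layout)
            NCDatasets.load!(ncu.var, u, :, :, t)
            NCDatasets.load!(ncv.var, v, :, :, t)
            # accumulate the elementwise maximum of the wind speed
            @. max_speed = max(max_speed, sqrt(u^2 + v^2))
        end
        return max_speed
    end
end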

Sorry for seeming overly pedantic here, but many people might look at this thread for reference, and it is important to highlight solutions that have a low memory footprint and are generally fast, not only fast for this special case.

2 Likes

I don’t mind the pedantry; please do continue the conversation, and I will try to run the benchmarks on the computer at hand as well. However, I fear that there is something else going on with how NetCDF_jll interacts with the cluster I use. No matter what, I see consistently 4x slower times for Julia than for Python/xarray. So I fear that my timings may be misleading for the discussion that is now happening.

I have two questions though.

  1. Shouldn’t precompilation happen at loading, or even at Pkg.add, nowadays?
  2. Are there no issues when repeating the benchmark for the same files many times in the same computer session? I think I see faster times the second time I run every case. I am way out of my depth here.

I don’t want to move this thread too far off-topic, but I think the second point is solved, and for the first there was simply some disagreement about whether setindex! should automatically grow an array along the record dimension or throw an error. We should resume the discussion in the issue and maybe organize some kind of triage call to solve this.

1 Like

I get a 404 error with your link. But in any case, here is how I ran your code:

using NetCDF
@time NetCDF.open("speeds.nc") do f
    usfc = f["USFC"]
    vsfc = f["VSFC"]
    ws = sqrt.(usfc.^2 .+ vsfc.^2)

    maximum(ws, dims=(1,2))
end

and here is my Julia session:

julia> include("/home/abarth/tmp/Benchmark/test_netcdf.jl");                                                                                                          
 10.651196 seconds (9.82 M allocations: 7.560 GiB, 7.00% gc time, 23.00% compilation time)                                                                            
                                                                                                                                                                      
julia> include("/home/abarth/tmp/Benchmark/test_netcdf.jl");                                                                                                          
  8.117245 seconds (64.62 k allocations: 7.035 GiB, 4.52% gc time, 0.31% compilation time)                                                                            
                                                                                                                                                                      
julia> include("/home/abarth/tmp/Benchmark/test_netcdf.jl");                                                                                                          
  8.132503 seconds (62.08 k allocations: 7.035 GiB, 4.53% gc time, 0.37% compilation time)                                                                            
                                                                                                                                                                      
julia> include("/home/abarth/tmp/Benchmark/test_netcdf.jl");                                                                                                          
  8.191646 seconds (62.08 k allocations: 7.035 GiB, 4.54% gc time, 0.35% compilation time)                                                                            
                                                                                                                                                                      
julia> VERSION                                                                                                                                                        
v"1.7.0" 

(@v1.7) pkg> status --manifest NetCDF DiskArrays                                                                                                                      
      Status `~/.julia/environments/v1.7/Manifest.toml`                                                                                                               
  [3c3547ce] DiskArrays v0.3.1                                                                                                                                        
  [30363a11] NetCDF v0.11.4                                                                                                                                           

Wrapping the whole code inside a function did not make a difference.

A couple of side questions, to educate myself and maybe complement the thread:

  • Could someone point to the place where chunking is defined in the NetCDF or CF conventions docs?
  • How do we generate such files with Julia?

I have a recollection of folks on the Python side discussing methods to create files in a way that optimizes them for use with xarray & dask specifically (or something like that). But I could use a refresher, as I have not followed these developments closely. And maybe I am just confusing two different things.

Maybe this page on chunking is useful:

https://www.unidata.ucar.edu/software/netcdf/workshops/2011/nc4chunking/WhatIsChunking.html

Here is how you can create a chunked file in NCDatasets:

using NCDatasets
NCDataset("foo.nc","c") do ds
  defVar(ds,"chunked_var",randn(10,10),("lon","lat"), chunksizes = [5,5])
end

You can query the storage mode with:

ds = NCDataset("foo.nc"); 
chunking(ds["chunked_var"])
# output (:chunked, [5, 5])

or with the good old ncdump: when you run ncdump -h -s foo.nc you will see the storage attributes (option -s):

	double chunked_var(lat, lon) ;
		chunked_var:_Storage = "chunked" ;
		chunked_var:_ChunkSizes = 5, 5 ;
		chunked_var:_Endianness = "little" ;

If you use compression with deflatelevel, then every chunk is compressed individually. So when you load a slice of your NetCDF file, only the corresponding chunks need to be decompressed (not the whole variable, which would be quite slow :grinning:).
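
For example, combining chunking and compression when creating a variable (a sketch, with a made-up file name):

using NCDatasets
NCDataset("foo_compressed.nc", "c") do ds
  # every 5x5 chunk is deflated individually (deflatelevel 0-9)
  defVar(ds, "compressed_var", randn(10, 10), ("lon", "lat"),
         chunksizes = [5, 5], deflatelevel = 5)
end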

The shell command nccopy can also convert existing NetCDF files and add chunking (option -c) and possibly compression (option -d). For example:

nccopy -c dimension1/100,dimension2/100 -d 5 input.nc output.nc

The chunking is handled in the HDF5 layer (a dependency of the NetCDF library) and is transparent to the user and to client applications. But it can improve performance if the chunking is adapted to the typical access pattern.

Here is a good discussion about how to choose the chunk size (sorry, not Julia and not NetCDF, but still good :-)).

Here is some additional information on how to control the cache (handled internally by NetCDF) of the loaded chunks.

4 Likes

I am not seeing an easy way in the docs for NetCDF or NCDatasets, but is it possible to instruct these packages not to use NetCDF_jll and instead link to the system libraries? The NERSC supercomputing center has Cray-optimized libraries, and I would like to know if using them instead of the jll would help. @Alexander-Barth @fabiangans

1 Like

This NetCDF.jl fork uses the netcdf lib used by GMT, which is normally a system library. It has been a while since I last used it, and I’m not even sure that it works on Unix, but it worked fine on Windows. See the deps.jl for the logic that would have to be adapted to not rely on the GMT link.

I added this to the documentation (also applicable to NetCDF.jl):


The NetCDF library libnetcdf.so is installed as an artifact via the package NetCDF_jll.
You can override which libnetcdf.so gets loaded through the Preferences package, as follows:

# install these packages if necessary
using Preferences, NetCDF_jll

set_preferences!(NetCDF_jll, "libnetcdf_path" => "/path/to/libnetcdf.so.xyz")

where /path/to/libnetcdf.so.xyz is the full path to the NetCDF library.
This will create a LocalPreferences.toml file in your top-level project with the following content:

[NetCDF_jll]
libnetcdf_path = "/path/to/libnetcdf.so.xyz"

However, the dependencies of the library version libnetcdf.so.xyz (in particular libcurl.so and libmbedtls.so) should be compatible with the dependencies of Julia (in the folder .../julia-x.y.z/lib/julia). On Linux, you can list the library dependencies with the shell command ldd, for example:

ldd /path/to/libnetcdf.so.xyz

The Preferences package is quite general and can be used to customize any artifact (not just the NetCDF library).

4 Likes

A bit late to the game here, but Zarr.jl does not use multithreading by default and xarray does. Performance can be matched when multithreading is enabled. See https://github.com/JuliaIO/Zarr.jl/issues/65#issuecomment-929108057.

2 Likes