Help me beat my pythonist friend's code. Speeding up data reading with simple reduction from NetCDF file

I work with atmospheric simulations and I use NetCDF files of the order of 80GB to 1TB.

I am always jelous at my friend from the office (@lsterzinger) that does everything super fast in his python with xarray, zarr, dask, and all his environment for out-of-core computations. Now, I know some attemps have been done at getting an xarray-like package to use but I had not used them.

I am asking your help increase the performance of my toy example.

  • Before starting please note that having reproducible times is not easy because apparently there is some caching so that if you re run with the same file it will be much faster the second time. Not talking about compilation time but about data caching. If anyone has tips I would be glad to hear them. Therefore, please don’t pay that much attention to the specific timings. The take home message is that in its best moments Julia takes around 5x the best timing of Python script.
  • Also note that the real dataset is much larger, don’t know the impact but having large files is an important part of the problem, so subsetting was needed to be able to share (performance worsens with large file).
  • In this case I don’t think using BenchmarkTools is correct, I think it will suffer from data caching so more than one run may not be useful.

I for this example I have three toy data sets that can be downloaded here WARNING: 15GB of data:
speeds.nc NetCDF with two 2D variables, not chuncked
direct link: https://ucdavis.box.com/shared/static/yvy3hyvjli45pjroycyxjxs7y7vm61uw.nc
speeds_chunked.nc NetCDF with two 2D variables, chunks of 256,256,400
direct link: https://ucdavis.box.com/shared/static/0cvcwld3w9govhv5ijaa1e970c69x0xd.nc
speeds.zarr zarr with two 2D variables, chunks of 256,256,400
direct link: https://ucdavis.box.com/shared/static/4g9gtnfg3qsnx4z2f67yf8f76ikkdynh.tar

What I do (very inefficient):

using NCDatasets 

function get_max_sp_ncdatasets(file)
    usfc = Array{Float32,3}(undef,512,512,2400)
    vsfc = Array{Float32,3}(undef,512,512,2400)
    maxsp = Dataset(file) do ds
        load!(ds["USFC"].var, usfc, :,:,:)
        load!(ds["VSFC"].var,vsfc,:,:,:)
        mapslices(maximum,sqrt.(usfc.^2 .+ vsfc.^2), dims = (1,2))
    end
    return maxsp[:]
end

file_unchunked = "./speeds.nc"
file_zarr = "./speeds.zarr"
file_chunked = "./speeds_chunked.nc"

@time get_max_sp_ncdatasets(file_unchunked)
#57.155043 seconds (3.79 M allocations: 7.236 GiB, 0.59% gc time, 4.82% compilation time)
@time get_max_sp_ncdatasets(file_chunked)
 #16.300895 seconds (27.90 k allocations: 7.033 GiB, 0.33% gc time)

May things are wrong with this. In specific I need to load the whole data in memory and it is very slow. Let’s throw it out the window but note that chunking had a very good impact.

What my pythonist friend does:

import xarray as xr
import zarr as zr

def get_max_sp_xarray(file, zarr=False):
    if zarr:
        ds = xr.open_zarr(file)
    else:
        ds = xr.open_dataset(file)
    usfc = ds.USFC # Does not compute anything. Instantaneous
    vsfc = ds.VSFC # Does not compute anything. Instantaneous
    sp = (usfc**2 + vsfc**2)**(0.5) # Does not compute anything. Instantaneous
    return sp.max(dim=('x','y')).compute()


file_unchunked = './speeds.nc'
file_zarr = './speeds.zarr'
file_chunked = './speeds_chunked.nc'

%time get_max_sp_xarray(file_unchunked)
# CPU times: user 5.39 s, sys: 8.21 s, total: 13.6 s
#Wall time: 13.6 s

%time get_max_sp_xarray(file_chunked)
#CPU times: user 6.16 s, sys: 14.5 s, total: 20.7 s
#Wall time: 20.7 s

%time get_max_sp_xarray(file_zarr, zarr=True)
#CPU times: user 10.1 s, sys: 43.5 s, total: 53.5 s
#Wall time: 4.43 s

Faster, includes time to plot and the memory footprint is much lower and it even includes plotting! I think the example with zarr is the one to beat, although in my current project I do not have everything on zarr. Not sure why the unchunked version was slower, probably some pressure on the machine.

My attempt with YAXArrays.jl

YAXArrays is an xarray-like julia package.

using YAXArrays, NetCDF, Zarr
function get_max_sp_yaxarrays(file)
    ds = open_dataset(file)
    sp = map(ds["USFC"], ds["VSFC"]) do usfc, vsfc
        sqrt(usfc^2 + vsfc^2)
    end
    maxsp = mapslices(maximum, sp, dims = ("x","y")) 
    return maxsp[:]
end

file_unchunked = "./speeds.nc"
file_zarr = "./speeds.zarr"
file_chunked = "./speeds_chunked.nc"

@time get_max_sp_yaxarrays(file_unchunked)
#172.770939 seconds (53.90 M allocations: 10.256 GiB, 1.46% gc time, 15.46% compilation time)

@time get_max_sp_yaxarrays(file_chunked)
#16.795537 seconds (818.22 k allocations: 9.506 GiB, 2.21% gc time, 2.03% compilation time)
#Warning: There are still cache misses

@time get_max_sp_yaxarrays(file_zarr)
 #60.923438 seconds (24.11 M allocations: 21.867 GiB, 6.29% gc time, 23.03% compilation time)
#Warning: There are still cache misses

The version with zarr leaves much to be desired, the version with netcdf in the chunked data seems competitive. It still takes 4 times as long as the fastest Python version.

My attempts with Rasters.jl:

using Rasters
function get_max_sp_rasters(file)
    vsfc = Raster(file_chunked, key="VSFC")
    usfc = Raster(file_chunked, key="USFC")
    sp = sqrt.(vsfc.^2 .+ usfc.^ 2)
    maxsp = maximum(sp,dims=(X,Y))[:]
end

file_unchunked = "./speeds.nc"
file_zarr = "./speeds.zarr"
file_chunked = "./speeds_chunked.nc"


@time get_max_sp_rasters(file_unchunked)
#  18.926890 seconds (62.42 k allocations: 11.723 GiB, 5.58% gc time)

@time get_max_sp_rasters(file_chunked)
# 17.946154 seconds (62.42 k allocations: 11.723 GiB, 5.65% gc time)

@time get_max_sp_rasters(file_zarr)
# 18.267268 seconds (62.42 k allocations: 11.723 GiB, 5.75% gc time)

Interestingly the timing is very consistent across files. I would say the performance is very good (timing similar to the NCDatasets) but It cannot beat Python with Zarr.
My manifest (julia 1.7.1):

Status `/global/u2/a/aramreye/JuliaVsPythonReadingTests/Project.toml`
  [13f3f980] CairoMakie v0.7.2
  [85f8d34a] NCDatasets v0.11.9
  [30363a11] NetCDF v0.11.4
  [a3a2b9e3] Rasters v0.2.0
  [b8865327] UnicodePlots v2.8.0
  [c21b50f5] YAXArrays v0.1.4
  [0a941bbe] Zarr v0.7.1
9 Likes

Your download is asking us to sign up for box.com… I dont really want to do that :wink:

Can you make the file public accessible without a login?

And what was your problem with Rasters.jl? It wont work for zarr, but for netcdf you just do

stack = RasterStack("your_file.nc")
A = stack[:layer_name]
# or 
A = Raster("your_file.nc"; name=:layer_name) # uses a specific layer
# or just
A = Raster("your_file.nc") # takes the first layer

Then you basically have an array, but it will load and broadcast lazily. The chunking probably wont be lazy, because NCDatasets.jl hasn’t implemented the DiskArrays.jl standard yet. But you can manually reduce over slices instead.

2 Likes

Oh, I thought it was public. Can you try again?

I could make it run with Rasters. I think it conflicts with YAXArrays or something like that so it needs to be on a different julia session. I am updating the original post.

Oh right, just dependencies. I haven’t updated for the very latest DiskArrays.jl yet.

Downloads are working now, thanks

Could you provide a direct link that I could use with Curl or wget on the cluster, because I currently don’t have 15GB on the laptop anymore.

How much cores do you have for your computation, I had the feeling, that for such computations, multi processing would help, because then the IO is also parallelized at least with the Zarr data.
For YAXArrays you can simply add workers using Distributed and then you need to load available packages on all workers using @everywhere and it will be parallelized automatically.

Sure! I opened the direct download for the .nc, I will try to put the zarr in a tarball

Let’s say we have 8 cores. I am still not sure if using distributed would be a good option here. Because I do similar computations for different files it may be a good idea to parallelize over files instead of for one file. Some experimentation will be useful.

1 Like

Direct link for the zarr also up in the original post.

1 Like

But are they significant here? If so, for a day to day use it may be a good idea to have a custom image.

I don’t think they matter thaat much. But for this small example sometimes the first run takes 3 minutes and the other ones are much faster. I am assuming this is due to data caching that makes subsequent runs from BenchmarkTools kind of useless. I could be wrong though. That one part is strictly my intuition.

1 Like

Compilation time should not depend on the size of the data (unless there is some specialized method that is called for large files), so it is possible to check that with a small set.

1 Like

Yes, you are right. What I was trying to convey is that I don’t think for this problem paying attention to compilation time is useful. The problem is more about reading and operating on large datasets efficiently. (Sometimes when people advice to use BenchmarkTools one of the arguments is to dilute compilation time in the statistics).
Let me revise the post to be clearer.

I understand, but you can eliminate completely the compilation time in your daily use by building a system image. If that is a significant part of the problem, independently of being “fair” or not in this comparison, it is something that can make your daily experience nicer.

3 Likes

Not sure if helpful, but I gave this a quick try and here is what I got.

  1. slightly different result on my laptop, with your Julia code giving :
@time get_max_sp_ncdatasets(file_chunked)
 12.386769 seconds (38.94 k allocations: 7.033 GiB, 3.09% gc time)
@time get_max_sp_ncdatasets(file_unchunked)
 10.391312 seconds (172.24 k allocations: 7.040 GiB, 6.40% gc time, 0.28% compilation time)

which is quite similar to what python, via Jupiter notebook, gives me :

CPU times: user 6.67 s, sys: 6.11 s, total: 12.8 s
Wall time: 15.3 s
CPU times: user 6.95 s, sys: 9.15 s, total: 16.1 s
Wall time: 16.3 s
  1. but just using distributed can speed things up though

For example, I get

julia> @time get_max_sp_ncdatasets.(jj);
  3.445555 seconds (1.35 k allocations: 69.516 KiB)
julia> @time get_max_sp_ncdatasets.(jj);
  1.497712 seconds (1.35 k allocations: 66.703 KiB)
julia> @time get_max_sp_ncdatasets.(jj);
  2.074084 seconds (1.35 k allocations: 67.266 KiB)

with the following code :

using DistributedArrays
@everywhere import NCDatasets as ncd

@everywhere file= "./speeds.nc"
@everywhere ds=ncd.Dataset(file)
@everywhere usfc = Array{Float32,3}(undef,512,512,1)
@everywhere vsfc = Array{Float32,3}(undef,512,512,1)
@everywhere function get_max_sp_ncdatasets(t::Int)
    ncd.load!(ds["USFC"].var, usfc, :,:,t)
    ncd.load!(ds["VSFC"].var,vsfc,:,:,t)
    maximum(sqrt.(usfc.^2 .+ vsfc.^2), dims = (1,2))[1]
end

ii=collect(1:2400)
jj=distribute(ii)

@time get_max_sp_ncdatasets.(jj)
@time get_max_sp_ncdatasets.(jj)
2 Likes

Side note: For simplicity and better readability you could use a single @everywhere begin ... end block.

1 Like

Rasters.jl is not doing any chunked loading here (like it would with a tif) because NCDatasets.jl doesn’t do that yet. That’s probably why the time is the same, and also why the allocations are high.

So it looks like we need chunked iteration in NCDatasets.jl for this to be fast.

There also seem to be a lot of other ways to do this with Rasters/NCDatasets that are insanely slow… some work to be done here.

2 Likes

Not sure I understand this. Are the broadcasts lazy? Otherwise, this appears to allocate a lot. And why are usfc and vsfc 3D arrays instead of 2D?

Also,

sqrt(maximum(usfc.^2 .+ vsfc.^2), dims = (1,2)[1])

instead of

maximum(sqrt.(usfc.^2 .+ vsfc.^2), dims = (1,2))[1]

?

2 Likes

true

They are not lazy. Allocation was vastly reduced by reducing the 3rd dim of usfc, vsfc to 1 with load! being in place. That’s where most of the speed up seems to come from in fact (try get_max_sp_ncdatasets.(ii)). But with distributed it should scale on the bigger files. Anyway, I would assume that this can be sped up some more.

what’s happening under the hood? I mean it’s building a computation graph… so yeah it’s probably 100x more work by the developer. But more importantly, what parallelism is it using under the hood? For this kind of comparison you really want to limit to same resource.