Improve performance for multiple for-loops


I am trying to optimize a calculation of correlation on a geo-grid of size 16x16:

using NetCDF, Statistics # Statistics provides cor and mean
function corr1vall(s, lon1, lat1) # scenario, longitude 1, latitude 1
    pathPv = myPath * "pv/" * ssps[s] * "/day/EU/regr/"
    rmcp(pathPv)
    
    corr = zeros(16,16,28)
    for m in 1:28 # model
        fnam = pathPv * readdir(pathPv)[m]
        pv = ncread(fnam, "pv") # size (16,16,7300)

        @views pv1 = pv[lon1,lat1,:]  # 20-yr time series of grid 1

        for lat in 1:16
            for lon in 1:16
                @views pv2 = pv[lon,lat,:] # 20-yr time series of grid 2
                corr[lon,lat,m] = cor(pv1/areaLat[1], pv2/areaLat[lat]) # normalize
            end
        end
    end
    return mean(corr, dims=3)[:,:,1] # multi-model mean
end

One grid currently takes 6.6 s:

@time corr1vall(1,1,1)
> 6.625234 seconds (89.95 k allocations: 1.171 GiB, 0.90% gc time)

Furthermore, I have 256 grids and 3 scenarios:

corrAllGrids = zeros(16,16,16*16)
g = 1 
for lat in 1:16
    for lon in 1:16
        corrAllGrids[:,:,g] = corr1vall(1, lon, lat)
        g += 1
    end
end

I hope to reduce the runtime of the first function before trying to parallelize. Are there any obvious modifications I could make?

Replacing cor(pv1/areaLat[1], pv2/areaLat[lat]) with cor(pv1*(1/areaLat[1]), pv2*(1/areaLat[lat])) should be a major improvement. Division is about 6x slower than multiplication.
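If you want to check the difference on your machine, here is a quick microbenchmark sketch (it assumes BenchmarkTools is installed; the vector size just mimics one of your time series):

using BenchmarkTools

x = rand(7300)           # roughly the size of one 20-year daily series
c = 2.5
@btime $x ./ $c          # broadcasted division
@btime $x .* $(1 / c)    # multiplication by a precomputed reciprocal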


Thanks for the hint. However, I still get 6.6-6.7 s after this.

Are you sure it is the loop that has the performance issue, and not an I/O issue?

Does this allocate a 1,868,800-element array 28 times? I think there is an ncread!, so you could allocate only once (see High-level interface · NetCDF.jl).

You should check for type stability. It looks like areaLat and myPath are defined outside of the function (global variables?)
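For example, a quick way to check is @code_warntype; any non-concrete (red) types caused by the globals will show up there. A sketch, assuming the function is already defined in your session:

using InteractiveUtils   # provides @code_warntype (loaded automatically in the REPL)

@code_warntype corr1vall(1, 1, 1)   # look for Any coming from areaLat / myPath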

1 Like

One problem is that not every model has the same time span: some 7300 days, some 7305, etc. Is it still feasible to allocate only once?

Right, I declared the area weights areaLat and the path myPath globally.

Also, I barely remember statistics, but I guess cor(a*c1, b*c2) can be rewritten somehow, maybe c1*c2*cor(a, b) or something like that. If that is the case, the constants should be factored out, since each pv1/areaLat[1] allocates a new array.

You can avoid this by declaring

function corr1vall(s, lon1, lat1, pathPv, areaLat)
...
end

It increases the number of arguments but makes everything type stable. If a large number of parameters is not what you want, you can wrap the extra parameters in a named tuple and use UnPack.jl to extract them in the function body:

using UnPack

function corr1vall(s, lon1, lat1, opts)
    @unpack pathPv, areaLat = opts
    ...
end

opts = (; pathPv = pathPv, areaLat = areaLat)
corr1vall(s, lon1, lat1, opts)
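(On Julia 1.7 and newer you can also destructure the named tuple directly, without UnPack.jl: (; pathPv, areaLat) = opts.)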

This is not quite correct; I think you can just remove the multipliers entirely. They don't change the correlation, except possibly for the sign?


Based on the suggestions so far, I modified the function to:

function corr1vall(s, lon1, lat1, pathPv, areaLat) # scenario, longitude 1, latitude 1
    rmcp(pathPv)

    corr = zeros(16,16,28)
    for m in 1:28 # model
        fnam = pathPv * readdir(pathPv)[m]
        pv = ncread(fnam, "pv") # size (16,16,7300 or 7305 or 7160)

        @views pv1 = pv[lon1,lat1,:] * (1/areaLat[lat1]) # normalize

        for lat in 1:16
            for lon in 1:16
                @views pv2 = pv[lon,lat,:] * (1/areaLat[lat])
                corr[lon,lat,m] = cor(pv1, pv2)
            end
        end
    end
    return mean(corr, dims=3)[:,:,1]
end

Current runtime for one grid:

@time corr1vall(1,1,1, pathPv, areaLat)
> 6.357588 seconds (143.15 k allocations: 805.057 MiB, 0.60% gc time)

But I am still not sure how to deal with the I/O issue from ncread, since the time dimension of each model is not identical.

How long does each read take? The calculations don't seem that expensive, while reading the data can take quite some time.

Reading each m takes around 0.20-0.22 s

0.22 s * 28 models = 6.16 s total for corr1vall; that looks like I/O time to me.


You are right, instead of

cor(pv1/areaLat[1], pv2/areaLat[lat])

it should be just

sign(areaLat[1]) * sign(areaLat[lat]) * cor(pv1, pv2)
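For completeness, this follows from the standard scaling rules for covariance and standard deviation:

$$
\operatorname{cor}(c_1 a,\, c_2 b)
= \frac{\operatorname{cov}(c_1 a,\, c_2 b)}{\sigma_{c_1 a}\,\sigma_{c_2 b}}
= \frac{c_1 c_2\,\operatorname{cov}(a, b)}{|c_1|\,|c_2|\,\sigma_a\,\sigma_b}
= \operatorname{sign}(c_1 c_2)\,\operatorname{cor}(a, b)
$$

Since the area weights are presumably all positive, the sign factors are 1 and the multipliers can simply be dropped.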

I was going to say: don't use global variables, move the readdir(pathPv) call outside the loop, and use dropdims instead of mean(corr, dims=3)[:,:,1], which makes an unnecessary copy.

But clearly, it's all about file I/O.

(But also, building paths by string concatenation is not robust; use joinpath.)
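For example, a sketch using the path components from the original post:

pathPv = joinpath(myPath, "pv", ssps[s], "day", "EU", "regr")
fnam   = joinpath(pathPv, readdir(pathPv)[m])   # or readdir(pathPv; join = true)[m]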


You seem to be working along the last dimension, which is not advantageous. And that makes me wonder: how are data actually stored on file?

Is there some chance that the data are contiguous on file but are reorganized during reading, perhaps to mimic how they work in Python or C? That could cause the file reading itself to become slow, and it later also hampers the correlation calculations.
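If the layout does turn out to matter, one option (a sketch, assuming pv has been read as a 16×16×T array) is to reorder it once per file so that each time series becomes contiguous in memory:

pvt = permutedims(pv, (3, 1, 2))   # T×16×16: time becomes the first (fastest) dimension; allocates one copy per file
@views pv1 = pvt[:, lon1, lat1]    # now a contiguous view of one 20-year series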

@fabiangans

First of all, did you profile this to find out where the time is spent? If all the time is spent in ncread, there is not a lot you can do currently; however, a 6 s read time is still unusually high, so it would be worth filing an issue and sharing an example file if you can confirm that the time is spent in I/O (just time reading a single file).

Another thing to remember is that ncread is not and cannot be type-stable, because you never know what the data types inside a NetCDF file are. So you have to make sure that the type of pv is known in some way to get an efficient inner loop. This can be done either by using a function barrier, as NetCDF.open(function, filename, varname) provides, or by using the mutating form ncread! and pre-allocating the array, as suggested already, for example like this:

corr = zeros(16,16,28)
pv = zeros(16,16,7300)

for m in 1:28 # model
    fnam = pathPv * readdir(pathPv)[m]
    ncread!(fnam, "pv", pv) # size (16,16,7300)
    # ... rest of the loop as before
end

This would reduce allocations and make it easier for the compiler to predict types in your second loop. However, as mentioned, this only makes sense if not all of the time is spent inside ncread.
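To illustrate the function-barrier idea in general terms (corr_kernel! is a hypothetical helper of mine, not part of NetCDF.jl): the untyped result of ncread is passed straight to a function, inside which the element type of pv is concrete, so the hot loops compile to fast code.

using Statistics

# Hypothetical inner kernel: here `pv` has a concrete type, so the loops are fast.
function corr_kernel!(corr, pv, m, lon1, lat1)
    @views pv1 = pv[lon1, lat1, :]
    for lat in 1:16, lon in 1:16
        @views corr[lon, lat, m] = cor(pv1, pv[lon, lat, :])
    end
    return corr
end

pv = ncread(fnam, "pv")               # type not inferable at the call site...
corr_kernel!(corr, pv, m, lon1, lat1) # ...but concrete inside the kernel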


Thanks a lot for all your helpful insights! After moving the reading of the .nc files to the outermost loop, I now get an acceptable computation time.
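For reference, a minimal sketch of one way to do that restructuring (reading each file exactly once and getting all 256×256 grid correlations per model from a single cor call; the area weights are dropped per the discussion above, and corr_all_grids is just an illustrative name):

using NetCDF, Statistics

function corr_all_grids(pathPv)
    files = readdir(pathPv)
    corr  = zeros(256, 256, length(files))
    for (m, f) in enumerate(files)              # each file is read exactly once
        pv = ncread(joinpath(pathPv, f), "pv")  # size (16, 16, T)
        X  = reshape(pv, 256, size(pv, 3))'     # T×256: one column per grid cell
        corr[:, :, m] = cor(X)                  # full 256×256 correlation matrix
    end
    return dropdims(mean(corr, dims = 3), dims = 3)  # multi-model mean
end

reshape(corr_all_grids(pathPv), 16, 16, 256) then recovers the (lon, lat, grid) layout of corrAllGrids from the original post.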