Improve performance for multiple for-loops


I am trying to optimize a calculation of correlation on a geo-grid of size 16x16:

using NetCDF, Statistics # Statistics provides cor and mean
function corr1vall(s, lon1, lat1) # scenario, longitude 1, latitude 1
    pathPv = myPath * "pv/" * ssps[s] * "/day/EU/regr/"
    rmcp(pathPv)
    
    corr = zeros(16,16,28)
    for m in 1:28 # model
        fnam = pathPv * readdir(pathPv)[m]
        pv = ncread(fnam, "pv") # size (16,16,7300)

        @views pv1 = pv[lon1,lat1,:]  # 20-yr time series of grid 1

        for lat in 1:16
            for lon in 1:16
                @views pv2 = pv[lon,lat,:] # 20-yr time series of grid 2
                corr[lon,lat,m] = cor(pv1/areaLat[1], pv2/areaLat[lat]) # normalize
            end
        end
    end
    return mean(corr, dims=3)[:,:,1] # multi-model mean
end

One grid currently takes 6.6 s:

@time corr1vall(1,1,1)
> 6.625234 seconds (89.95 k allocations: 1.171 GiB, 0.90% gc time)

Furthermore, I have 256 grids and 3 scenarios:

corrAllGrids = zeros(16,16,16*16)
g = 1 
for lat in 1:16
    for lon in 1:16
        corrAllGrids[:,:,g] = corr1vall(1, lon, lat)
        g += 1
    end
end

I hope to reduce the runtime of the first function before trying to parallelize. Are there any obvious modifications I could make?

Replacing cor(pv1/areaLat[1], pv2/areaLat[lat]) with cor(pv1*(1/areaLat[1]), pv2*(1/areaLat[lat])) should be a major improvement. Division is about 6x slower than multiplication.
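If you want to check the difference on your machine, here is a quick microbenchmark sketch (it assumes BenchmarkTools is installed; the vector size just mimics one of your time series):

using BenchmarkTools

x = rand(7300)           # roughly the size of one 20-year daily series
c = 2.5
@btime $x ./ $c          # broadcasted division
@btime $x .* $(1 / c)    # multiplication by a precomputed reciprocal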


Thanks for the hint. However, I still get 6.6-6.7 s after this.

Are you sure it is the loop that has the performance issue, and not an I/O issue?

Does this allocate a 1,868,800-element array 28 times? I think there is an ncread!, so you could allocate only once (see High-level interface · NetCDF.jl).

You should check for type stability. It looks like areaLat and myPath are defined outside of the function (global variables?)
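For example, a quick way to check is @code_warntype; any non-concrete (red) types caused by the globals will show up there. A sketch, assuming the function is already defined in your session:

using InteractiveUtils   # provides @code_warntype (loaded automatically in the REPL)

@code_warntype corr1vall(1, 1, 1)   # look for Any coming from areaLat / myPath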

1 Like

One problem is that not every model has the same time span: some 7300 days, some 7305, etc. Is it still feasible to allocate only once?

Right, I declared the area weights areaLat and the path myPath globally.

Also, I barely remember statistics, but I guess cor(a*c1, b*c2) can be rewritten somehow, maybe c1*c2*cor(a, b) or something like that. If that is the case, the constants should be factored out, since each pv1/areaLat[1] allocates a new array.

You can avoid this by declaring

function corr1vall(s, lon1, lat1, pathPv, areaLat)
...
end

It increases the number of arguments but makes everything type stable. If a large number of parameters is not what you want, you can wrap the extra parameters in a named tuple and use UnPack.jl to extract them in the function body:

using UnPack

function corr1vall(s, lon1, lat1, opts)
    @unpack pathPv, areaLat = opts
    ...
end

opts = (; pathPv = pathPv, areaLat = areaLat)
corr1vall(s, lon1, lat1, opts)
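(On Julia 1.7 and newer you can also destructure the named tuple directly, without UnPack.jl: (; pathPv, areaLat) = opts.)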

This is not quite correct; I think you can just remove the multipliers entirely. They don't change the correlation, except possibly for the sign?


Based on the suggestions so far, I modified the function to:

function corr1vall(s, lon1, lat1, pathPv, areaLat) # scenario, longitude 1, latitude 1
    rmcp(pathPv)

    corr = zeros(16,16,28)
    for m in 1:28 # model
        fnam = pathPv * readdir(pathPv)[m]
        pv = ncread(fnam, "pv") # size (16,16,7300 or 7305 or 7160)

        @views pv1 = pv[lon1,lat1,:] * (1/areaLat[lat1]) # normalize

        for lat in 1:16
            for lon in 1:16
                @views pv2 = pv[lon,lat,:] * (1/areaLat[lat])
                corr[lon,lat,m] = cor(pv1, pv2)
            end
        end
    end
    return mean(corr, dims=3)[:,:,1]
end

Current runtime for one grid:

@time corr1vall(1,1,1, pathPv, areaLat)
> 6.357588 seconds (143.15 k allocations: 805.057 MiB, 0.60% gc time)

But I am still not sure how to deal with the I/O issue from ncread, since the time dimension of each model is not identical.

How long does each read take? The calculations don't seem that expensive, while reading the data can take quite some time.

Reading each m takes around 0.20-0.22 s

0.22 s * 28 models = 6.16 s total for corr1vall; that looks like I/O time to me.


You are right, instead of

cor(pv1/areaLat[1], pv2/areaLat[lat])

it should be just

sign(areaLat[1]) * sign(areaLat[lat]) * cor(pv1, pv2)
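For completeness, this follows from the standard scaling rules for covariance and standard deviation:

$$
\operatorname{cor}(c_1 a,\, c_2 b)
= \frac{\operatorname{cov}(c_1 a,\, c_2 b)}{\sigma_{c_1 a}\,\sigma_{c_2 b}}
= \frac{c_1 c_2\,\operatorname{cov}(a, b)}{|c_1|\,|c_2|\,\sigma_a\,\sigma_b}
= \operatorname{sign}(c_1 c_2)\,\operatorname{cor}(a, b)
$$

Since the area weights are presumably all positive, the sign factors are 1 and the multipliers can simply be dropped.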

I was going to say: don't use global variables, move the readdir(pathPv) call outside the loop, and use dropdims instead of mean(corr, dims=3)[:,:,1], which makes an unnecessary copy.

But clearly, it's all about file I/O.

(But also, building paths by string concatenation is not robust; use joinpath.)
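For example, a sketch using the path components from the original post:

pathPv = joinpath(myPath, "pv", ssps[s], "day", "EU", "regr")
fnam   = joinpath(pathPv, readdir(pathPv)[m])   # or readdir(pathPv; join = true)[m]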


You seem to be working along the last dimension, which is not advantageous. And that makes me wonder: how are data actually stored on file?

Is there some chance that the data are contiguous on file but are reorganized during reading, perhaps to mimic how they work in Python or C? That could cause the file reading itself to become slow, and it later also hampers the correlation calculations.
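If the layout does turn out to matter, one option (a sketch, assuming pv has been read as a 16×16×T array) is to reorder it once per file so that each time series becomes contiguous in memory:

pvt = permutedims(pv, (3, 1, 2))   # T×16×16: time becomes the first (fastest) dimension; allocates one copy per file
@views pv1 = pvt[:, lon1, lat1]    # now a contiguous view of one 20-year series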

@fabiangans

First of all, did you profile this to find out where the time is spent? If all the time is spent in ncread, there is not a lot you can do currently; however, a 6 s read time is still unusually high, so it would be worth filing an issue and sharing an example file if you can confirm that the time is spent in I/O (just time reading a single file).

Another thing to remember is that ncread is not and cannot be type-stable, because you never know what the data types inside a NetCDF file are. So you have to make sure that the type of pv is known in some way to get an efficient inner loop. This can be done either by using a function barrier, as NetCDF.open(function, filename, varname) provides, or by using the mutating form ncread! and pre-allocating the array, as suggested already, for example like this:

corr = zeros(16,16,28)
pv = zeros(16,16,7300)

for m in 1:28 # model
    fnam = pathPv * readdir(pathPv)[m]
    ncread!(fnam, "pv", pv) # size (16,16,7300)
    # ... rest of the loop as before
end

This would reduce allocations and make it easier for the compiler to predict types in your second loop. However, as mentioned, this only makes sense if not all of the time is spent inside ncread.
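To illustrate the function-barrier idea in general terms (corr_kernel! is a hypothetical helper of mine, not part of NetCDF.jl): the untyped result of ncread is passed straight to a function, inside which the element type of pv is concrete, so the hot loops compile to fast code.

using Statistics

# Hypothetical inner kernel: here `pv` has a concrete type, so the loops are fast.
function corr_kernel!(corr, pv, m, lon1, lat1)
    @views pv1 = pv[lon1, lat1, :]
    for lat in 1:16, lon in 1:16
        @views corr[lon, lat, m] = cor(pv1, pv[lon, lat, :])
    end
    return corr
end

pv = ncread(fnam, "pv")               # type not inferable at the call site...
corr_kernel!(corr, pv, m, lon1, lat1) # ...but concrete inside the kernel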


Thanks a lot for all your helpful insights! After moving the reading of the .nc files to the outermost loop, I now get an acceptable computation time.
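For reference, a minimal sketch of one way to do that restructuring (reading each file exactly once and getting all 256×256 grid correlations per model from a single cor call; the area weights are dropped per the discussion above, and corr_all_grids is just an illustrative name):

using NetCDF, Statistics

function corr_all_grids(pathPv)
    files = readdir(pathPv)
    corr  = zeros(256, 256, length(files))
    for (m, f) in enumerate(files)              # each file is read exactly once
        pv = ncread(joinpath(pathPv, f), "pv")  # size (16, 16, T)
        X  = reshape(pv, 256, size(pv, 3))'     # T×256: one column per grid cell
        corr[:, :, m] = cor(X)                  # full 256×256 correlation matrix
    end
    return dropdims(mean(corr, dims = 3), dims = 3)  # multi-model mean
end

reshape(corr_all_grids(pathPv), 16, 16, 256) then recovers the (lon, lat, grid) layout of corrAllGrids from the original post.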