# Optimization of Code for NaN Operations

Hi, I’ve been trying to perform operations on arrays that may have missing values, using the prototype function below:

```
function nanop(f::Function, data::Array; dim=1)

    ndim  = ndims(data)
    dsize = size(data)

    # Permute so that the reduction dimension comes last, and record
    # the size of the output array.
    if ndim == 2
        if dim == 1
            data  = permutedims(data, [2, 1])
            nsize = (dsize[2],)
        else
            nsize = dsize[1:end-1]
        end
    elseif ndim > 2
        if dim == 1
            newdim = vcat(collect(2:ndim), 1)
            data   = permutedims(data, newdim)
            nsize  = dsize[2:end]
        elseif dim < ndim
            newdim1 = collect(1:dim-1)
            newdim2 = collect(dim+1:ndim)
            data    = permutedims(data, vcat(newdim1, newdim2, dim))
            nsize   = (dsize[newdim1]..., dsize[newdim2]...)
        else
            nsize = dsize[1:end-1]
        end
    else
        nsize = dsize[1:end-1]
    end

    data = reshape(data, :, dsize[dim])
    l    = size(data, 1)
    out  = zeros(l)

    for ii = 1:l
        out[ii] = f(skipmissing(data[ii, :]))
    end

    return reshape(out, nsize)
end
```

However, benchmarking with `@btime` shows that something is seriously slowing down performance for large arrays.

```
@btime nanop(mean,    rand(480,241,365), dim=3);
@btime nanop(maximum, rand(480,241,365), dim=3);
@btime nanop(minimum, rand(480,241,365), dim=3);
@btime nanop(std,     rand(480,241,365), dim=3);
```

gives me

```
619.931 ms (231370 allocations: 663.69 MiB)
618.972 ms (115690 allocations: 661.93 MiB)
619.175 ms (115690 allocations: 661.93 MiB)
829.410 ms (231370 allocations: 663.69 MiB)
```

Is there any way to reduce the memory allocation? I’m eventually going to be doing operations like

```
@btime nanop(mean,    rand(480,241,365,4), dim=4);
@btime nanop(maximum, rand(480,241,365,4), dim=4);
@btime nanop(minimum, rand(480,241,365,4), dim=4);
@btime nanop(std,     rand(480,241,365,4), dim=4);
```

which frankly scares the crap out of me, since I’ll have to perform the for-loop operation many more times than in my first few test cases.

`skipmissing` brings a performance decrease of roughly a factor of 3–4:

```
julia> a = rand(480,241,365,4);

julia> @benchmark [x for x in a[:]]
BenchmarkTools.Trial:
  memory estimate:  2.52 GiB
  allocs estimate:  5
  --------------
  minimum time:     1.651 s (43.07% GC)
  median time:      1.775 s (47.18% GC)
  mean time:        1.783 s (47.36% GC)
  maximum time:     1.922 s (51.20% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> @benchmark [x for x in skipmissing(a[:])]
BenchmarkTools.Trial:
  memory estimate:  2.72 GiB
  allocs estimate:  34
  --------------
  minimum time:     5.952 s (49.85% GC)
  median time:      5.952 s (49.85% GC)
  mean time:        5.952 s (49.85% GC)
  maximum time:     5.952 s (49.85% GC)
  --------------
  samples:          1
  evals/sample:     1
```
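For reference, `skipmissing` returns a lazy wrapper that reductions can consume directly, so you usually don’t need to collect it into a new array first. A minimal sketch:

```
v = [1.0, missing, 3.0]

# reductions accept the lazy skipmissing iterator directly
sum(skipmissing(v))         # 4.0
count(ismissing, v)         # 1

# collect only when you genuinely need a plain Vector{Float64}
collect(skipmissing(v))     # [1.0, 3.0]
```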

You can try setting the missing values to something that is neutral to the operation; e.g. zero is neutral for a sum (for a mean you would also need to correct for the number of missing entries). You can do:

```
data[:] = [ismissing(x) ? zero(x) : x for x in data[:]]
for ii = 1:l
    out[ii] = f(data[ii, :])
end
```

(You don’t need a separate `dataii` variable. Note that the slice `data[ii,:]` does allocate a copy; `@view data[ii,:]` would avoid that.)
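A more idiomatic way to do the fill-in step is broadcast `coalesce`, which replaces each `missing` with a given value. A minimal sketch, assuming `Union{Float64,Missing}` data and a sum-like reduction where zero is neutral:

```
data = Union{Float64,Missing}[1.0 2.0     3.0;
                              4.0 missing 6.0]

# replace every missing entry with 0.0 (neutral for sum, not for mean)
filled = coalesce.(data, 0.0)

sum(filled; dims=2)   # 2×1 Matrix: [6.0; 10.0]
```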


Thanks so much for your help! I will have to look into this later, and will put up some benchmarking once I get the time.

First of all, you are also benchmarking the creation of the array `rand(480,241,365)`, which takes time and allocates memory. I assume you already have the array.

```
julia> @btime nanop(mean, rand(480,241,365); dim=3);
  517.498 ms (231370 allocations: 663.69 MiB)

julia> r = rand(480,241,365);

julia> @btime nanop(mean, $r; dim=3);
  214.726 ms (231368 allocations: 341.55 MiB)
```

If you mostly want to reduce memory use, you can do this instead:

```
mapslices(mean∘skipmissing, r; dims=3)
```

Benchmarking shows slower timing, but far less memory use:

```
julia> @btime mapslices(mean∘skipmissing, $r; dims=3);
  355.505 ms (1388200 allocations: 34.43 MiB)
```