Reducing verbosity in array calculations

question
statistics
array
syntax

#1

I wrote simple code, but it looks messy because I am removing NaN and I usually want to apply element-by-element. To illustrate, I show my calculation of the Pearson correlation coefficient:

function r(obs, mod)
    om = (obs - mean(obs[!isnan.(obs)])) .* (mod - mean(mod[!isnan.(mod)]))
    oo = (obs - mean(obs[!isnan.(obs)])).^2
    mm = (mod - mean(mod[!isnan.(mod)])).^2
    sum(om[!isnan.(om)]) ./ sqrt.(sum(oo[!isnan.(oo)]) .* sum(mm[!isnan.(mm)]))
end

where obs and mod are arrays of floats of arbitrary dimension. It looks messy because of the additions of:

  • [!isnan.()] to the argument of every use of sum() and mean(); and
  • . for the functions isnan() and sqrt(), but of course not for sum() and mean().

There must be a less verbose, or more elegant, way to write this?

If you have comments on the functional part of the code, like bugs, those are welcome as well! My intention was to codify the equation for r, given by Stow et al., 2009 (alternative source).


#2

Should give the same result i think

function r(obs, mod)
    obs2 = obs[!isnan.(obs)];
    mod2 = mod[!isnan.(mod)]; 
    om = (obs2 - mean(obs2)) .* (mod2 - mean(mod2))
    oo = (obs2 - mean(obs2)).^2
    mm = (mod2 - mean(mod2)).^2
    sum(om) ./ sqrt.(sum(oo) .* sum(mm))
end

#3

For NaN’s, the following might also help:


#4

It looks cleaner, but then I tested it and found out that it does not do the same. With the original function I got a number \in [0, 1], but with your code I get NaN.

The NaN entries are, generally, at different indices in the two arrays obs and mod. Subsequently, obs2 and mod2 generally have different dimensions, and the indices of NaN don’t match.

Minimal example:

julia> a = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3
julia> b = [1 NaN 3]
1×3 Array{Float64,2}:
 1.0  NaN  3.0
julia> a.*b
1×3 Array{Float64,2}:
 1.0  NaN  9.0
julia> aa = a[!isnan(a)]
3-element Array{Int64,1}:
 1
 2
 3
julia> bb = b[!isnan(b)]
2-element Array{Float64,1}:
 1.0
 3.0
julia> aa.*bb
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
...

This new code is equivalent to the original:

function r2(obs, mod)
    obs2 = obs[!isnan.(obs)];
    mod2 = mod[!isnan.(mod)];
    om = (obs - mean(obs2)) .* (mod - mean(mod2))
    oo = (obs - mean(obs2)).^2
    mm = (mod - mean(mod2)).^2
    sum(om[!isnan.(om)]) ./ sqrt.(sum(oo[!isnan.(oo)]) .* sum(mm[!isnan.(mm)]))
end

Sadly, it does not as nice as @y4lu’s code.


#5

Yes, I’ve seen this package. Can be very useful. Its main use appears to be creating NaNs, but I already have those in my data! :slight_smile:

Of course, I can then type NaNMath.mean(x) instead of mean(x,!isnan(x)), and one can shorten by first defining nm=NaNMath. It looks a bit cleaner, but one depends on an extra package.

It’s a good option, but I am not fully satisfied, well, maybe this is it and I should be.

edit: I’m reconsidering this; this looks quite good:

function r3(obs, mod)
    #using NaNMath
    nm = NaNMath
    om = (obs - nm.mean(obs)) .* (mod - nm.mean(mod))
    oo = (obs - nm.mean(obs)).^2
    mm = (mod - nm.mean(mod)).^2
    nm.sum(om) ./ sqrt.(nm.sum(oo) .* nm.sum(mm))
end

using in a function is not accepted, probably for good reasons; but that’s fine.


#6

A technical point, but it is generally safer to use const when renaming a module. For example:

import NaNMath
const nm = NaNMath

function r3(obs, mod)
...
end

#7

This doesn’t look too bad still (Switched a few operators to / from elementwise for clarity)

function r(obs, mod)
  v1 = !isnan.(obs)
  v2 = !isnan.(mod)
  v3 = v1 & v2;
  obs3 = obs[v3] .- mean(obs[v1])
  mod3 = mod[v3] .- mean(mod[v2])
  sum(obs3 .* mod3) / sqrt.( sum(obs3 .^ 2) * sum(mod3 .^ 2))
end

#8

That does not do the same as my code: there must be a !isnan.() criterion for the arguments in each sum(). But it gave me some ideas, and decided that this function is probably equivalent to my original code:

using NaNMath
function r32(obs, mod)
    const nm = NaNMath
    dobs = obs - nm.mean(obs)
    dmod = mod - nm.mean(mod)
    nm.sum(dobs .* dmod) ./ sqrt.(nm.sum(dobs .^ 2) .* nm.sum(dmod .^ 2))
end

I said probably because it is based on superficial code inspection and one simple input test, namely a=Float64.([1,2,3]); b=Float64.([1,2,NaN]) (no integers because of issue in NaNMath).

But I just found out that all above code is wrong. They appear to be implementing r from Stow et al. (2009), but they don’t. I made a thinking error in selecting my indices. Every time I use an obs, I need to mask the NaN from both obs and mod. Same for mod. If I adjust the latest code here, the whole NaNMath is not useful anymore and the code becomes verbose.

To make a long story short, I believe that this is the correct code, and very readable as well:

function r23(obs, mod)
  bis = !isnan.(obs) .& !isnan.(mod); # == !isnan.(obs.*mod)
  dobs = obs[bis] - mean(obs[bis])
  dmod = mod[bis] - mean(mod[bis])  
  sum((dobs .* dmod)) / sqrt( sum((dobs .^ 2)) * sum((dmod .^ 2)) )
end

where isobs means there is an observation defined at that array index, ismod similarly, and bis stands for both are or twice (quite clever, I thought :smile:).


#9

There are a few problems in the code for r23, it gets a deprecation warning on v0.6, and doesn’t work at all on v0.7 (master).

Also, if you need decent performance, while the above is correct and readable, it is about twice as slow as a version that does no allocations.

I’ve placed them in the following Gist: Code and Benchmark functions

Running them on my MacBookPro, my version was about twice as fast on v0.6.2, and 2.5x as fast on master [v0.7.0-DEV.3716 (2018-02-05 03:16 UTC)].

Note: there is also a significant performance regression between v0.6.2 and v0.7 here, on v0.7 my version was about 10% slower, and yours was about 36% slower than on v0.6.2. That’s something that needs to be addressed, IMO.