I wrote some simple code, but it looks messy because I need to remove NaNs and I usually want to apply operations element by element. To illustrate, here is my calculation of the Pearson correlation coefficient:
function r(obs, mod)
    om = (obs - mean(obs[.!isnan.(obs)])) .* (mod - mean(mod[.!isnan.(mod)]))
    oo = (obs - mean(obs[.!isnan.(obs)])).^2
    mm = (mod - mean(mod[.!isnan.(mod)])).^2
    sum(om[.!isnan.(om)]) ./ sqrt.(sum(oo[.!isnan.(oo)]) .* sum(mm[.!isnan.(mm)]))
end
where obs and mod are arrays of floats of arbitrary dimension. It looks messy because of the addition of:
[.!isnan.()] around the argument of every call to sum() and mean(); and
. for the functions isnan() and sqrt(), but of course not for sum() and mean().
There must be a less verbose, or more elegant, way to write this?
If you have comments on the functional part of the code, like bugs, those are welcome as well! My intention was to codify the equation for r, given by Stow et al., 2009 (alternative source).
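For illustration, the repetition can be reduced with a small helper that drops the NaN entries. The names nonan and r_b are assumptions, and the sketch mirrors the masking of the function above exactly; on Julia 1.x, mean lives in the Statistics stdlib:

```julia
using Statistics  # mean (Julia 1.x)

# Hypothetical helper: keep only the non-NaN entries of an array.
nonan(x) = x[.!isnan.(x)]

# Same masking as the original r above, just less verbose.
function r_b(obs, mod)
    om = (obs .- mean(nonan(obs))) .* (mod .- mean(nonan(mod)))
    oo = (obs .- mean(nonan(obs))) .^ 2
    mm = (mod .- mean(nonan(mod))) .^ 2
    sum(nonan(om)) / sqrt(sum(nonan(oo)) * sum(nonan(mm)))
end
```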
It looks cleaner, but then I tested it and found out that it does not do the same thing. With the original function I got a number in [0, 1], but with your code I get NaN.
The NaN entries are, generally, at different indices in the two arrays obs and mod. Subsequently, obs2 and mod2 generally have different dimensions, and the indices of NaN don’t match.
Minimal example:
julia> a = [1 2 3]
1×3 Array{Int64,2}:
1 2 3
julia> b = [1 NaN 3]
1×3 Array{Float64,2}:
1.0 NaN 3.0
julia> a.*b
1×3 Array{Float64,2}:
1.0 NaN 9.0
julia> aa = a[.!isnan.(a)]
3-element Array{Int64,1}:
1
2
3
julia> bb = b[.!isnan.(b)]
2-element Array{Float64,1}:
1.0
3.0
julia> aa.*bb
ERROR: DimensionMismatch("arrays could not be broadcast to a common size")
Stacktrace:
...
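The mismatch goes away if one builds a single mask that requires the entry to be defined in both arrays, and applies that mask to each of them (a sketch; the variable name mask is an assumption):

```julia
a = [1 2 3]
b = [1 NaN 3]

# One shared mask: true only where *both* arrays have a defined value.
mask = .!isnan.(a) .& .!isnan.(b)

aa = a[mask]   # 2-element vector: 1, 3
bb = b[mask]   # 2-element vector: 1.0, 3.0
aa .* bb       # same length, so no DimensionMismatch
```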
This new code is equivalent to the original:
function r2(obs, mod)
    obs2 = obs[.!isnan.(obs)]
    mod2 = mod[.!isnan.(mod)]
    om = (obs - mean(obs2)) .* (mod - mean(mod2))
    oo = (obs - mean(obs2)).^2
    mm = (mod - mean(mod2)).^2
    sum(om[.!isnan.(om)]) ./ sqrt.(sum(oo[.!isnan.(oo)]) .* sum(mm[.!isnan.(mm)]))
end
Yes, I’ve seen this package. It can be very useful. Its main use appears to be creating NaNs, but I already have those in my data!
Of course, I can then type NaNMath.mean(x) instead of mean(x[.!isnan.(x)]), and one can shorten that further by first defining nm = NaNMath. It looks a bit cleaner, but it depends on an extra package.
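For reference, a minimal sketch of the skipping behaviour of those NaNMath replacements:

```julia
using NaNMath

x = [1.0, NaN, 3.0]
NaNMath.sum(x)    # 4.0 -- the NaN entry is skipped
NaNMath.mean(x)   # 2.0
```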
It’s a good option, but I am not fully satisfied, well, maybe this is it and I should be.
edit: I’m reconsidering this; this looks quite good:
function r3(obs, mod)
    # using NaNMath
    nm = NaNMath
    om = (obs - nm.mean(obs)) .* (mod - nm.mean(mod))
    oo = (obs - nm.mean(obs)).^2
    mm = (mod - nm.mean(mod)).^2
    nm.sum(om) ./ sqrt.(nm.sum(oo) .* nm.sum(mm))
end
A using statement inside a function is not accepted, probably for good reasons; but that’s fine.
That does not do the same as my code: there would still need to be a .!isnan.() mask on the argument of each sum(). But it gave me some ideas, and I decided that this function is probably equivalent to my original code:
using NaNMath
function r32(obs, mod)
    nm = NaNMath  # const is not allowed in local scope
    dobs = obs - nm.mean(obs)
    dmod = mod - nm.mean(mod)
    nm.sum(dobs .* dmod) ./ sqrt.(nm.sum(dobs .^ 2) .* nm.sum(dmod .^ 2))
end
I said probably because that is based on superficial code inspection and one simple input test, namely a = Float64.([1, 2, 3]); b = Float64.([1, 2, NaN]) (no integers, because of an issue in NaNMath).
But I just found out that all of the above code is wrong. It appears to implement r from Stow et al. (2009), but it doesn’t. I made a thinking error in selecting my indices: every time I use obs, I need to mask out the NaNs from both obs and mod, and likewise for mod. If I adjust the latest code accordingly, NaNMath is no longer useful and the code becomes verbose again.
To make a long story short, I believe that this is the correct code, and very readable as well:
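A sketch of what that final function might look like, reconstructed from the variable names described just below; the function name r4 and the exact form are assumptions (on Julia 1.x, mean needs the Statistics stdlib):

```julia
using Statistics  # mean (Julia 1.x)

function r4(obs, mod)          # name r4 is an assumption
    isobs = .!isnan.(obs)      # an observation is defined at this index
    ismod = .!isnan.(mod)      # a model value is defined at this index
    bis   = isobs .& ismod     # "both are": defined in both arrays
    dobs  = obs[bis] .- mean(obs[bis])
    dmod  = mod[bis] .- mean(mod[bis])
    sum(dobs .* dmod) / sqrt(sum(dobs .^ 2) * sum(dmod .^ 2))
end
```

The key difference from the earlier versions is that the means and sums are all taken over the same jointly masked indices, rather than masking each array independently.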
where isobs means there is an observation defined at that array index, ismod similarly, and bis stands for "both are" or "twice" (quite clever, I thought).
Running them on my MacBook Pro, my version was about twice as fast on v0.6.2, and about 2.5 times as fast on master [v0.7.0-DEV.3716 (2018-02-05 03:16 UTC)].
Note: there is also a significant performance regression between v0.6.2 and v0.7 here: on v0.7 my version was about 10% slower, and yours about 36% slower, than on v0.6.2. That’s something that needs to be addressed, IMO.