# Simple CRPS optimization

I have a short function for computing the Continuous Ranked Probability Score (CRPS) from an ensemble of states `Xb::Array{Float64,2}` and a comparison state `y::Array{Float64,1}`, where `y` may have varying sparsity (i.e., `findall(isnan.(y))` may or may not be empty); when it is sparse, it typically has only ~8000 non-`NaN` entries.

The compute nodes on the cluster I run this on have relatively low-performing cores, but many of them, so I know threading is helpful here. I compute the CRPS using the following function:

```julia
using Statistics

Xb = rand(Float64, 6_000_000, 20)
y  = rand(Float64, 6_000_000)

function CRPS(X, y)
    inds = findall(.!isnan.(y))
    N, M = size(X)
    out = zeros(eltype(X), N)
    Threads.@threads for n in 1:N
        for j in 1:M
            out[n] += abs(X[n, j] - y[n]) / M
            for k in 1:M
                out[n] -= abs(X[n, j] - X[n, k]) / (2 * M * (M - 1))
            end
        end
    end
    return mean(out[inds])
end
```
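One thing I have wondered about is memory layout: Julia arrays are column-major, so indexing `X[n, j]` with `n` fixed strides across memory in the inner loops over members. Here is a sketch of the same math with the ensemble stored transposed, M×N, so the member loops walk contiguous memory (`CRPS_t` is just a name I made up for the variant, and I haven't benchmarked it):

```julia
using Statistics

# Same CRPS, but with the ensemble stored M×N (members down each column),
# so the inner loops over j and k access contiguous memory.
function CRPS_t(Xt::AbstractMatrix, y::AbstractVector)
    M, N = size(Xt)
    out = zeros(eltype(Xt), N)
    Threads.@threads for n in 1:N
        acc = zero(eltype(Xt))
        for j in 1:M
            acc += abs(Xt[j, n] - y[n]) / M
            for k in 1:M
                acc -= abs(Xt[j, n] - Xt[k, n]) / (2 * M * (M - 1))
            end
        end
        out[n] = acc  # NaN wherever y[n] is NaN; excluded below
    end
    inds = findall(.!isnan.(y))
    return mean(out[inds])
end
```

Accumulating into a local `acc` instead of `out[n]` also avoids repeated loads/stores of the same array element inside the hot loop.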

Even with many threads, running this function takes several hundred milliseconds. The output of `@report_opt CRPS(Xb, y)` reports 4 runtime-dispatch problems (three from the lock-wait-unlock of the threading machinery and one from the call to `mapreduce` inside `mean(out[inds])`), which I don't think I can do much about without completely restructuring the function.

Am I leaving a lot of performance on the table with this function?

The main easy speedup I can see is using multiplication instead of division.
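Concretely, that change would hoist the reciprocals out of the hot loop, something like this (`CRPS_mul` is a hypothetical name; the loop structure is otherwise the same as the function above):

```julia
using Statistics

function CRPS_mul(X::AbstractMatrix, y::AbstractVector)
    N, M = size(X)
    invM     = 1.0 / M                 # hoisted reciprocals: multiply in the
    invPairs = 1.0 / (2 * M * (M - 1)) # hot loop instead of dividing
    out = zeros(eltype(X), N)
    Threads.@threads for n in 1:N
        acc = zero(eltype(X))
        for j in 1:M
            acc += abs(X[n, j] - y[n]) * invM
            for k in 1:M
                acc -= abs(X[n, j] - X[n, k]) * invPairs
            end
        end
        out[n] = acc
    end
    inds = findall(.!isnan.(y))
    return mean(out[inds])
end
```

The two divisions per inner iteration become multiplications by constants computed once per call.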