I have a short function for computing the Continuous Ranked Probability Score (CRPS) from an ensemble of states `Xb::Array{Float64,2}` and a comparison state `y::Array{Float64,1}`, where `y` may have varying sparsity (i.e., `findall(isnan.(y))` may or may not be empty); when it *is* sparse, it typically has only ~8000 or so non-`NaN` entries.

The cluster I am running this on has relatively low-performing cores on its compute nodes, but many of them, so I know that threading is helpful here. I compute the CRPS using the following function:

```
using Statistics  # for mean

Xb = rand(Float64, 6_000_000, 20)
y = rand(Float64, 6_000_000)

function CRPS(X, y)
    inds = findall(.!isnan.(y))
    N, M = size(X)
    out = zeros(eltype(X), N)
    Threads.@threads for n in inds
        for j in 1:M
            out[n] += abs(X[n, j] - y[n]) / M
            for k in 1:M
                out[n] -= abs(X[n, j] - X[n, k]) / (2 * M * (M - 1))
            end
        end
    end
    return mean(out[inds])
end
```
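As a sanity check on the formula, the per-element value is the fair ensemble CRPS, which I can restate brute-force for a single row (the `crps_direct` helper below is a hypothetical name, just for checking small inputs):

```
using Statistics

# Brute-force fair CRPS for one ensemble x against a scalar y:
#   CRPS = (1/M) Σ_j |x_j - y|  -  1/(2M(M-1)) Σ_{j,k} |x_j - x_k|
function crps_direct(x::AbstractVector, y::Real)
    M = length(x)
    term1 = mean(abs.(x .- y))
    term2 = sum(abs(x[j] - x[k]) for j in 1:M, k in 1:M) / (2 * M * (M - 1))
    return term1 - term2
end

# Averaging crps_direct over the non-NaN entries of y reproduces
# what the threaded function computes on small inputs.
X = rand(5, 20)
ys = [1.0, NaN, 0.5, NaN, 0.25]   # sparse comparison state
inds = findall(.!isnan.(ys))
ref = mean(crps_direct(X[n, :], ys[n]) for n in inds)
```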

My question: running this function takes several hundred milliseconds, even with many threads. The output of `@report_opt CRPS(Xb, y)` reports four runtime-dispatch problems (three from the lock-wait-unlock of threading and one from the call to `mapreduce` within `mean(out[inds])`), which I don't think I can do much about without completely restructuring the function.

Am I leaving a lot of performance on the table with this function?