Simple CRPS optimization

I have a short function for computing the Continuous Ranked Probability Score (CRPS) from an ensemble of states Xb::Array{Float64,2} and a set comparison state y::Array{Float64,1}, where y may have varying sparsity (i.e., findall(isnan.(y)) may or may not be empty), and when it is sparse, it typically only has ~8000 or so non-NaNs.

The cluster I am running the following function on has relatively low-performing cores on the compute nodes, but many of them, so I know that threading is helpful here. I compute the CRPS using the following function:

Xb = rand(Float64, 6_000_000, 20)
y  = rand(Float64, 6_000_000)

function CRPS(X,y)
	inds = findall(.!isnan.(y));
	N,M = size(X);
	out = zeros(eltype(X),N)
	Threads.@threads for n in inds
		for j in 1:M
			out[n] += abs(X[n,j]-y[n])/M;
			for k in 1:M
				out[n] -= abs(X[n,j]-X[n,k])/(2*M*(M-1));
	return mean(out[inds])

My question is that running this function takes several hundred milliseconds, even with many threads. Looking at the output of @report_opt CRPS(Xb,y) suggests 4 problems with runtime dispatch (three from the lock-wait-unlock of threading and one from the call to mapreduce within mean(out[inds])) which I don’t think I can do much about without completely restructuring the function.

Am I leaving a lot of performance on the table with this function?

the main easy speedup I can see is using multiplication instead of division.