# Simple CRPS optimization

I have a short function for computing the Continuous Ranked Probability Score (CRPS) from an ensemble of states `Xb::Array{Float64,2}` and a comparison state `y::Array{Float64,1}`, where `y` may have varying sparsity (i.e., `findall(isnan.(y))` may or may not be empty); when it is sparse, it typically has only ~8000 non-`NaN` entries.

The compute nodes on the cluster I run this on have relatively low-performing cores, but many of them, so I know threading is helpful here. I compute the CRPS using the following function:

```julia
using Statistics

Xb = rand(Float64, 6_000_000, 20)
y  = rand(Float64, 6_000_000)

function CRPS(X, y)
    inds = findall(.!isnan.(y))
    N, M = size(X)
    out = zeros(eltype(X), N)
    Threads.@threads for n in 1:N
        for j in 1:M
            out[n] += abs(X[n, j] - y[n]) / M
            for k in 1:M
                out[n] -= abs(X[n, j] - X[n, k]) / (2 * M * (M - 1))
            end
        end
    end
    return mean(out[inds])
end
```
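One thing I have wondered about is memory layout: Julia arrays are column-major, so indexing `X[n, j]` with `n` fixed strides across memory in the inner loops over members. Here is a sketch of the same math with the ensemble stored transposed, M×N, so the member loops walk contiguous memory (`CRPS_t` is just a name I made up for the variant, and I haven't benchmarked it):

```julia
using Statistics

# Same CRPS, but with the ensemble stored M×N (members down each column),
# so the inner loops over j and k access contiguous memory.
function CRPS_t(Xt::AbstractMatrix, y::AbstractVector)
    M, N = size(Xt)
    out = zeros(eltype(Xt), N)
    Threads.@threads for n in 1:N
        acc = zero(eltype(Xt))
        for j in 1:M
            acc += abs(Xt[j, n] - y[n]) / M
            for k in 1:M
                acc -= abs(Xt[j, n] - Xt[k, n]) / (2 * M * (M - 1))
            end
        end
        out[n] = acc  # NaN wherever y[n] is NaN; excluded below
    end
    inds = findall(.!isnan.(y))
    return mean(out[inds])
end
```

Accumulating into a local `acc` instead of `out[n]` also avoids repeated loads/stores of the same array element inside the hot loop.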

Even with many threads, running this function takes several hundred milliseconds. The output of `@report_opt CRPS(Xb, y)` reports 4 runtime-dispatch problems (three from the lock-wait-unlock of the threading machinery and one from the call to `mapreduce` inside `mean(out[inds])`), which I don't think I can do much about without completely restructuring the function.

Am I leaving a lot of performance on the table with this function?

The main easy speedup I can see is using multiplication instead of division.
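Concretely, that change would hoist the reciprocals out of the hot loop, something like this (`CRPS_mul` is a hypothetical name; the loop structure is otherwise the same as the function above):

```julia
using Statistics

function CRPS_mul(X::AbstractMatrix, y::AbstractVector)
    N, M = size(X)
    invM     = 1.0 / M                 # hoisted reciprocals: multiply in the
    invPairs = 1.0 / (2 * M * (M - 1)) # hot loop instead of dividing
    out = zeros(eltype(X), N)
    Threads.@threads for n in 1:N
        acc = zero(eltype(X))
        for j in 1:M
            acc += abs(X[n, j] - y[n]) * invM
            for k in 1:M
                acc -= abs(X[n, j] - X[n, k]) * invPairs
            end
        end
        out[n] = acc
    end
    inds = findall(.!isnan.(y))
    return mean(out[inds])
end
```

The two divisions per inner iteration become multiplications by constants computed once per call.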