For what it’s worth, I think that for most users and use-cases, naive_findmax is a perfectly appropriate function / implementation to use, so long as you understand that it doesn’t account for some of the corner cases that come up with floats.
It’s just that at a language level, when we write a function like findmax in Base, we have no idea if the user will be sensitive to these things or not, so we typically err on the side of safety here, and try to give better, more robust implementations even if it does cost performance in some circumstances.
It’d be good if we could make @fastmath findmax(itr) do the fast thing for you automatically, but at least for now, it’s quite easy and quick to roll your own.
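For reference, rolling your own might look something like this (my sketch, not necessarily the exact naive_findmax from earlier in the thread; it assumes a non-empty vector and deliberately ignores the NaN and signed-zero corner cases):

function naive_findmax(x::AbstractVector{<:AbstractFloat})
    # Plain `>` comparison: a NaN never compares greater, so NaNs are
    # effectively skipped. That is the corner case Base's findmax guards against.
    maxval = x[1]
    maxidx = 1
    @inbounds for i in 2:length(x)
        if x[i] > maxval
            maxval = x[i]
            maxidx = i
        end
    end
    return maxval, maxidx
end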
So now it’s way faster than that C code (why?), but I’m not even sure it’s the limit of speedup.
I was thinking a naive_findmax could live in Base, or be available under findmax with a keyword argument (and every time Base isn’t the fastest option for something, the docs could explain the faster option that’s available without taking edge cases into account). But this seems to be a deep rabbit hole, and maybe something even faster is possible here.
At the very least, could we document in their docstrings that findmax/findmin aren’t the fastest, and point to alternatives (if not simply add the fastest implementation), likely FindFirstFunctions.jl?
julia> function assignment_probabilities_sorted_gauss(utilities, tmp, cutoffs, mean, stddev)
           last_prob = 1.0
           expectedU = 0.0
           maxfound = -Inf
           maxidx = 0
           sch_left = 500   # problem size is hard-coded to 500 entries here
           # Forward pass: tmp[i] holds the index of the running maximum of utilities[1:i].
           for i in 1:sch_left
               item = utilities[i]
               if item > maxfound
                   maxfound = item
                   maxidx = i
               end
               tmp[i] = maxidx
           end
           # Backward pass: jump straight to the previous prefix maximum instead of
           # re-scanning, which is what turns the quadratic loop into a linear one.
           while sch_left > 0
               ix = tmp[sch_left]
               max_util = utilities[ix]
               if max_util > 0
                   this_prob = cdf_eval(cutoffs[ix], mean, stddev)  # cdf_eval is assumed to be defined elsewhere
                   p = last_prob - this_prob
                   expectedU += p * max_util
                   last_prob = this_prob
                   sch_left = ix - 1
               else
                   break
               end
           end
           return expectedU
       end
That does give you a speedup from N^2 to N in the worst-case where utilities are monotonically increasing (and you might be able to re-use the data in tmp if you ever want to run this on the same utility vector with different cutoffs/mean/stddev).
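For example, a hypothetical refactoring (names like fill_prefix_argmax! are mine, not from the thread) could compute tmp once and reuse it across calls:

# Hypothetical helper: fill tmp[i] with the index of the running maximum of
# utilities[1:i]. Once computed, tmp stays valid for any cutoffs/mean/stddev
# as long as `utilities` itself is unchanged.
function fill_prefix_argmax!(tmp, utilities, n)
    maxfound = -Inf
    maxidx = 0
    for i in 1:n
        if utilities[i] > maxfound
            maxfound = utilities[i]
            maxidx = i
        end
        tmp[i] = maxidx
    end
    return tmp
end

The backward while-loop above would then take the precomputed tmp as an input instead of rebuilding it on every call.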
I feel like there’s got to be a way to make findmax faster on floats despite the annoyance of NaNs. On ARM the fmax and fmin instructions already do the right thing (propagate NaNs), so we should make sure that we emit the simple code there. On x64, it seems like the native instructions actually happen to implement isless (NaNs greater than non-NaNs), which is useful elsewhere, but not exactly what we need here because we want to propagate NaNs. But, I think that we can just use the native instruction for findmax since it has the right order: non-NaNs in standard order followed by NaNs. For findmin we can’t use the native instruction, but we can negate each value and find the maximum and then negate the maximum we find. Not sure if I’m missing anything, but this seems to me like it should work.
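To make the ordering argument concrete, here is a portable sketch in plain Julia (just the semantics a native-instruction version would need to match, not an actual use of fmax/fmin):

# `isless` orders NaN after every other value, so a max-reduction under
# `isless` returns NaN whenever one is present, which is the propagation
# behaviour findmax wants.
isless_max(a, b) = isless(a, b) ? b : a

findmax_isless(x) = foldl(isless_max, x)           # value only, for brevity

# findmin via the negate-and-findmax trick: negation leaves NaN as NaN, so the
# same ordering argument carries over.
findmin_via_negation(x) = -findmax_isless(-v for v in x)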
Note that the native comparisons are not always doing the right thing on x86:
libm/IEEE754: “If one of the arguments is a NaN, the other is returned.”
x86 (Intel manual): “If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result”
Which IIRC is the actual reason none of this is optimized well.
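In Julia terms, the x86 scalar behaviour is what you get from a plain a > b ? a : b, while the libm/IEEE behaviour keeps the non-NaN operand (a sketch just to illustrate the quoted difference):

# x86 maxsd-style: any comparison involving NaN is false, so the second operand wins.
x86_style_max(a, b) = a > b ? a : b

# libm/IEEE fmax-style: if exactly one argument is NaN, return the other one.
ieee_style_max(a, b) = isnan(a) ? b : (isnan(b) ? a : (a > b ? a : b))

x86_style_max(1.0, NaN)   # NaN (the second operand)
x86_style_max(NaN, 1.0)   # 1.0 (again the second operand)
ieee_style_max(NaN, 1.0)  # 1.0 (the non-NaN argument)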
To annoy all the good folks who want consistent semantics, the same is true for the vectorized versions, so at least that is internally consistent.
In any case, if your input is guaranteed free of NaN, the native instruction does exactly what we’d need here. You’ll also get performance parity if you do the blocking right (so that the FMA units are loaded optimally), though clang/LLVM (and thus Julia) is at a disadvantage compared to gcc here.
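For what “doing the blocking right” might look like in Julia, here is a rough sketch with several independent accumulators (the function name and the unroll factor of 4 are my own choices; the best factor depends on the target CPU, and any real speedup needs benchmarking):

# Rough sketch of a blocked max-reduction: four independent accumulators give
# the compiler independent dependency chains to keep the vector units busy.
# Julia's max propagates NaN, so this stays semantically safe.
function blocked_maximum(x::Vector{Float64})
    n = length(x)
    m1 = m2 = m3 = m4 = -Inf
    i = 1
    @inbounds while i + 3 <= n
        m1 = max(m1, x[i]);   m2 = max(m2, x[i+1])
        m3 = max(m3, x[i+2]); m4 = max(m4, x[i+3])
        i += 4
    end
    m = max(max(m1, m2), max(m3, m4))
    @inbounds while i <= n    # scalar tail for lengths not divisible by 4
        m = max(m, x[i])
        i += 1
    end
    return m
end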