For what it’s worth, I think that for most users and use-cases, naive_findmax is a perfectly appropriate function / implementation to use, so long as you understand that it doesn’t account for some of the corner cases that come up with floats.
It’s just that at a language level, when we write a function like findmax in Base, we have no idea if the user will be sensitive to these things or not, so we typically err on the side of safety here, and try to give better, more robust implementations even if it does cost performance in some circumstances.
It’d be good if we could make @fastmath findmax(itr) do the fast thing for you automatically, but at least for now, it’s quite easy and quick to roll your own.
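For reference, rolling your own might look something like this (my sketch, not necessarily the exact naive_findmax from earlier in the thread; it assumes a non-empty vector and deliberately ignores the NaN and signed-zero corner cases):

function naive_findmax(x::AbstractVector{<:AbstractFloat})
    # Plain `>` comparison: a NaN never compares greater, so NaNs are
    # effectively skipped. That is the corner case Base's findmax guards against.
    maxval = x[1]
    maxidx = 1
    @inbounds for i in 2:length(x)
        if x[i] > maxval
            maxval = x[i]
            maxidx = i
        end
    end
    return maxval, maxidx
end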
So now it’s way faster than that C code (why?), but I’m not even sure it’s the limit of speedup.
I was thinking a naive_findmax could live in Base, or be available under findmax with a keyword argument (and every time Base isn’t the fastest option for something, the docs could explain the faster option that’s available without taking edge cases into account). But this seems to be a deep rabbit hole, and maybe something even faster is possible here.
At the very least, could we document in their docstrings that findmax/findmin aren’t the fastest, and point to alternatives (if not simply add the fastest implementation), likely FindFirstFunctions.jl?
julia> function assignment_probabilities_sorted_gauss(utilities, tmp, cutoffs, mean, stddev)
           last_prob = 1.0
           expectedU = 0.0
           maxfound = -Inf
           maxidx = 0
           sch_left = 500   # problem size is hard-coded to 500 entries here
           # Forward pass: tmp[i] holds the index of the running maximum of utilities[1:i].
           for i in 1:sch_left
               item = utilities[i]
               if item > maxfound
                   maxfound = item
                   maxidx = i
               end
               tmp[i] = maxidx
           end
           # Backward pass: jump straight to the previous prefix maximum instead of
           # re-scanning, which is what turns the quadratic loop into a linear one.
           while sch_left > 0
               ix = tmp[sch_left]
               max_util = utilities[ix]
               if max_util > 0
                   this_prob = cdf_eval(cutoffs[ix], mean, stddev)  # cdf_eval is assumed to be defined elsewhere
                   p = last_prob - this_prob
                   expectedU += p * max_util
                   last_prob = this_prob
                   sch_left = ix - 1
               else
                   break
               end
           end
           return expectedU
       end
That does give you a speedup from N^2 to N in the worst-case where utilities are monotonically increasing (and you might be able to re-use the data in tmp if you ever want to run this on the same utility vector with different cutoffs/mean/stddev).
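For example, a hypothetical refactoring (names like fill_prefix_argmax! are mine, not from the thread) could compute tmp once and reuse it across calls:

# Hypothetical helper: fill tmp[i] with the index of the running maximum of
# utilities[1:i]. Once computed, tmp stays valid for any cutoffs/mean/stddev
# as long as `utilities` itself is unchanged.
function fill_prefix_argmax!(tmp, utilities, n)
    maxfound = -Inf
    maxidx = 0
    for i in 1:n
        if utilities[i] > maxfound
            maxfound = utilities[i]
            maxidx = i
        end
        tmp[i] = maxidx
    end
    return tmp
end

The backward while-loop above would then take the precomputed tmp as an input instead of rebuilding it on every call.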
I feel like there’s got to be a way to make findmax faster on floats despite the annoyance of NaNs. On ARM the fmax and fmin instructions already do the right thing (propagate NaNs), so we should make sure that we emit the simple code there. On x64, it seems like the native instructions actually happen to implement isless (NaNs greater than non-NaNs), which is useful elsewhere, but not exactly what we need here because we want to propagate NaNs. But, I think that we can just use the native instruction for findmax since it has the right order: non-NaNs in standard order followed by NaNs. For findmin we can’t use the native instruction, but we can negate each value and find the maximum and then negate the maximum we find. Not sure if I’m missing anything, but this seems to me like it should work.
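To make the ordering argument concrete, here is a portable sketch in plain Julia (just the semantics a native-instruction version would need to match, not an actual use of fmax/fmin):

# `isless` orders NaN after every other value, so a max-reduction under
# `isless` returns NaN whenever one is present, which is the propagation
# behaviour findmax wants.
isless_max(a, b) = isless(a, b) ? b : a

findmax_isless(x) = foldl(isless_max, x)           # value only, for brevity

# findmin via the negate-and-findmax trick: negation leaves NaN as NaN, so the
# same ordering argument carries over.
findmin_via_negation(x) = -findmax_isless(-v for v in x)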
Note that the native comparisons are not always doing the right thing on x86:
libm/IEEE754: “If one of the arguments is a NaN, the other is returned.”
x86 (Intel manual): “If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result”
Which IIRC is the actual reason none of this is optimized well.
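In Julia terms, the x86 scalar behaviour is what you get from a plain a > b ? a : b, while the libm/IEEE behaviour keeps the non-NaN operand (a sketch just to illustrate the quoted difference):

# x86 maxsd-style: any comparison involving NaN is false, so the second operand wins.
x86_style_max(a, b) = a > b ? a : b

# libm/IEEE fmax-style: if exactly one argument is NaN, return the other one.
ieee_style_max(a, b) = isnan(a) ? b : (isnan(b) ? a : (a > b ? a : b))

x86_style_max(1.0, NaN)   # NaN (the second operand)
x86_style_max(NaN, 1.0)   # 1.0 (again the second operand)
ieee_style_max(NaN, 1.0)  # 1.0 (the non-NaN argument)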
To annoy all the good folks who want consistent semantics, the same is true for the vectorized versions, so at least that is internally consistent.
In any case, if your input is guaranteed free of NaN, the native instruction does exactly what we’d need here. You’ll also get performance parity if you do the blocking right (so that the FMA units are loaded optimally), though clang/LLVM (and thus Julia) is at a disadvantage compared to gcc here.
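For what “doing the blocking right” might look like in Julia, here is a rough sketch with several independent accumulators (the function name and the unroll factor of 4 are my own choices; the best factor depends on the target CPU, and any real speedup needs benchmarking):

# Rough sketch of a blocked max-reduction: four independent accumulators give
# the compiler independent dependency chains to keep the vector units busy.
# Julia's max propagates NaN, so this stays semantically safe.
function blocked_maximum(x::Vector{Float64})
    n = length(x)
    m1 = m2 = m3 = m4 = -Inf
    i = 1
    @inbounds while i + 3 <= n
        m1 = max(m1, x[i]);   m2 = max(m2, x[i+1])
        m3 = max(m3, x[i+2]); m4 = max(m4, x[i+3])
        i += 4
    end
    m = max(max(m1, m2), max(m3, m4))
    @inbounds while i <= n    # scalar tail for lengths not divisible by 4
        m = max(m, x[i])
        i += 1
    end
    return m
end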