C implementation of a function being ~4 times faster even in the absence of allocs

Hi there,

I need to optimize the function I will present here as much as possible. I moved it from C to Julia for convenience, since the operations around it become much easier, but I am getting significantly worse performance in the performance-critical function:

using SpecialFunctions, BenchmarkTools, Libdl
utilities = rand(500);
cutoffs = rand(500);
# JULIA BENCHMARK
cdf_eval(x, mean, stddev) =  0.5 * (1 + erf((x - mean) / (stddev * sqrt(2.0))))
function assignment_probabilities_sorted_gauss(utilities, cutoffs, mean, stddev)
    last_prob = 1.0
    expectedU = 0.0
    sch_left = 500

    while sch_left > 0
        max_util, ix = findmax(@view utilities[1:sch_left])
        if max_util > 0
            this_prob = cdf_eval(cutoffs[ix], mean, stddev)
            p = last_prob - this_prob
            expectedU += p * max_util
            last_prob = this_prob
            sch_left = ix - 1
        else
            break
        end
    end
    return expectedU
end
@benchmark assignment_probabilities_sorted_gauss($utilities, $cutoffs, 0.5, 1.0)
# C BENCHMARK
C_code = """
#include <stddef.h>
#include <math.h>
double normcdf(double x, double mean, double stddev) {
    double z = (x - mean) / stddev;
    return 0.5 * (1.0 + erf(z / sqrt(2.0)));
}
double assignment_probabilities_sorted_gauss(double* utilities, double* cutoffs, size_t size) {
    double last_prob = 1.0;  // Start with full probability
    double expectedU = 0.0;  // Initialize expected utility

    int ix;
    double max_util, this_prob, p;
    double mean = 0.5;      // hardcoded to match the Julia call
    double stddev = 1.0;    // hardcoded to match the Julia call

    // While there are still utilities to process
    while (size > 0) {
        max_util = -INFINITY;  // Reset max_util before each iteration

        // Find the maximum utility and its index (zero-based indexing)
        for (int i = 0; i < size; i++) {
            if (utilities[i] > max_util) {
                max_util = utilities[i];
                ix = i;  // Store the index of the maximum utility
            }
        }

        // If the maximum utility is greater than 0, process it
        if (max_util > 0) {
            // Compute the probability using the normal CDF with the given mean and stddev
            this_prob = normcdf(cutoffs[ix], mean, stddev);  // Compute the normal CDF at the cutoff

            // Calculate the probability for this utility
            p = last_prob - this_prob;
            
            // Update expected utility
            expectedU += p * max_util;
            
            // Remove the processed utility and all lower utilities (size is reduced)
            size = ix;  // Only keep utilities up to the maximum index (zero-indexed)

            // Update last_prob for the next iteration
            last_prob = this_prob;
        } else {
            // If the maximum utility is less or equal to zero, break the loop
            break;
        }
    }
    return expectedU;
}
"""
const Clib = tempname()
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end
c_assignment_probabilities_sorted_gauss(utilities::Array{Float64}, cutoffs::Array{Float64}) =
    ccall(("assignment_probabilities_sorted_gauss", Clib), Float64,
          (Ptr{Float64}, Ptr{Float64}, Csize_t), utilities, cutoffs, length(utilities))
@benchmark c_assignment_probabilities_sorted_gauss($(utilities), $(cutoffs))

The benchmark output for Julia:

julia> @benchmark assignment_probabilities_sorted_gauss($utilities, $cutoffs, 0.5, 1.0)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  4.993 μs …  21.042 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.028 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.104 μs ± 585.617 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆ ▁▄▁  ▁                                                   ▁
  ███████▇██▇▇█▆▅▅▄▃▅▆▆▅▄▅▄▄▁▄▁▃▄▃▄▃▄▃▃▅▁▁▁▅▄▃▄▄▁▄▁▃▁▄▃▄▁▄▁▄▅ █
  4.99 μs      Histogram: log(frequency) by time       7.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The benchmark output for C:

julia> @benchmark c_assignment_probabilities_sorted_gauss($(utilities), $(cutoffs))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.142 μs …  2.954 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.150 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.157 μs ± 67.292 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     █                                        
  ▂▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▄▇▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▂ ▂
  1.14 μs        Histogram: frequency by time        1.17 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I would really appreciate if someone helped out in making these two comparable.
Thanks in advance.


num_schools seems to be a global which is missing from your example?
(This could also be the reason for bad performance, since globals can tank the performance of your whole function)
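For concreteness, this is the kind of difference I mean (a minimal sketch with a hypothetical num_schools global and made-up helper names, not your actual code):

num_schools = 500                       # non-const global: every access is type-unstable

function sum_first_slow(utilities)
    s = 0.0
    for i in 1:num_schools              # reads the non-const global inside the hot loop
        s += utilities[i]
    end
    return s
end

function sum_first_fast(utilities, n)   # pass the size in as an argument instead…
    s = 0.0
    for i in 1:n
        s += utilities[i]
    end
    return s
end

# …or declare `const num_schools = 500` if it really never changes.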

Sorry for this, and thanks for the reply, but it does not change the result. I had it as a constant and just tried replacing it with the value in the function, and the benchmark is the same.


And what about bounds checking? Did you try to add @inbounds to the Julia function?
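E.g. something like this, just a generic sketch of where the macro goes (sum_first is a made-up example, not your function):

function sum_first(xs, n)
    s = 0.0
    @inbounds for i in 1:n    # skips the bounds check on xs[i] inside this block
        s += xs[i]
    end
    return s
end

In your function that would mean putting @inbounds in front of the while loop.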

If you mean adding @inbounds as a prefix to the while loop, I just tried this and it makes no difference. Thanks for the suggestion though.


The next thing I would investigate is the dependence of the performance on the compiler flags, on both sides.

(And maybe expand the findmax into a loop as in the C code)
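For example, on the C side you could recompile with machine-specific flags and see whether the C timing moves (just an experiment, assuming gcc and reusing C_code and Libdl from your snippet; Clib_native is a made-up name):

const Clib_native = tempname()
open(`gcc -fPIC -O3 -march=native -ffast-math -xc -shared -o $(Clib_native * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end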


Profiling shows that findmax in Julia does a lot more work than the simple loop, e.g. for NaN handling…
Translating the C version 1:1 without findmax gives me a 12x speedup:

function assignment_probabilities_sorted_gauss(utilities, cutoffs, mean, stddev)
    last_prob = 1.0
    expectedU = 0.0
    sch_left = length(cutoffs)

    @inbounds while sch_left > 0
        max_util = -Inf
        ix = 0
        # Find the maximum utility and its index
        for i in 1:sch_left
            if utilities[i] > max_util
                max_util = utilities[i]
                ix = i
            end
        end
        if max_util > 0
            this_prob = cdf_eval(cutoffs[ix], mean, stddev)
            p = last_prob - this_prob
            expectedU += p * max_util
            last_prob = this_prob
            sch_left = ix - 1
        else
            break
        end
    end
    return expectedU
end

Oh wow… this is surprising, since I also found findmax to be the largest bottleneck when profiling, but when I compared the performance of the C and Julia findmax parts, they performed about the same.

C_code = """
#include <stddef.h>
#include <math.h>

double find_max(size_t n, double *X) {
    double max_util = -INFINITY;
    int ix;
    for (size_t i = 0; i < n; ++i) {
        if (X[i] > max_util) {
            max_util = X[i];
            ix = i;  // Index kept only to mirror the full function; not returned here
        }
    }
    return max_util;
}
"""
const Clib = tempname()
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end
c_findmax(X::Array{Float64}) = ccall(("find_max", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
julia> @btime c_findmax($(utilities))
  580.126 ns (0 allocations: 0 bytes)
0.9993231992660636
julia> @btime findmax($(utilities))
  611.189 ns (0 allocations: 0 bytes)
(0.9993231992660636, 269)

But you are right, and your version gets the same performance as the C version in the function of interest, so I guess the lesson learned for me is that the gap between 580 ns and 611 ns is apparently enough to matter inside a while loop.

thanks a lot, I appreciate your time.


And @inbounds there doesn’t make a difference? (@sdanisch used it)

Not in my benchmarks; I just tried again. It seems like the speed-up is coming from circumventing findmax.


There’s a shorter way to write a faster findmin and findmax (but with no NaN handling): Faster `findmin` without LoopVectorization.jl - #3 by matthias314

temp = rand(500)

function naive_findmax(w)
    x = @fastmath foldl(max, w)
    i = findfirst(==(x), w)::Int
    x, i
end
julia> @btime naive_findmax($(temp))
  507.223 ns (0 allocations: 0 bytes)
(0.9960969077821132, 422)

Thanks a lot! It does make a significant difference!


It’s a shame really, because this is on average looping over the array 1.5 times (one full pass for the fold, then on average half a pass for findfirst).

Concretely, we can improve this by having

@fastmath findmax(w)

to ignore IEEE754 NaN rules.
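
Until something like that exists, a one-pass NaN-ignoring version looks roughly like this (essentially the inner loop from the 1:1 translation above, pulled out as its own function; a NaN in w[1] would stick, and an empty input errors):

function onepass_findmax(w)
    x = w[1]
    ix = 1
    @inbounds for i in 2:length(w)
        if w[i] > x            # NaN comparisons are false, so later NaNs are never selected
            x = w[i]
            ix = i
        end
    end
    return x, ix
end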

Related if you only want value not the index: Move all platforms to use llvm.minimum/llvm.maximum for fmin/fmax by gbaraldi · Pull Request #56371 · JuliaLang/julia · GitHub


Yeah, sometimes microbenchmarks of one small part of your code can have very misleading results compared to how that part will perform in the wider context of the surrounding code. This might have to do with the branch predictor being warmed up, caches being full, or any number of other confounding variables.

Great catch @sdanisch!


We could. Note that the case @fastmath maximum(w) was added in Julia 1.10. There is also a PR for extrema and some others, not including findmax or argmax.
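
For reference, a minimal usage sketch on Julia ≥ 1.10 (value only, no index; speed will depend on hardware and data):

w = rand(500)
maximum(w)              # IEEE-faithful reduction, propagates NaN
@fastmath maximum(w)    # NaN-ignoring fast-math reduction (Julia ≥ 1.10)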


Hello, can you expand a bit on this? I am a bit surprised that a 5% slower inner function results in an almost 500% slower outer function…

CPUs are complicated beasts. They do all kinds of shenanigans, among them branch prediction. Branch predictors “learn” the behavior of your code. In a micro benchmark of findmax we essentially over-train the branch predictor on one singular data pattern. In a more realistic benchmark, where the data passed to findmax isn’t always the same, the branch predictor is much less effective.
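
One way to see this with BenchmarkTools is to hand findmax fresh data per evaluation instead of one fixed vector (a sketch; exact numbers will vary):

using BenchmarkTools
fixed = rand(500)
@benchmark findmax($fixed)                            # predictor "learns" this one vector across samples
@benchmark findmax(x) setup=(x = rand(500)) evals=1   # fresh random data each evaluation, closer to real usage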


Cf. PSA: Microbenchmarks remember branch history for the effect that @vchuravy alluded to.

Floating-point numbers are annoying with NaNs. It’s quite unsurprising that a generic Base Julia “findmin” function will kinda suck on floating-point numbers:

  1. C/C++ are culturally fine with under-specifying their stuff, i.e. ignoring all corner cases and saying “UB if NaN” unless some language-lawyer digs up a piece of spec that hints at a different interpretation (with different compilers / archs doing different things, because why not).
  2. Julia is not fine with that! There will be some well-defined and considered behavior on NaN values. Whatever this behavior is, on some CPUs / data it will be slower than just ignoring the possibility.

That’s not a big fault of Julia; it’s because “give me the minimum of these numbers” is just a bad request, because it begs the follow-up questions: “Which one do you want if there are NaNs? If there are multiple NaNs, which one do you want (NaN payload!)? If you also want the indices and the max index is non-unique, which one do you want? Suppose the max is zero; do you care about the difference between -0.0 and 0.0?”

To belabor these points:

julia> naive_findmax([1.0, NaN])
ERROR: TypeError: in typeassert, expected Int64, got a value of type Nothing

julia> w=[-0.0, 0.0];
julia> findmax(w)
(0.0, 2)
julia> naive_findmax(w)
(0.0, 1)

Are you really OK with that?


Thanks a lot. This is very useful for my general understanding of how to assess benchmarks.

Thanks for the link and for your detailed response. I appreciate your time. I understand and agree with your point about it being a “bad request”. Nevertheless, for my particular application, I’m not too concerned about the examples you gave.
