Has anyone encountered unstable timing results in Julia, specifically large variance across runs? For example, when I repeat the following experiment 100 times and record the time taken by each run, there are always a few instances where the time is significantly higher than the average. Does anyone know the reason for this? How can I reduce the variance?
using Random, Statistics

n = 10^6;
rlevel = [[0;1;2;3;4;5;6;7;8;9] .// 10; [99//100, 999//1000]];
klevel = [[1; 10; 100; 500] .// 10^4; [1;2;3;4;5;6;7;8;9] .// 10];
res = zeros(100, 2);
Random.seed!(n)
for i in 1:100
    x0 = rand(Float64, n);
    x0sort = sort(x0, rev=true);
    k = Int64(ceil(n * klevel[1]))
    tks = sum(x0sort[1:k]);
    r = tks * float(rlevel[1]);
    res[i, 1] = @elapsed begin
        tol::Float64 = 1e-8    # stopping criterion
        uL::Float64 = -Inf
        diffL::Float64 = vlengthL::Float64 = uR::Float64 = +Inf
        l_fold = unew::Float64 = r / k;
        uold = max_a::Float64 = maximum(x0);
        Ite_init::Integer = 0;
        sum_a_r1_old = 0.0;
        m_old = 0;
        l_fnew = 0.0;
        flag = -2;
        while true
            r1 = x0 .> unew
            m = sum(r1)
            sum_a_r1 = sum(x0[r1])
            l_fnew = (r - sum_a_r1 + m * unew) / k
            if unew - l_fnew < tol
                flag = 0;
                break
            end
            if m == k
                flag = -1;
                break
            end
            if m > k
                break
            end
            Ite_init += 1;
            uold = unew;
            l_fold = l_fnew
            m_old = m;
            sum_a_r1_old = sum_a_r1
            unew = (uold * m - k * l_fold) / (m - k);
        end
    end
    res[i, 2] = @elapsed begin
        r3 = x0 .> l_fold
        compare_result = (r - k * l_fold) - (sum(x0[r3]) - sum(r3) * l_fold) + k * (uold - l_fold)
        flag = -1;
    end
end
mean(res, dims=1)
std(res, dims=1)
Without following the basic performance tips, your benchmarks will not be very valuable. Right now your code operates on global variables and is therefore probably very type-unstable.
Start by reading the Performance Tips section of the manual, then test again.
Nevertheless, you will most likely still see some performance variability; Julia is not yet well suited for hard real-time workloads.
Additionally, you can use BenchmarkTools.jl instead of writing your own benchmarking boilerplate.
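A minimal sketch of the pattern, based on the body of your first @elapsed block (the function name part1, the return values, and the illustrative inputs at the bottom are placeholders, not part of your original code):

using BenchmarkTools

# Sketch: the first @elapsed body moved into a function, so every
# variable is a typed local instead of an untyped global.
function part1(x0::Vector{Float64}, r::Float64, k::Int)
    tol = 1e-8
    unew = r / k
    uold = maximum(x0)
    l_fold = unew
    l_fnew = 0.0
    flag = -2
    while true
        r1 = x0 .> unew            # allocates a BitVector every pass
        m = sum(r1)
        sum_a_r1 = sum(x0[r1])     # allocates a temporary vector
        l_fnew = (r - sum_a_r1 + m * unew) / k
        if unew - l_fnew < tol
            flag = 0
            break
        elseif m == k
            flag = -1
            break
        elseif m > k
            break
        end
        uold = unew
        l_fold = l_fnew
        unew = (uold * m - k * l_fold) / (m - k)
    end
    return l_fold, uold, flag
end

x0 = rand(10^6); k = 100; r = 50.0   # illustrative inputs
@benchmark part1($x0, $r, $k)        # `$` interpolates, avoiding global lookups

Passing everything as arguments and interpolating with $ keeps both the measured code and the benchmark expression free of untyped globals.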
I put the code (the body of res[i, 1] = @elapsed begin ... end) into a function and ran the benchmark on that function. The results are as follows:
julia> @benchmark part1(x0, r, k)
BenchmarkTools.Trial: 2252 samples with 1 evaluation.
 Range (min … max):  1.744 ms …   9.512 ms  ┊ GC (min … max): 0.00% … 77.20%
 Time  (median):     2.056 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.219 ms ± 810.661 μs  ┊ GC (mean ± σ):  6.37% ± 11.83%

  1.74 ms       Histogram: log(frequency) by time       6.6 ms <

 Memory estimate: 7.75 MiB, allocs estimate: 6.
I am confused that there are some extreme cases leading to a high standard error. I have used @code_warntype to check this function and it seems all right. In addition, the while loop in the function runs only once and then exits, so I would expect the running time to be stable.
Furthermore, maybe I cannot use @benchmark in my experiments, because I want to run many independent experiments with different inputs x0.
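(If I understand BenchmarkTools correctly, its setup keyword may cover this case: it regenerates the input before every sample, so each measurement sees an independent x0. A sketch, with part1, r, and k as above:)

using BenchmarkTools

# `setup` runs before each sample; `evals=1` makes each sample a single
# call, so every timing uses a fresh, independent input vector.
@benchmark part1(x, $r, $k) setup=(x = rand(10^6)) evals=1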
Your benchmark output is telling you that garbage collection is the cause. Most of the iterations don't stop for garbage collection, but some do, and when one does, the collection dominates the time taken by the calculation.
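If you want to shrink that tail, the usual fix is to remove the per-iteration allocations (the BitVector from x0 .> unew and the copy from x0[r1]) so the GC has nothing to collect inside the timed region. A sketch of one pass of your loop, fused into a single allocation-free traversal:

# Count and sum the entries above `unew` in one pass, instead of
# materializing `r1 = x0 .> unew` and the temporary `x0[r1]`.
function pass_stats(x0::Vector{Float64}, unew::Float64)
    m = 0
    s = 0.0
    @inbounds for a in x0
        if a > unew
            m += 1
            s += a
        end
    end
    return m, s
end

With zero allocations per iteration, the 7.75 MiB estimate drops to essentially nothing and the GC can no longer interrupt a measurement, which should eliminate most of the variance you are seeing.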