Unstable execution time with high standard error in Julia

Has anyone encountered unstable results from Julia, specifically issues with large variance? For example, when I repeat the following experiment 100 times and record the time taken for each run, I find that there are always a few instances where the time is significantly higher than the average. Does anyone know the reason for this? How can I reduce the variance?

using Random, Statistics

n=10^6;
rlevel = [[0;1;2;3;4;5;6;7;8;9] .//10; [99//100, 999//1000]];
klevel = [[1; 10; 100; 500] .// 10^4; [1;2;3;4;5;6;7;8;9] .// 10]

res = zeros(100,2);
Random.seed!(n)
for i in 1:100
    x0 = rand(Float64, n);
    x0sort = sort(x0, rev=true);
    k = Int64(ceil(n * klevel[1])) # = 100
    tks = sum(x0sort[1:k]);
    r = tks * float(rlevel[1]); # 1

    res[i,1] = @elapsed begin
        tol::Float64 = 1e-8 # stopping criteria
        uL::Float64 = -Inf
        diffL::Float64 = vlengthL::Float64 = uR::Float64 = +Inf
        l_fold = unew::Float64 = r/k;
        uold = max_a::Float64 = maximum(x0);
        Ite_init::Integer = 0;
        sum_a_r1_old = 0.0;
        m_old = 0;
        l_fnew = 0.0;
        flag = -2;
      while true
        r1 = x0 .> unew
        m = sum(r1)
        sum_a_r1 = sum(x0[r1])
        l_fnew = (r - sum_a_r1 + m*unew)/k
    
        if unew - l_fnew < tol
            flag = 0;
            break
        end
  
        if m == k
          flag = -1;
          break
        end
        
        if m > k
          break
        end
        
        Ite_init += 1;
        uold = unew;
        l_fold = l_fnew
        m_old = m;
        sum_a_r1_old = sum_a_r1
        
        unew = (uold*m - k*l_fold)/(m-k);

      end
    end

    res[i, 2] = @elapsed begin
        r3 = x0.>l_fold
        compare_result = (r-k*l_fold) - (sum(x0[r3]) - sum(r3)*l_fold) + k*(uold - l_fold)
        flag = -1;
    end
end
mean(res, dims=1)
std(res, dims=1)


I get the following results. A few of the runs are not stable and take far more time than the rest.

Without having looked at the code: garbage collector pauses?

If that’s the issue, then you want to reduce memory allocations in hot loops. See Performance Tips · The Julia Language, in particular the section “Measure performance with @time and pay attention to memory allocation”.
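To make that concrete, here is a small sketch (the function names are hypothetical, not from the original code) comparing an allocating and a non-allocating way to sum the elements above a threshold; @time prints the allocation count alongside the runtime:

```julia
# f_alloc allocates two temporaries (the boolean mask and the selected
# copy); f_noalloc streams over the data and allocates essentially nothing.
f_alloc(x, u) = sum(x[x .> u])
f_noalloc(x, u) = sum(xi for xi in x if xi > u)

x = rand(10^6)
f_alloc(x, 0.5); f_noalloc(x, 0.5)   # warm up: compile both first
@time f_alloc(x, 0.5)                # reports MiB allocated
@time f_noalloc(x, 0.5)              # reports (almost) zero allocations
```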


the primary thing you need to do is put your code in functions

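A minimal sketch of why this matters (the names here are hypothetical): the same loop is type-unstable when it reads an untyped global, and fast when the data comes in as an argument.

```julia
x_global = rand(10^4)          # untyped global: the compiler cannot
                               # assume its type inside slow_sum

function slow_sum()            # reads the global directly: slow
    s = 0.0
    for xi in x_global
        s += xi
    end
    return s
end

function fast_sum(x)           # x has a concrete type at compile time
    s = 0.0
    for xi in x
        s += xi
    end
    return s
end

slow_sum() ≈ fast_sum(x_global)   # same result; fast_sum is much faster
```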

Without following the basic performance tips, your benchmarks will not be very valuable. Right now your code is working on global variables and is therefore probably very type unstable.

Start with reading this, and then test again.

Nevertheless, you will most likely see performance variability. Julia is not yet well suited for hard realtime workloads.

Additionally, you can use BenchmarkTools.jl instead of making your own benchmarking boilerplate.


I had already put the code into functions before. However, the results are still unstable.

I put the code (the body of res[i, 1] = @elapsed begin ... end) into a function and ran @benchmark on that function. The results are as follows:

julia> @benchmark part1(x0, r, k)
BenchmarkTools.Trial: 2252 samples with 1 evaluation.
 Range (min … max):  1.744 ms …   9.512 ms  ┊ GC (min … max): 0.00% … 77.20%
 Time  (median):     2.056 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.219 ms ± 810.661 μs  ┊ GC (mean ± σ):  6.37% ± 11.83%

      █▆▄▂
  ▆▃██████▇▅▅▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▅▆▇▇▅▆ █
  1.74 ms      Histogram: log(frequency) by time       6.6 ms <

 Memory estimate: 7.75 MiB, allocs estimate: 6.

I am confused about why there are some extreme cases that lead to a high standard error. I have used @code_warntype to check this function and it seems all right. In addition, the while loop in the function runs only once and then exits, so I would expect the running time to be stable.

Furthermore, maybe I cannot use @benchmark in the experiments, because I want to do many independent experiments with different inputs x0.
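Edit: reading the BenchmarkTools docs, it seems its setup keyword can regenerate the input before every sample, so independent inputs per run should be possible. A sketch (part1_demo here is only a stand-in for my real function):

```julia
using BenchmarkTools

# setup= runs before each sample, so every timing sees a fresh x0;
# evals=1 ensures exactly one evaluation per freshly generated input.
part1_demo(x0) = sum(x0)   # stand-in for the real part1
@benchmark part1_demo(x0) setup=(x0 = rand(10^6)) evals=1
```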

Your profile is telling you that Garbage Collection is the cause. Most of the iterations don’t stop for garbage collection, but some do, and when it does that dominates the time taken for the calculation.
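One way to see this directly, as a sketch: Base’s @timed reports the GC share of a single call.

```julia
# @timed returns a named tuple with the elapsed time, bytes allocated,
# and seconds spent in garbage collection for this one call.
stats = @timed sum(rand(10^6))
stats.bytes    # ~8 MB: the temporary array from rand
stats.gctime   # nonzero exactly on the runs that hit a GC pause
```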


Thanks for your reply.

Can you please give me some advice on how to solve this garbage-collection issue?

You need to avoid allocation. On a quick glance I only see this offending piece:

r1 = x0 .> unew
m = sum(r1)

The broadcast allocates a new array. Try this instead, which is equivalent but does not allocate a temporary array:

m = count(>(unew), x0)
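A quick sanity check that the two agree (sketch):

```julia
x0 = rand(10^5)
unew = 0.5
m1 = sum(x0 .> unew)      # allocates a temporary BitVector
m2 = count(>(unew), x0)   # single pass, no temporary
m1 == m2                  # always true
```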

Thanks for your reply! I compute r1 = x0 .> unew because I want to calculate both m = sum(r1) (the number of elements larger than unew) and sum_a_r1 = sum(x0[r1]) (the sum of those elements).

Ah, I missed that. In that case I would recommend just writing a loop that does both in one pass without the need for additional memory.

Do you mean using a for loop to do both? For example:

for xi in x0
    if xi > unew
        m += 1
        sum_a_r1 += xi
    end
end

Yes. This is both more efficient than the previous code and does not allocate additional memory.
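Wrapped up as a self-contained function (the name count_and_sum is mine, not from the thread), with the accumulators initialized:

```julia
# One pass over x0: count the elements above unew and sum them,
# with no temporary mask or copy allocated.
function count_and_sum(x0::AbstractVector{Float64}, unew::Float64)
    m = 0
    sum_a_r1 = 0.0
    for xi in x0
        if xi > unew
            m += 1
            sum_a_r1 += xi
        end
    end
    return m, sum_a_r1
end

m, s = count_and_sum([0.1, 0.9, 0.5, 0.7], 0.6)   # m == 2, s ≈ 1.6
```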

Thank you! It seems that the standard error is smaller than before! However, there are still some cases where the time is extremely high. Maybe I should check other parts of the function and avoid allocation. Thanks for your advice!

Also, you can move x0 out of the loop, e.g. with x0 = fill(0.0, n), and use Random.rand!(x0) inside the loop to avoid allocation. Similarly with x0sort, i.e. x0sort = fill(0.0, n) outside the loop, and x0sort .= x0; sort!(x0sort, rev=true) inside. The tks = sum(x0sort[1:k]) should be written as tks = sum(@view x0sort[1:k]) to avoid allocating the x0sort[1:k] slice.

Note also that declaring the type of a variable, e.g. tol::Float64 = 1e-8, usually has no positive performance impact. Its only effect is that every assignment tol = a is replaced by tol = convert(Float64, a) (and the convert is removed by the compiler if it knows that a is a Float64).
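A tiny sketch of that convert behavior:

```julia
function typed_local_demo()
    tol::Float64 = 1e-8   # declares tol as Float64 for this scope
    tol = 1               # actually runs tol = convert(Float64, 1)
    return tol            # 1.0, a Float64, not the Int 1
end
typed_local_demo()
```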

The maximum(x0) can be replaced with x0sort[1].
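Putting those suggestions together, the experiment loop might be restructured like this (a sketch, with a smaller n and fewer repetitions for illustration):

```julia
using Random

n = 10^4
k = 100
x0 = fill(0.0, n)         # allocated once, outside the loop
x0sort = fill(0.0, n)

Random.seed!(1)
for i in 1:3
    Random.rand!(x0)               # refill in place, no new array
    x0sort .= x0                   # copy in place
    sort!(x0sort, rev=true)        # sort in place
    tks = sum(@view x0sort[1:k])   # view: no slice allocation
    max_a = x0sort[1]              # replaces maximum(x0)
end
```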


I see! Thanks for your help!

As others mentioned, there are various optimizations you can do: Putting stuff into functions, avoiding globals, reducing allocations.

However, I must ask: Why is the variance of runtime a problem for you?

This is not an idle question, but something you need to think about! What is your performance target?

Reasonable performance targets in different applications are:

  1. Optimize the average (mean / amortized) runtime. This is very appropriate if you need to run your function very often, and you only care about total runtime. That’s applicable for batch computation, and is the main focus of julialang.
  2. Optimize the (generalized) median runtime, say the p99: take the slowest 1% of runs and try to minimize the time those take. This is what you want for e.g. web apps / interactive things. To get there, you can reduce the allocations in your code until GC runs seldom enough not to mess that up.
  3. Have some time-window, minimize the number (or exclude the possibility) of runs that exceed this time-window. This is appropriate for real-time systems (must finish in 200us, otherwise your control system crashes). Julia (and generally, hardware that can run julia) are not very appropriate for β€œhard realtime” (where missed deadlines must be avoided); opinions differ on β€œsoft realtime” (where missed deadlines must be somewhat rare).

Not true. Linux-RT is hard-realtime capable, and companies like ASML, the largest chip-machine manufacturer, even use Julia commercially for the control of their machines.

But if you need hard real-time you have to avoid many Julia packages and use only functions that are allocation free. You can use AllocCheck.jl to check if your functions are allocation free.

For soft real-time it can also help to run the garbage collector in partial mode more often, like this:

GC.gc(false)

Then you get, for example, a small delay on every call of your function, instead of a large delay once in a while.
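As a sketch, spreading the collections across iterations looks like this (run_all and the per-iteration work are hypothetical stand-ins, not from the thread):

```julia
function run_all(inputs)
    results = Vector{Float64}(undef, length(inputs))
    for (i, x) in pairs(inputs)
        results[i] = sum(abs2, x)   # stand-in for the real workload
        GC.gc(false)                # partial collection: many small
                                    # pauses instead of one big one
    end
    return results
end

run_all([rand(10) for _ in 1:5])
```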

But anything that is timing-critical should use Linux and not Windows (I don’t know about macOS).

In the direction of soft realtime: something I’d like to see is tooling where you have two redundant Julia processes running on the same system with some shared memory, and control is handed off between them while the other process does house-keeping and GC. That would suit applications soft enough for Linux + mainstream CPUs, where e.g. the JVM with ZGC would be usable. (Overprovisioning compute and memory by <2x would often be a small price for solving the GC-pause latency issue.)
