Has anyone encountered unstable results from Julia, specifically large variance in run times? For example, when I repeat the following experiment 100 times and record the time taken for each run, I find that there are always a few instances where the time is significantly higher than the average. Does anyone know the reason for this? How can I reduce the variance?
n=10^6;
rlevel = [[0;1;2;3;4;5;6;7;8;9] .//10; [99//100, 999//1000]];
klevel = [[1; 10; 100; 500] .// 10^4; [1;2;3;4;5;6;7;8;9] .// 10]
res = zeros(100,2);
Random.seed!(n)
for i in 1:100
x0 = rand(Float64, n);
x0sort = sort(x0, rev=true);
k = Int64(ceil(n * klevel[1])) # = 100 for n = 10^6
tks = sum(x0sort[1:k]);
r = tks * float(rlevel[1]); # 1
res[i,1] = @elapsed begin
tol::Float64 = 1e-8 # stopping criteria
uL::Float64 = -Inf
diffL::Float64 = vlengthL::Float64 = uR::Float64 = +Inf
l_fold = unew::Float64 = r/k;
uold = max_a::Float64 = maximum(x0);
Ite_init::Integer = 0;
sum_a_r1_old = 0.0;
m_old = 0;
l_fnew = 0.0;
flag = -2;
while true
r1 = x0 .> unew
m = sum(r1)
sum_a_r1 = sum(x0[r1])
l_fnew = (r - sum_a_r1 + m*unew)/k
if unew - l_fnew < tol
flag = 0;
break
end
if m == k
flag = -1;
break
end
if m > k
break
end
Ite_init += 1;
uold = unew;
l_fold = l_fnew
m_old = m;
sum_a_r1_old = sum_a_r1
unew = (uold*m - k*l_fold)/(m-k);
end
end
res[i, 2] = @elapsed begin
r3 = x0.>l_fold
compare_result = (r-k*l_fold) - (sum(x0[r3]) - sum(r3)*l_fold) + k*(uold - l_fold)
flag = -1;
end
end
mean(res, dims=1)
std(res, dims=1)
Without following the basic performance tips, your benchmarks will not be very valuable. Right now your code is working on global variables and is therefore probably very type unstable.
Start with reading this, and then test again.
Nevertheless, you will most likely see performance variability. Julia is not yet well suited for hard realtime workloads.
Additionally, you can use BenchmarkTools.jl instead of making your own benchmarking boilerplate.
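For instance, a minimal sketch (the seed and array size here are arbitrary) would be:

```julia
using BenchmarkTools, Random

Random.seed!(42)
x0 = rand(10^6)

# Interpolate globals with `$` so the benchmark measures the call itself,
# not the cost of accessing untyped global variables:
@benchmark sort($x0, rev=true)
```

BenchmarkTools then reports min/median/mean times, GC percentages, and allocation counts, which is far more informative than a hand-rolled @elapsed loop.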
I put the code (the body of res[i, 1] = @elapsed begin ... end) into a function and applied @benchmark to it. The results are as follows:
julia> @benchmark part1(x0, r, k)
BenchmarkTools.Trial: 2252 samples with 1 evaluation.
Range (min … max):  1.744 ms … 9.512 ms  ┊ GC (min … max): 0.00% … 77.20%
Time  (median):     2.056 ms             ┊ GC (median):    0.00%
Time  (mean ± σ):   2.219 ms ± 810.661 μs  ┊ GC (mean ± σ): 6.37% ± 11.83%

  [histogram bars not reproduced]
  1.74 ms      Histogram: log(frequency) by time      6.6 ms <

Memory estimate: 7.75 MiB, allocs estimate: 6.
I am confused about why there are still some extreme cases, leading to a high standard deviation. I have used @code_warntype to check this function and it seems all right. In addition, the while loop in the function runs only once and then exits, so I would expect the running time to be stable.
Furthermore, maybe I cannot use @benchmark in the experiments, because I want to do many independent experiments with different inputs x0.
Your benchmark output is telling you that garbage collection is the cause. Most of the iterations don't stop for garbage collection, but some do, and when one does, GC dominates the time taken for the calculation.
Thanks for your reply! I record r1 = x0 .> unew since I want to calculate both m = sum(r1) (the number of elements larger than unew) and sum_a_r1 = sum(x0[r1]) (the sum of the elements larger than unew).
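For reference, both quantities can also be computed in a single allocation-free pass; a sketch (the helper name count_and_sum is made up) might look like:

```julia
# One pass over x, no temporary arrays like r1 or x0[r1]:
function count_and_sum(x, u)
    m = 0        # number of elements larger than u
    s = 0.0      # sum of those elements
    for v in x
        if v > u
            m += 1
            s += v
        end
    end
    return m, s
end

# replaces: r1 = x0 .> unew; m = sum(r1); sum_a_r1 = sum(x0[r1])
m, sum_a_r1 = count_and_sum(x0, unew)
```

This avoids allocating the boolean mask and the copied subarray on every loop iteration, which is exactly what feeds the GC.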
Thank you! It seems that the standard deviation is smaller than before! However, there are still some cases where the time is extremely high. Maybe I should check the other parts of the function and avoid allocations there too. Thanks for your advice!
Also, you can move the allocation of x0 out of the loop, e.g. with x0 = fill(0.0, n), and use Random.rand!(x0) inside the loop to avoid allocating a fresh array each iteration. Similarly with x0sort, i.e. x0sort = fill(0.0, n) outside the loop, and x0sort .= x0; sort!(x0sort, rev=true) inside. The tks = sum(x0sort[1:k]) should be written as tks = sum(@view x0sort[1:k]) to avoid allocating a copy of the 1:k section.
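Putting those suggestions together, the experiment setup might look like this (assuming n and k are defined as earlier):

```julia
using Random

x0 = fill(0.0, n)        # allocated once, outside the loop
x0sort = fill(0.0, n)

for i in 1:100
    Random.rand!(x0)               # refill in place, no new array
    x0sort .= x0                   # copy in place
    sort!(x0sort, rev=true)        # sort in place
    tks = sum(@view x0sort[1:k])   # view avoids copying x0sort[1:k]
    # ... rest of the experiment as before ...
end
```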
Note also that declaring the type of variables, e.g. tol::Float64 = 1e-8, usually has no positive performance impact. Its only effect is that every assignment tol = a is replaced by tol = convert(Float64, a) (and the convert is removed by the compiler if it knows that a is a Float64).
As others mentioned, there are various optimizations you can do: Putting stuff into functions, avoiding globals, reducing allocations.
However, I must ask: Why is the variance of runtime a problem for you?
This is not an idle question, but something you need to think about! What is your performance target?
Reasonable performance targets in different applications are:
Optimize the average (mean / amortized) runtime. This is very appropriate if you need to run your function very often, and you only care about total runtime. That's applicable for batch computation, and is the main focus of julialang.
Optimize the (generalized) median runtime, say p99: Take the slowest 1% of runs, try to minimize the time taken for these slowish guys. This is what you want for e.g. web-apps / interactive things. To get there, you can reduce the allocations in your code until GC is running seldom enough to not mess that up.
Have some time-window, and minimize the number (or exclude the possibility) of runs that exceed this time-window. This is appropriate for real-time systems (must finish in 200us, otherwise your control system crashes). Julia (and generally, hardware that can run Julia) is not very appropriate for "hard realtime" (where missed deadlines must be avoided); opinions differ on "soft realtime" (where missed deadlines must be somewhat rare).
Not true. Linux-RT is hard-real-time capable. And companies like ASML, the largest chip-machine manufacturer, even use Julia commercially for the control of their machines.
But if you need hard real-time you have to avoid many Julia packages and use only functions that are allocation free. You can use AllocCheck.jl to check if your functions are allocation free.
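A minimal AllocCheck.jl sketch (assuming the package's @check_allocs macro; multiply is just an example function):

```julia
using AllocCheck

# `@check_allocs` makes the call throw if the compiled method may allocate:
@check_allocs multiply(x, y) = x * y

multiply(1.5, 2.5)                  # Float64 multiply: allocation-free, OK
# multiply(rand(3, 3), rand(3, 3))  # would throw: matrix product allocates
```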
For soft real-time it can also help to run the garbage collector in partial mode more often, like this:
GC.gc(false)
Then you get, for example, a small delay on every call of your function, instead of a large delay once in a while.
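In a control loop that might look like the following sketch (do_step! and state are hypothetical placeholders for the timing-critical work):

```julia
for i in 1:100
    do_step!(state)   # hypothetical timing-critical work per iteration
    GC.gc(false)      # incremental (non-full) collection after each step
end
```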
But anything that is timing critical should use Linux and not Windows (I don't know about macOS).
In the direction of soft realtime: something I'd like to see is tooling where you have two redundant Julia processes running on the same system with some shared memory, and control is handed off between them while the idle process does housekeeping and GC. That would cover applications that are soft enough for Linux + mainstream CPUs, where e.g. the JVM with ZGC would be usable. (Overprovisioning compute and memory by <2x would often be a small price for solving the GC-pause latency issue.)