Fluctuations when measuring execution time of linear algebra code

I’m trying to measure the execution time of linear algebra expressions, and I’m seeing strange patterns in the results. Here is an MWE that reproduces the behavior for me:

using DelimitedFiles
using LinearAlgebra
BLAS.set_num_threads(1)

M100 = rand(500, 200)
M101 = rand(500, 200)
M102 = tril(rand(500, 500))
M103 = triu(rand(500, 500))

function fun(M102::Array{Float64,2}, M100::Array{Float64,2}, M101::Array{Float64,2}, M103::Array{Float64,2})
    start = time_ns()
    out = M102+M100*transpose(M101)+transpose(M103);
    finish = time_ns()
    return (finish-start)*1e-9
end

iters = 500
timings = Array{Float64}(undef, iters)
for i = 1:iters
    time = fun(M102, M100, M101, M103)
    timings[i] = time
end
file = open("timings.txt", "w")
writedlm(file, timings)
close(file)

Plotting the results, I get this:

Can anyone explain where those periodic fluctuations come from, and does anyone know if there’s anything I can do to remove them? Given that there is almost a 2x difference between the slow and fast cases, it seems like something is a bit off.

Some more details about the problem:

  • I’m running a large number of similar experiments, and this behavior only shows up for a few of them, so I don’t think this comes from the hardware.
  • I suspected it was related to garbage collection, because that’s something that may run periodically, but even temporarily disabling garbage collection and calling GC.gc() between runs doesn’t remove the pattern. I also used @timev, and the garbage collection times didn’t explain the fluctuations.
  • I see similar fluctuations when I run this code on my laptop, but that’s obviously in general a much more noisy system.
  • I don’t want to use something like BenchmarkTools because I’m running the same experiments in other languages (for example Matlab); for the sake of consistency, I need to be able to do the exact same thing across all languages.

Some more details about the setup:

  • I’m running the code on a cluster machine with CentOS 7.6 and Turbo Boost disabled.
  • The Julia version is 1.1.0-DEV.468 (for historical reasons; I don’t mind changing that if it helps).
  • JULIA_NUM_THREADS is set to 1.

You’ll want to avoid benchmarking in global scope. Or if you do, use constant bindings.
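For reference, a minimal sketch of the “constant bindings” fix; the names A and B here are hypothetical stand-ins, not the MWE’s matrices:

```julia
using LinearAlgebra
BLAS.set_num_threads(1)

# `const` makes the global bindings type-stable, so top-level code that
# references them compiles against concrete types instead of `Any`.
const A = rand(500, 200)   # hypothetical stand-ins for M100 / M101
const B = rand(500, 200)

# Alternatively, pass everything as function arguments (as `fun` above
# already does), so globals never enter the timed code at all.
g() = A * transpose(B)
C = g()                    # 500×500 result
```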

I don’t think the benchmark itself is in global scope? I also don’t see a difference if I use consts.

Unfortunately I can’t offer an explanation, but I can reproduce similar behavior. However, I have much more noise and a much lower variance between results (factor 1.4 vs factor 2 for you). Maybe I’m just observing the lower spikes?

You are probably counting how many garbage collections occur: 0, 1 or occasionally 2.


GC.enable(false) seems to confirm this on my end.


Extra confirmation that it’s related to GC: just add @time to the call, as in:

for i = 1:iters
    @time time = fun(M102, M100, M101, M103)
    timings[i] = time
end

Here is an extract of my output:

  0.003763 seconds (7 allocations: 5.722 MiB)
  0.003211 seconds (7 allocations: 5.722 MiB)
  0.003299 seconds (7 allocations: 5.722 MiB)
  0.005142 seconds (7 allocations: 5.722 MiB, 24.77% gc time)
  0.003884 seconds (7 allocations: 5.722 MiB)
  0.003323 seconds (7 allocations: 5.722 MiB)
  0.003234 seconds (7 allocations: 5.722 MiB)
  0.005161 seconds (7 allocations: 5.722 MiB, 26.92% gc time)
  0.003863 seconds (7 allocations: 5.722 MiB)
  0.003247 seconds (7 allocations: 5.722 MiB)
  0.003192 seconds (7 allocations: 5.722 MiB)
  0.004971 seconds (7 allocations: 5.722 MiB, 25.71% gc time)
  0.003903 seconds (7 allocations: 5.722 MiB)
  0.003457 seconds (7 allocations: 5.722 MiB)
  0.003231 seconds (7 allocations: 5.722 MiB)
  0.004885 seconds (7 allocations: 5.722 MiB, 26.29% gc time)
  0.003905 seconds (7 allocations: 5.722 MiB)
  0.003368 seconds (7 allocations: 5.722 MiB)
  0.003190 seconds (7 allocations: 5.722 MiB)
  0.004864 seconds (7 allocations: 5.722 MiB, 26.04% gc time)
  0.003854 seconds (7 allocations: 5.722 MiB)
  0.003265 seconds (7 allocations: 5.722 MiB)
  0.003272 seconds (7 allocations: 5.722 MiB)

You need to get rid of the temporary allocations.
What about:

function fun!(out, M102::Array{Float64,2}, M100::Array{Float64,2}, M101::Array{Float64,2}, M103::Array{Float64,2})
    start = time_ns()
    mul!(out, M100, transpose(M101))
    out .+= M102 .+ transpose(M103)
    finish = time_ns()
    return (finish-start)*1e-9
end

out = similar(M103)
for i = 1:iters
    @time time = fun!(out, M102, M100, M101, M103)
    timings[i] = time
end

giving:

...
  0.002901 seconds (1 allocation: 16 bytes)
  0.002947 seconds (1 allocation: 16 bytes)
  0.003033 seconds (1 allocation: 16 bytes)
  0.002971 seconds (1 allocation: 16 bytes)
  0.002938 seconds (1 allocation: 16 bytes)
  0.002950 seconds (1 allocation: 16 bytes)
  0.002954 seconds (1 allocation: 16 bytes)
  0.002952 seconds (1 allocation: 16 bytes)
  0.002914 seconds (1 allocation: 16 bytes)
  0.002918 seconds (1 allocation: 16 bytes)
  0.003057 seconds (1 allocation: 16 bytes)
  0.002959 seconds (1 allocation: 16 bytes)
  0.003292 seconds (1 allocation: 16 bytes)
  0.003163 seconds (1 allocation: 16 bytes)
  0.002953 seconds (1 allocation: 16 bytes)
  0.002896 seconds (1 allocation: 16 bytes)
  0.002979 seconds (1 allocation: 16 bytes)

Thank you for your input. I changed two things: I’m using a main function, and I’m disabling garbage collection (see below).

using DelimitedFiles
using LinearAlgebra
BLAS.set_num_threads(1)

function fun(M102::Array{Float64,2}, M100::Array{Float64,2}, M101::Array{Float64,2}, M103::Array{Float64,2})
  GC.enable(false)
  start = time_ns()
  out = M102+M100*transpose(M101)+transpose(M103);
  finish = time_ns()
  GC.enable(true)
  return (finish-start)*1e-9
end

function main()
  M100 = rand(500, 200)
  M101 = rand(500, 200)
  M102 = tril(rand(500, 500))
  M103 = triu(rand(500, 500))

  iters = 500
  timings = Array{Float64}(undef, iters)
  for i = 1:iters
    time = fun(M102, M100, M101, M103)
    timings[i] = time
  end
  file = open("timings.txt", "w")
  writedlm(file, timings)
  close(file)
end

main()

Now I get this:


I didn’t change the range of the y-axis. Compared to the previous plot, it looks like now we mostly have slow cases, and we never reach 0.0035s. Doesn’t that contradict the assumption that the slow cases were previously caused by garbage collection?


Unfortunately, that defeats the purpose of the experiment. We’re trying to measure how well Julia does for expressions such as M102+M100*transpose(M101)+transpose(M103) without rewriting them in a smart way.


Given that GC occurs in lumps, you will then just have to take this into account for your results, and give up the requirement of having near-equal timings for every run.

When asked for a single summary, BenchmarkTools AFAIK reports the minimum elapsed time, but the average/median time may be more informative in your case.
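As a stdlib-only illustration of that trade-off (this is not BenchmarkTools itself, and sample_times is a hypothetical helper), one can sample a kernel with time_ns, the same way the MWE does, and compare the two summaries:

```julia
using Statistics

# Sample a kernel n times with time_ns (like the MWE) and return the
# raw timings in seconds, so different summaries can be compared.
function sample_times(f, n)
    times = Vector{Float64}(undef, n)
    for i in 1:n
        t0 = time_ns()
        f()
        times[i] = (time_ns() - t0) * 1e-9
    end
    return times
end

x = rand(200, 200)                      # hypothetical small workload
ts = sample_times(() -> x * x, 100)
t_min, t_med = minimum(ts), median(ts)  # min ignores noise; median keeps it
```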

I reran all of my experiments with the two changes we talked about (using main and GC.enable(false)). Overall, those changes seem to make things better, but there are still a few test cases that keep showing those fluctuations. I tried to reproduce the fluctuations with another MWE, adding some things that we do in our actual experiments:

  • This is a different test problem.
  • We put the operands into a tuple and use tuple unpacking when calling the function. We do this because it makes the implementation of our experiments simpler.
  • We call GC.gc() before disabling garbage collection. This seems to make the results more stable (if they are stable at all).

The code looks like this:

using DelimitedFiles
using LinearAlgebra
BLAS.set_num_threads(1)

function fun(M109::Array{Float64,2}, M110::Diagonal{Float64,Array{Float64,1}}, M111::Array{Float64,2}, M112::Array{Float64,2}, v113::Array{Float64,1})
  GC.gc()
  GC.enable(false)
  start = time_ns()
  out = transpose(M109)*M110*(M111+transpose(M112))*v113;
  finish = time_ns()
  GC.enable(true)
  return (tuple(out), (finish-start)*1e-9)
end

function main()

  M109 = rand(1150,1300)
  M110 = Diagonal(rand(1150))
  M111 = rand(1150,150)
  M112 = rand(150,1150)
  v113 = rand(150)
  matrices = (M109, M110, M111, M112, v113)

  iters = 500
  timings = Array{Float64}(undef, iters)
  for i = 1:iters
    out, time = fun(matrices...)
    timings[i] = time
  end
  file = open("timings.txt", "w")
  writedlm(file, timings)
  close(file)
end

main()

Now there is some element of randomness, so sometimes it’s very stable, and sometimes there are again those fluctuations:


Notice that again, when it’s stable, it’s running relatively “slow”, except for a few fast iterations in the beginning. The slow cases are about 1.4x slower than the fast cases. What could cause this behavior?

I agree in the sense that if this is how Julia behaves, this is what we should report. However, we run a lot of tests and also compute speedups; it would be useful to have a single number summarizing those results, but I find it troublesome to reduce results like these to a single number (no matter whether it’s the minimum, average, or median).

Summarizing a distribution with a few numbers is pretty much the fundamental question of statistics :wink: But of course this requires thinking about the context a bit — what would you use these benchmarks for?

If you care about expected runtime, the sample mean is a good estimator of that.

Or if you are concerned about tasks taking too long occasionally, then the maximum or the top 1% is useful, too.
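As a sketch with the Statistics stdlib, here is how those summaries differ on a hypothetical bimodal sample shaped like the plots above (fast runs near 3.2 ms, GC-hit runs near 5 ms):

```julia
using Statistics

# Hypothetical bimodal timing sample (seconds): 75 fast runs and
# 25 slower, GC-hit runs, mimicking the periodic spikes in the plots.
timings = vcat(fill(0.0032, 75), fill(0.0050, 25))

stats = (
    minimum = minimum(timings),        # best case: ignores GC entirely
    median  = median(timings),         # typical case: robust to the spikes
    mean    = mean(timings),           # expected runtime, GC lumps included
    p99     = quantile(timings, 0.99), # tail: "tasks taking too long"
)
```

Note that the minimum and median agree here (both land in the fast mode), while the mean is pulled upward by the GC spikes; which one is “fair” depends on what the benchmark is for.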

Even when you disable garbage collection, allocation still has to happen, and that could have a stochastic cost (just a guess though).


We want to compare different implementations of linear algebra problems, both within Julia and between different languages. So the comparison needs to be “fair”, and I’m afraid there is no simple answer for what exactly that means, especially when comparing different languages.

I noticed those fluctuations because we were trying to adaptively choose the number of iterations based on a test that computes a non-parametric confidence interval for the median (described here, using “Rule 5” and computing the confidence interval as described in Sec. 3.1.3). With those fluctuations, the number of iterations required to achieve a given bound on the confidence interval becomes very large, or it may never be reached at all.

It might be possible to increase the width of that confidence interval to a point where things work, but with a 1.4x difference this would probably be a bit excessive.
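I’m not familiar with that paper’s exact rule, but for reference, a standard nonparametric confidence interval for the median can be built from order statistics; this is a sketch of that general idea, not the cited “Rule 5”:

```julia
# For a sorted sample s[1..n], the interval (s[l], s[u]) covers the
# median with probability sum_{i=l}^{u-1} C(n,i) / 2^n; start from the
# middle and widen symmetrically until the coverage reaches `level`.
function median_ci(x::Vector{Float64}; level=0.95)
    n = length(x)
    s = sort(x)
    coverage(l, u) = sum(binomial(big(n), i) for i in l:u-1) / big(2)^n
    l, u = div(n, 2), div(n, 2) + 1
    while l > 1 && u < n && coverage(l, u) < level
        l -= 1
        u += 1
    end
    return s[l], s[u]
end

lo, hi = median_ci(collect(1.0:100.0))  # CI around the sample median 50.5
```

With the bimodal timings in question, the two modes sit on opposite sides of such an interval, so the interval never narrows, which matches the observation that the required number of iterations blows up.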

It seems like the state of the memory (and by that I don’t mean caching) has a non-trivial impact on the execution time of those pieces of code. The fluctuations could then be explained as the result of cycling between a number of more or less favorable states. I did another small experiment that perhaps supports this suspicion: I replaced GC.gc() with

if rand() > 0.5
  GC.gc()
end

and the fluctuations are not periodic anymore:

Edit:

Further supporting the suspicion that it’s something with memory: I used Sys.free_memory() to get the amount of free memory after calling GC.gc() and before starting the timer. When plotting free memory and execution time, it’s obvious that there is a connection:

The unit for the amount of memory in the plot should be GB. I think there is an error in the documentation: according to it, Sys.free_memory() returns kilobytes, but it’s actually bytes; otherwise the numbers don’t make sense. I’m only showing iterations 100–150 to make it easier to see the patterns.
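For reference, the conversion used for the plot (assuming, as observed above, that the return value is in bytes):

```julia
# Assuming Sys.free_memory() returns a byte count (not kilobytes,
# despite what the docs say): divide by 1e9 to plot in GB.
free_bytes = Sys.free_memory()
free_gb = free_bytes / 1e9
```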
