Why does my program's performance vary so much from run to run? Can it be fixed?

While looking into optimizing my pure Julia program, I noticed something very unpleasant: its run time varies a great deal from run to run, even though the input and code stay the same and the system has plenty of available CPU and RAM. The code is here: https://github.com/nsajko/discrete-math-lab3/blob/master/lab3.jl

You can imagine that this high variance makes measuring improvements to the code very difficult, as I would have to take a great many measurements to tell whether a change was an optimization or a pessimization.

I’m using Julia 1.5.3 and LLVM 10.0.1 (Archlinux).

Concrete data:

These are a few Julia sessions that use the @time macro to show how much the performance can vary from run to run. The minimum time was around 7.5 seconds and the maximum around 14 seconds, though oftentimes the immediately following timing is about the same as the one before.

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.3 (2020-11-09)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> include("lab3.jl")
Main.Lab3

julia> f = "hc/15.txt"
"hc/15.txt"

julia> @time Lab3.prChrNum(f)
6
 13.189632 seconds (17.19 k allocations: 1.335 MiB)

julia> @time Lab3.prChrNum(f)
6
 13.409172 seconds (4.07 k allocations: 695.953 KiB)

julia>
$ julia --color=no -O3
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.3 (2020-11-09)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> include("lab3.jl")
Main.Lab3

julia> f = "hc/15.txt"
"hc/15.txt"

julia> @time Lab3.prChrNum(f)
6
 14.162228 seconds (17.19 k allocations: 1.335 MiB)

julia> @time Lab3.prChrNum(f)
6
 14.025255 seconds (4.07 k allocations: 695.953 KiB)

julia> @time Lab3.prChrNum(f)
6
 14.165792 seconds (4.07 k allocations: 695.625 KiB)

julia>
$ julia --color=no -O3
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.3 (2020-11-09)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> include("lab3.jl")
Main.Lab3

julia> f = "hc/15.txt"
"hc/15.txt"

julia> @time Lab3.prChrNum(f)
6
 14.177292 seconds (17.19 k allocations: 1.335 MiB)

julia> @time Lab3.prChrNum(f)
6
 14.152514 seconds (4.07 k allocations: 695.953 KiB)

julia> @time Lab3.prChrNum(f)
6
 14.135292 seconds (4.07 k allocations: 695.625 KiB)

julia> @time Lab3.prChrNum(f)
6
 14.312558 seconds (4.07 k allocations: 695.625 KiB)

julia> @time Lab3.prChrNum(f)
6
  8.205959 seconds (4.07 k allocations: 695.625 KiB)

julia> @time Lab3.prChrNum(f)
6
  8.138033 seconds (4.07 k allocations: 695.625 KiB)

julia> @time Lab3.prChrNum(f)
6
 13.450250 seconds (4.07 k allocations: 695.953 KiB)

julia> @time Lab3.prChrNum(f)
6
 13.844985 seconds (4.07 k allocations: 695.625 KiB, 0.03% gc time)

julia>
$ julia --color=no -O3 --check-bounds=no --track-allocation=none
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.3 (2020-11-09)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

julia> using Profile, StatProfilerHTML

julia> f = "hc/15.txt"
"hc/15.txt"

julia> include("lab3.jl")
Main.Lab3

julia> @time Lab3.prChrNum(f)
6
  7.545629 seconds (17.19 k allocations: 1.334 MiB)

julia> @time Lab3.prChrNum(f)
6
  7.532154 seconds (4.07 k allocations: 695.625 KiB)

julia> @profile Lab3.prChrNum(f)
6

This is the input file for my program (“hc/15.txt”):

15

0 0 1 1 1 1 0 0 1 0 1 0 1 0 1
0 0 1 0 0 1 1 1 1 0 0 0 1 1 1
1 1 0 0 1 0 1 1 1 1 1 0 1 1 0
1 0 0 0 1 1 1 0 1 0 1 1 0 1 0
1 0 1 1 0 1 0 1 1 1 1 1 1 1 1
1 1 0 1 1 0 1 0 1 0 1 1 1 0 1
0 1 1 1 0 1 0 0 0 1 1 1 0 1 1
0 1 1 0 1 0 0 0 1 1 1 1 0 1 0
1 1 1 1 1 1 0 1 0 0 0 0 1 0 1
0 0 1 0 1 0 1 1 0 0 0 1 1 1 0
1 0 1 1 1 1 1 1 0 0 0 1 1 1 1
0 0 0 1 1 1 1 1 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0 1 1 1 1 0 1 1
0 1 1 1 1 0 1 1 0 1 1 0 1 0 1
1 1 0 0 1 1 1 0 1 0 1 1 1 1 0

Here is the data from the profiling that I did at the end of the last session shown above:

I hope that someone can explain what Julia is messing up to make the run time go from 7.5 seconds to 14 seconds, and what I can do to keep it around 8 seconds. Hopefully without changing my code.


Tangentially related question: I’m having trouble with some parts of the profiling report; it seems that the function iterate appears in some unexpected places in the profile. For example:

This is the data for line 64 of lab3.jl, which consists of this code: (cn < min) && (min = cn). As you can see, there is no for loop in this line, so I don’t understand what the connection between the function iterate and this line is. It even seems like this could be a bug in the profiler?

(BTW, cn and min are Ints.)

The quick answer is to use @btime or @benchmark from BenchmarkTools, which report the minimum, maximum, and mean times. I am not sure how much the times of single runs can vary in general, but if you look at that output, there is indeed generally a huge variation.
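
For example, a benchmark of the function from the original post could look roughly like this (a sketch assuming lab3.jl and hc/15.txt are available; the samples and seconds settings are only illustrative):

using BenchmarkTools

include("lab3.jl")
f = "hc/15.txt"

# Interpolate `f` with `$` so the global-variable lookup is not part of what is measured;
# cap the sample count and raise the time budget, since one call takes several seconds.
@benchmark Lab3.prChrNum($f) samples=20 seconds=600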


The problem here is that the distribution of the timings seems unacceptably bad. Usually (from my experience with C and C++) the vast majority of performance measurements fit into several “levels”, where each level is a steep spike around a local maximum of the probability density function, and each successive maximum of the PDF is significantly lower than the previous one. Here’s an example of what I’m talking about (the data is for some unrelated programs): 1023-1013.svg - Google Drive Each plot in that SVG is the PDF of the measured run times of a distinct implementation of a certain program (the goal was to compare the different implementations).

The situation here (with Julia and my program) is different, firstly because 14 seconds is so much greater than 8 seconds, and secondly because values around 14 do not appear much less frequently than values around 8; in fact, values around 14 even seem to appear more frequently than values around 8 do.

This whole thing indicates that there’s something terribly wrong in either Julia or the way I use Julia, so, being a newbie, I was hoping the cause would be something obvious to other participants of this forum. There would be little sense in attempting benchmarking if I cannot get to the bottom of this.

EDIT: in other words, it seems like events that really ruin performance happen very frequently while I’m executing my program.


I can confirm that it happens on my machine as well (with your code, at least).


Did you try comparing a profile of a fast run with a profile of a slow run? Is the time distribution across the called functions/operators the same (do they all decrease by the same amount), or does the time spent in specific parts of the code go down?


Just to be sure (I didn’t see it in the code): you (and the functions you use) aren’t using any randomized algorithm, are you?

As far as I know, my algorithms are deterministic. One thing comes to mind, though: maybe Julia itself does some randomization in some cases. For example, in Go(lang), “The iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next.” Maybe some similar intentional randomization exists in Julia; in particular, I wonder whether the iteration order over BitSets is specified in Julia?
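
(A quick, non-exhaustive way to check for that kind of nondeterminism is to compare the iteration order of two identically constructed sets; the elements below are just an arbitrary example.)

# Does iterating a BitSet built from the same elements always yield the same order?
elems = [5, 1, 9, 3]
collect(BitSet(elems)) == collect(BitSet(elems))  # true if the order is reproducible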

Just out of curiosity, I ran a simple test that takes more or less the same time. Both were run 50 times, and I got these results:

The distributions do not seem very different from each other. I don’t know whether this is normal or not. I ran each execution in an independent Julia session with julia script.jl, but measured the time including compilation with @time. In these 50 runs I didn’t get any “8s” for your script, although I have seen it :slight_smile: .

I cannot say if there is some problem there, or not.

The test code is simply this:

function f(x)
  s = 0.
  for i in 1:size(x,1)
    for j in 1:size(x,2)
      if i%j > 2
        s += sin(x[i,j])
      end
    end
  end
  s
end

x = rand(50_000,10_000)
@time f(x)
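
Roughly, the repeated runs were launched with a shell loop along these lines (a sketch, not the exact command I used):

$ for i in $(seq 1 50); do julia script.jl; done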

I acted on your suggestion, and, in short, the result is that the flamegraph looks about the same regardless of the run’s duration.

First I ran Julia with julia --color=no -O3 --check-bounds=no --track-allocation=none. Then I ran the program 10-15 times like this: @time @profile Lab3.prChrNum(f).

The minimum duration (or close to it) I got was 9.7 seconds, with this flamegraph report:

The maximum duration (or close to it) I got was 17.3 seconds, with this flamegraph report:

The flamegraphs for 12.6s and 15.8s also look very similar to the above.

Just to be sure, did you remember to clear the profiling data between the runs?
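
(For reference, a minimal sketch of clearing and re-collecting a profile; the statprofilehtml() call assumes the StatProfilerHTML package loaded earlier in the thread.)

using Profile, StatProfilerHTML

Profile.clear()              # drop the samples collected during previous runs
@profile Lab3.prChrNum(f)    # collect a fresh profile for this run only
statprofilehtml()            # regenerate the HTML flamegraph report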

The fact that the two have basically the same time distribution but different overall times is considerably strange. If your data were larger I would bet on memory thrashing, but that should not be the case with so little memory in use. @lmiq’s answer points in the direction of this being common behavior, but his example uses a lot more memory and may be susceptible to other programs affecting the cache. I suppose you did not have anything else running in parallel?

I think you can combine perf with Julia. perf gives some extra stats that could help elucidate the case, such as the average number of instructions per cycle and the number of cache misses. If the average number of instructions per cycle differs between the two, then the core was somehow running slower (either by thermal throttling, or maybe the physical core had one logical core allocated to your run and another to some other process); if the number of cache misses differs, then another application may have thrashed the cache in the slower runs.
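
Something like this should work (a sketch assuming Linux perf is installed; the script name is a placeholder for whatever calls Lab3.prChrNum):

$ perf stat -e cycles,instructions,cache-references,cache-misses \
    julia --color=no -O3 run_prChrNum.jl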


Now I ran an example that uses no memory at all, and the equivalent code in Fortran (time measured with the system’s time command here). There is no clear difference in the variation of the time measurements. Also, running both sets side by side, there is a clear correlation between the times of one set (Julia) and the other set (Fortran), indicating that something system-wide is going on, associated with the fluctuations.

The no-mem Julia code is:

function f()
  s = 0.
  for i in 1:50_000
    x = sin(i)
    for j in 1:20_000
      y = cos(j)
      if i%j > 2
        s += x*y
      end
    end
  end
  s
end

@time f()

And the Fortran equivalent is:

program main
  integer :: i, j
  double precision :: s, x, y
  s = 0.
  do i = 1,50000
    x = dsin(dble(i))
    do j = 1,20000
      y = dcos(dble(j))
      if(mod(i,j) > 2) then
        s = s + x*y
      end if
    end do
  end do
  write(*,*) s
end program main
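
(For reference, the two versions were built and timed roughly like this; the compiler choice, flags, and file names are assumptions rather than the exact commands.)

$ gfortran -O3 nomem.f90 -o nomem_f
$ time ./nomem_f        # Fortran version timed with the system's time
$ julia -O3 nomem.jl    # Julia version prints its own timing via @time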


Thus, my conclusion is that benchmarking is hard :neutral_face:

Edit: complementing the previous experiment, I now ran each of the two calculations above 100 times, side by side. The result is:

(Plot of execution time versus launch order for the Julia and Fortran runs.)

Note that there is a clear correlation between the launch time and the execution time for each language. That is, something happened system-wide that made both slower. I have 4 cores here, and these programs were using only two of them (I kept working with a spreadsheet in the meanwhile). The dispersion of the times does not seem to be significantly different between the two. Note also that the first and last executions appear to be faster, so even though there are more cores than programs being executed in the test, they appear to compete with each other (when the Fortran test finished, the Julia executions got noticeably faster, apparently).

So, this dispersion of execution times does not seem to be anything particular to Julia.
