While I was looking into optimizing my pure Julia program, I noticed something very unpleasant: its run time varies a great deal between runs even though its input and code stay the same, and even though the system has plenty of available CPU and RAM. The code’s here: https://github.com/nsajko/discrete-math-lab3/blob/master/lab3.jl
As you can imagine, this high variance makes measuring improvements to the code very difficult: I would have to take a great many measurements to tell whether a change is an optimization or a pessimization.
I’m using Julia 1.5.3 and LLVM 10.0.1 (Archlinux).
Concrete data:
These are a couple of Julia sessions that use the @time macro to show how much the performance can vary from run to run. The minimum time was around 7.5 seconds and the maximum around 14 seconds, although oftentimes the immediately following timing is about the same as the one before it.
I hope that someone can explain what Julia is messing up to make the run time go from 7.5 seconds to 14 seconds, and what I can do to keep it around 8 seconds, hopefully without changing my code.
Tangentially related question: I’m having trouble understanding some parts of the profiling report: the function iterate seems to appear in some unexpected places in the profile. E.g.:
This is the data for line 64 of lab3.jl, which consists of this code: (cn < min) && (min = cn). As you can see, there is no for loop on this line, so I don’t understand what the connection between the function iterate and this line is. Could this even be a bug in the profiler?
The quick answer is to use @btime or @benchmark from BenchmarkTools, which will report the minimum, maximum, and mean times. I am not sure how much the times of single runs can vary in general, but if you look at the output of that, there is indeed usually a huge variation.
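For instance, a minimal sketch of what that looks like (f and x here are just placeholders for the real workload):
using BenchmarkTools

# Placeholder workload; substitute the real function and input.
f(x) = sum(sin, x)
x = rand(10^6)

@btime f($x)         # prints the minimum time and the allocation count
@benchmark f($x)     # prints minimum, median, mean and maximum over many samples
The $x interpolation makes sure the cost of looking up the global variable is not included in each sample.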
The problem here is that the distribution of the timings seems unacceptably bad. Usually (in my experience with C and C++) the vast majority of performance measurements fall into several “levels”, where each level is a steep spike around a local maximum of the probability density function, and each successive maximum of the PDF is significantly lower than the previous one. Here’s an example of what I’m talking about (the data is for some unrelated programs): 1023-1013.svg - Google Drive. Each plot in that SVG is the PDF of the measured run times of a distinct implementation of a certain program (the goal was to compare the different implementations).
The situation here (with Julia and my program) is different: firstly because 14 is so large compared to 8, and secondly because values around 14 do not appear much less frequently than values around 8; in fact, values around 14 even seem to appear more frequently than values around 8 do.
All of this indicates that there’s something terribly wrong with either Julia or the way I use Julia, and since I’m a newbie, I was hoping the cause would be obvious to other participants of this forum. There would be little sense in attempting benchmarking if I cannot get to the bottom of this.
EDIT: in other words, it seems like events that really ruin performance happen very frequently while I’m executing my program.
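To make the shape of this distribution more concrete, one thing I could do is collect many timings within a single session and look at their summary statistics; a rough sketch (workload() is just a stand-in for my program’s entry point):
using Statistics

# `workload()` is a placeholder for the program's entry point.
times = [@elapsed workload() for _ in 1:50]

println("min    = ", minimum(times))
println("median = ", median(times))
println("mean   = ", mean(times))
println("max    = ", maximum(times))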
Did you try comparing a profile of a fast run to a profile of a slow run? Is the time distribution across the called functions/operators the same (do they all decrease by the same proportion), or does the time spent on specific parts of the code go down?
As far as I know, my algorithms are deterministic. One thing comes to mind, though: maybe Julia itself does some randomization in some cases. For example, in Go(lang), “The iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next.” Maybe some similar intentional randomization exists in Julia; in particular, I wonder whether the iteration order over BitSets is specified in Julia?
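For what it’s worth, here is a quick check I could run to see whether the iteration order is at least reproducible within a session (this doesn’t show whether it is specified, just that it isn’t randomized per call):
s = BitSet([17, 3, 42, 5])

# Collect the elements in iteration order twice and compare.
order1 = collect(s)
order2 = collect(s)
println(order1)              # here this prints [3, 5, 17, 42]
println(order1 == order2)    # true: the order does not change between iterations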
The distributions do not seem very different from each other. I don’t know whether this is normal or not. I ran each execution in an independent Julia session with julia script.jl, but measured the time, including compilation, with @time. In these 50 runs I didn’t get any “8 s” timings for your script, although I have seen it before.
I cannot say if there is some problem there, or not.
The test code is simply this:
function f(x)
    s = 0.
    for i in 1:size(x, 1)
        for j in 1:size(x, 2)
            if i % j > 2
                s += sin(x[i, j])
            end
        end
    end
    s
end
x = rand(50_000,10_000)
@time f(x)
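One caveat about the measurement above: since each run was launched as a fresh julia script.jl process, @time also includes the JIT compilation of f. A sketch of how that could be separated out, continuing the snippet above, is to call the function once before timing it:
f(x)        # warm-up call: compiles f for this argument type
@time f(x)  # now measures only the run time of the already-compiled code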
I acted on your suggestion and, in short, the results seem to be that the flamegraph looks about the same regardless of how long the run takes.
First I ran Julia with julia --color=no -O3 --check-bounds=no --track-allocation=none. Then I ran the program 10-15 times like this: @time @profile Lab3.prChrNum(f).
The minimum duration (or close to it) I got was 9.7 seconds, with this flamegraph report:
Just to be sure: did you remember to clear the profiling data between the runs?
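That is, something along these lines (just a sketch; Lab3.prChrNum(f) is the call you showed above):
using Profile

Profile.clear()                    # drop the samples from any previous run
@time @profile Lab3.prChrNum(f)    # profile only this run
Profile.print()                    # or feed the data to the flame graph viewer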
The fact that the two have basically the same time distribution but different overall times is quite strange. If your data were larger I would bet on memory thrashing, but that should not be the case with so little memory in use. @lmiq’s answer points in the direction that this is common behavior, but his example uses a lot more memory and may be susceptible to other programs affecting the cache. I suppose you did not have anything else running in parallel?
I think you can combine perf with Julia. perf gives some extra stats that could help elucidate the case, such as the average number of instructions per cycle and the number of cache misses. If the instructions per cycle differ between the two runs, then the core was somehow running slower (either due to thermal throttling, or maybe the physical core had one logical core allocated to your run and another to some other process); if the number of cache misses differs, then another application may have thrashed the cache in the worse runs.
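A sketch of what that could look like (this assumes Linux with perf installed; the events listed are the generic hardware counters):
# Launch the script under `perf stat`, counting cycles, instructions and cache events.
# Comparing instructions-per-cycle and the cache-miss rate between a fast run and a
# slow run should show whether the core ran slower or the cache was being thrashed.
run(`perf stat -e cycles,instructions,cache-references,cache-misses julia script.jl`)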
Now I ran an example that uses essentially no memory at all, together with the equivalent code in Fortran (time measured with the system’s time here). There is no clear difference between the variations in the time measurements. Also, running both sets side by side, there is a clear correlation between the times of one set (Julia) and the other set (Fortran), indicating that there is something system-wide going on that is associated with the fluctuations.
function f()
    s = 0.
    for i in 1:50_000
        x = sin(i)
        for j in 1:20_000
            y = cos(j)
            if i % j > 2
                s += x * y
            end
        end
    end
    s
end

@time f()
And the Fortran equivalent is:
program main
    integer :: i, j
    double precision :: s, x, y
    s = 0.
    do i = 1, 50000
        x = dsin(dble(i))
        do j = 1, 20000
            y = dcos(dble(j))
            if (mod(i, j) > 2) then
                s = s + x*y
            end if
        end do
    end do
    write(*,*) s
end program main
Thus, my conclusion is that benchmarking is hard.
Edit: complementing the previous experiment, I now ran each of the two calculations above 100 times, side by side. The result is:
Note that there is a clear correlation between the launch time and the execution time for each language. That is, something happened system-wide that made both slower. I have 4 cores here, and these programs were using only two of them (I kept working in a spreadsheet in the meantime). The dispersion of the times does not seem to be significantly different between the two. Note also that the first and last executions appear to be faster, so even though there are more cores than programs being executed in the test, they appear to compete with each other (when the Fortran test finished, the Julia executions got quite a bit faster, apparently).
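To put a number on that correlation, the per-run times of both programs could be fed to cor from Statistics; a rough sketch (the vectors here are only placeholders for the 100 measured times of each program, in launch order):
using Statistics

# Placeholder vectors for the 100 wall-clock times of each program, in launch order.
julia_times   = rand(100) .+ 8.0
fortran_times = rand(100) .+ 8.0

# A value close to 1 would confirm a system-wide effect slowing both programs down.
println(cor(julia_times, fortran_times))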
So, this dispersion of execution times does not seem to be anything particular to Julia.