Benchmarking with @time @btime and subsequent runs return shorter execution time

I am using @time and @btime macros to determine how my function preforms on different number of threads. I am mostly interested not in specific time but relative values. Threads aside, I noticed that the more times I run tests the faster the function gets. Below with @time:

julia> @time modeltest(mdl_init);
 20.720889 seconds (214.16 M allocations: 7.629 GiB, 31.47% gc time)
julia> @time modeltest(mdl_init);
  2.384151 seconds (174.74 M allocations: 5.730 GiB, 55.80% gc time)
julia> @time modeltest(mdl_init);
  2.171960 seconds (174.74 M allocations: 5.725 GiB, 65.92% gc time)

Similar outcome with @btime:

julia> @btime modeltest(mdl_init);
  6.387 s (174732441 allocations: 5.72 GiB)
julia> @btime modeltest(mdl_init);
  657.132 ms (174730885 allocations: 5.72 GiB)
julia> @btime modeltest(mdl_init);
  632.269 ms (174731867 allocations: 5.72 GiB)

Differences in execution times in @time and @btime aside (), where does differences between consecutive runs come from? Is Julia Compiler learning how to run function more efficiently? Or is it re-using RAM garbage?

Julia Version 1.5.3

The difference between the first @time and the others is due to the fact that in the first run of the function it gets compiled. Thus, in the first run you are measuring the compilation time and its allocations.

The differences between subsequent @time executions (after the first one) are probably random noise.

The benchmarks with @btime are probably wrong, because you need to interpolate the variables there, with $:

@btime modeltest($mdl_init)

be careful also if the function modeltest modifies the content of mdl_init, because @btime executes the function multiple times, thus times may vary because the input is different. This may be also a reason for such a disparity in the first and subsequent calls of @btime.

By the way: That amount of allocations and that amount of garbage collection probably indicate that there is something wrong (type instabilities) in your code.


Additional question:

See table:

Number of threads Relative speed Share of time used by Garbage Collector
1 19.89 4.02%
4 8.22 32.55%
8 6.76 46.96%
16 4.88 56.88%
32 1.31 69.62%
64 1.0 59.17%

Just based on a table above, without looking into code, can you tell me: does my code has problem with GC or is it @Threads related? Or a bit of both?

If relative speed and share of GC are multiplied the results are moreless consistent across nthreads.

In my not so long experience, I would say that 4% of GC is not necessarily an indication of a problem, but it may be. I had a similar situation and in my case I finally found where those allocations where occuring and fixed them, making the threaded version much better. Ideally one would like a code that does not allocate anything in the performance-critical parts.

I would try to track those allocations and be sure that they are strictly necessary.

Take a look at this thread: Track memory usage

So I should expect to see increase GC given more threads?

That probably indicates that many allocations occur in a part of the code that is being split through the threads.

I think the question here is, when you have 1 thread, is it doing the same number of calculations as when you have 64 threads? Or is the 64 thread version doing 64 times the number of calculations as when you have 1 thread?

To test apples to apples you would need to ensure that in the 64 thread version each thread is doing 1/64th the number of calculations as the thread in the single thread version. Otherwise you are looking at the GC time between two totally different calculations.

1 Like

I had that kind of GC increase when I had a type instability in the container of the results of the calculation, which was copied for each thread to avoid racing conditions. It smells something like that there.

Yes. Reducing allocations is quite important to speed up multi threaded code in my experience.