I am using the @time and @btime macros to determine how my function performs with different numbers of threads. I am mostly interested in relative values rather than absolute times. Threads aside, I noticed that the more times I run the tests, the faster the function gets. Below, with @btime:
julia> @btime modeltest(mdl_init);
6.387 s (174732441 allocations: 5.72 GiB)
julia> @btime modeltest(mdl_init);
657.132 ms (174730885 allocations: 5.72 GiB)
julia> @btime modeltest(mdl_init);
632.269 ms (174731867 allocations: 5.72 GiB)
Differences in execution times between @time and @btime aside, where do the differences between consecutive runs come from? Is the Julia compiler learning how to run the function more efficiently? Or is it re-using garbage-collected memory?
The difference between the first @time and the others is that on the first run the function gets compiled. Thus, in the first run you are measuring the compilation time and its allocations.
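A minimal sketch of this effect, using a toy function `f` (an assumption, not the original `modeltest`):

```julia
# Sketch: the first call to a freshly defined function pays the JIT
# compilation cost; subsequent calls measure only the runtime.
f(x) = sum(abs2, x)
x = rand(10^6)
@time f(x)   # first run: includes compilation time and its allocations
@time f(x)   # later runs: runtime only
```

Running this in a fresh session, the first `@time` typically reports far more time and allocations than the second.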
The differences between subsequent @time executions (after the first one) are probably random noise.
The benchmarks with @btime are probably wrong, because you need to interpolate the variables there, with $:
@btime modeltest($mdl_init)
Also be careful if the function modeltest modifies the contents of mdl_init: @btime executes the function multiple times, so times may vary because the input changes between evaluations. This may also explain the disparity between the first and subsequent calls of @btime.
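A sketch of the pitfall, with a hypothetical mutating function `grow!` (not from the original post): each call does more work than the last, which is exactly what @btime's repeated evaluations would hit.

```julia
# Sketch: a benchmarked function that mutates its input does different
# work on every evaluation. `grow!` doubles its vector each call.
grow!(v) = append!(v, copy(v))
v = [1, 2, 3]
grow!(v)
grow!(v)
length(v)   # now 12: each call operates on a larger input than the last
# With BenchmarkTools, rebuild the input for every evaluation instead:
#   @btime grow!(w) setup=(w = copy($v)) evals=1
```

The `setup` and `evals=1` keywords of BenchmarkTools make each evaluation start from a fresh copy, so mutation no longer skews the measurement.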
By the way: That amount of allocations and that amount of garbage collection probably indicate that there is something wrong (type instabilities) in your code.
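A minimal sketch of what such a type instability looks like (hypothetical functions, not the poster's code):

```julia
# Sketch of a classic type instability: `acc` starts as an Int but becomes
# a Float64 inside the loop, forcing the compiler to handle a Union type
# and allocate boxed intermediate values.
function unstable_sum(xs)
    acc = 0                    # Int -- changes type on the first Float64 add
    for x in xs
        acc += x
    end
    return acc
end

function stable_sum(xs)
    acc = zero(eltype(xs))     # concrete type fixed up front
    for x in xs
        acc += x
    end
    return acc
end
# `@code_warntype unstable_sum(rand(10))` highlights acc::Union{Int64, Float64}
```

`@code_warntype` (or JET.jl) will flag the unstable version in red, which is usually the fastest way to find where the spurious allocations come from.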
In my not-so-long experience, 4% of GC time is not necessarily an indication of a problem, but it may be. I had a similar situation, and in my case I finally found where those allocations were occurring and fixed them, making the threaded version much better. Ideally the code should not allocate anything in the performance-critical parts.
I would try to track those allocations and be sure that they are strictly necessary.
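One way to track them is `@allocated` (per expression) or starting Julia with `--track-allocation=user` (per line). A sketch with an assumed allocation-free helper:

```julia
# Sketch: @allocated reports bytes allocated by an expression; a hot,
# type-stable function should report (close to) zero after a warm-up call.
sumsq(xs) = sum(abs2, xs)
xs = rand(1000)
sumsq(xs)                      # warm up first: compilation itself allocates
bytes = @allocated sumsq(xs)   # expect 0 for an allocation-free function
# `julia --track-allocation=user` can then pinpoint allocating lines per file
```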
I think the question here is: when you have 1 thread, is it doing the same number of calculations as when you have 64 threads? Or is the 64-thread version doing 64 times the work of the single-thread version?
To compare apples to apples you would need to ensure that in the 64-thread version each thread does 1/64th of the calculations that the single thread does. Otherwise you are comparing the GC time of two totally different computations.
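A sketch of what "same total work" means in practice (a hypothetical `threaded_sum`, not the poster's code): the total work is fixed by the input size, and each thread handles one slice of it.

```julia
using Base.Threads

# Sketch: the TOTAL work is fixed by length(xs); each thread handles one
# contiguous slice, so runs with 1 thread and with 64 threads perform the
# same number of additions overall.
function threaded_sum(xs)
    n = nthreads()
    partials = zeros(eltype(xs), n)
    @threads :static for t in 1:n
        lo = div((t - 1) * length(xs), n) + 1
        hi = div(t * length(xs), n)
        s = zero(eltype(xs))
        @inbounds for i in lo:hi
            s += xs[i]
        end
        partials[t] = s
    end
    return sum(partials)
end
```

Benchmarking this with `JULIA_NUM_THREADS=1` versus `JULIA_NUM_THREADS=64` is a fair comparison, because the per-run workload does not change with the thread count.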
I saw that kind of GC increase when I had a type instability in the container holding the results of the calculation, which was copied for each thread to avoid race conditions. It smells like something similar here.
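A sketch of the fix for that pattern: give each thread its own buffer with a concrete element type (the names here are illustrative, not from the original code).

```julia
using Base.Threads

# Sketch: per-thread result buffers with a CONCRETE element type. An untyped
# `[]` (i.e. Vector{Any}) would box every stored value, inflating allocations
# and GC pressure in the threaded version.
results = [Float64[] for _ in 1:nthreads()]   # one typed buffer per thread
@threads :static for i in 1:100               # :static keeps threadid() stable
    push!(results[threadid()], sqrt(i))
end
total = sum(length.(results))
```

The `:static` scheduler pins each iteration to one thread so that indexing by `threadid()` is safe; with the default dynamic scheduler, tasks may migrate between threads mid-loop.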