Tracking memory usage in unit tests (a lot worse in Julia 0.7 than 0.6)

Upgrading our package from Julia 0.6 to 0.7, we’ve noticed that unit tests use a lot more memory. Trivial tests that should in theory allocate a few KB can now allocate hundreds of MB. A minimal example:

function test()
    A = [x for x=0.0:2]
end

test()

Running it:

julia> @time include("test.jl")
  0.236493 seconds (730.15 k allocations: 38.490 MiB, 2.46% gc time)

I understand that the usual answer to this type of question is to time the code twice, since the first run includes JIT compilation, etc. And indeed, if I run @time test() twice, the second run gives a more reasonable 0.000004 seconds (5 allocations: 272 bytes). But it doesn’t seem very practical, when running a large number of unit tests, to have to run each test twice (some of our tests take quite a bit of time). Is there a recommended way of doing this? Our goals are 1) somewhat accurate performance measurements so that we can detect regressions, and 2) minimizing the total time spent running the test suite.

In particular, we can’t really tell if we’ve regressed since Julia 0.6, since the first invocation of code seems to take a lot longer and use much more memory in Julia 0.7 than in 0.6. The same code as above with Julia 0.6.4:

julia> @time include("test.jl")
  0.111263 seconds (60.98 k allocations: 3.406 MiB)

Are you sure you eliminated all deprecation warnings? Sometimes they may not show explicitly. Run julia with the option --depwarn=error to turn them into errors instead.


Yes, all deprecation warnings are removed. But it’s easy to reproduce the problem I’m referring to with a minimal unit test not related to our package:

using Test

function array()
    [x for x=0.0:2.0]
end

@test all(array() .< 10)

Running this test:

julia> @time include("test.jl")
  0.681974 seconds (2.28 M allocations: 114.947 MiB, 3.20% gc time)
Test Passed

With numbers like these, it becomes almost pointless to measure performance and memory usage, since I guess we’re measuring JIT compilation rather than the run time of the code under test.

So how do package developers track performance in their tests? The best idea I can come up with is to alter each test to run twice, time only the second run, and do the timing from within the test instead of outside, i.e.:

using Test

function array()
    [x for x=0.0:2.0]
end

@test all(array() .< 10)
@time all(array() .< 10)

Which results in more accurate numbers:

julia> include("test.jl")
  0.000015 seconds (10 allocations: 4.625 KiB)

But this means that we have to run our entire test suite twice, and it’s already slow enough as it is.

But BenchmarkTools would not solve our problem, would it? Isn’t the point of that tool to run code multiple times to get stable results? That’s precisely what we’re hoping to avoid.

Please read its docs; you can run it just once (i.e. twice, since it will not measure the first run), but IMO that’s pretty pointless for benchmarking.

To actually be able to catch regressions you’re going to have to run the suite multiple times, in order to average out random noise from differing configurations, machine states, neutrinos flipping a bit, etc… BenchmarkTools is not just the @btime and @benchmark macros, but provides a bunch of different configurations to support your own test suites. Check the documentation for a basic introduction into running custom tests. In particular, the part about a Trial and what it contains (e.g. number of allocations and allocated memory) is going to be of interest to you. As far as I can tell, if all you’re looking for is the memory used and not the time, running just once should be fine, as the allocations shouldn’t really change.
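
A rough sketch of what that could look like, using the array function from the example above (the suite layout and parameter values are just illustrative):

using BenchmarkTools

function array()
    [x for x = 0.0:2.0]
end

# Define a named group of benchmarks; nothing is run at this point.
suite = BenchmarkGroup()
suite["array"] = @benchmarkable array()

# Run each benchmark a single time (one sample, one evaluation per sample).
results = run(suite; samples = 1, evals = 1)

# Each Trial records the allocation count and allocated bytes alongside the timings.
trial = results["array"]
println("allocations: ", allocs(trial))
println("memory (bytes): ", memory(trial))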

Hmm, thanks for the responses. I’ve used BenchmarkTools before, and just don’t see how it would help us, since in the end its solution to the problem is to run code repeatedly (even if only twice), which is what we’re hoping to avoid.

We run the test suite on the build server on every commit, and the build already takes ~10 minutes. What we are looking for are suggestions on how to get somewhat accurate performance measurements for each build – we don’t require nanosecond or single-byte precision, but if the memory usage for a test grows by, say, 2x in a commit, that’s something we’ll want to find out as soon as possible. (And yes, memory usage is fairly stable when testing identical code once, but it’s not reliable at all; tiny, irrelevant code changes can change JIT compilation behavior, which can allocate a lot more or less memory.)
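
For illustration, a rough sketch of the kind of per-test allocation check we have in mind; the 272-byte figure is just the number from the warm run earlier, used here as a hypothetical baseline:

using Test

function array()
    [x for x = 0.0:2.0]
end

array()                     # warm-up call so JIT allocations are not counted
bytes = @allocated array()  # allocations of the second, already-compiled call
@test bytes <= 2 * 272      # fail the test if memory use more than doubles vs. the baseline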

It might just be the nature of Julia with its JIT compiler that what we’re asking for is not possible.

If so, how do other package developers do this?

Faster unit tests, so that they can be benchmarked properly for each build?
Not running benchmarks as part of the build, so that the benchmarks can be run for longer?
They don’t benchmark their unit tests?

Well, I’m not sure how other developers handle this, but I for one don’t run the full test suite on every commit. With BenchmarkTools you are able to create benchmark suites for specific parts and just run those, though you’ll need some setup to only run the suites for the code you changed (or whatever is affected by the change). Those suites only need to run once too, since you don’t care about the time in those parts.

In general though, I don’t think checking for regression of everything on every commit is a good choice, especially because that really limits iteration speed on your code. Your options are probably limited to “run the tests less often” and “write granular testsuites to be run independently”. A combination of both is probably a good choice.

Another idea would be to only benchmark unit tests when you already know they’re slow/use more memory, instead of all the time.


You might find the README and the approach of this project useful:
https://github.com/JuliaCI/BaseBenchmarks.jl


If the problem has a scale (e.g. number of observations, grid size) that leaves the algorithm invariant, you could run on a small scale to compile, then benchmark on the large one.
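
Roughly like this, with a hypothetical compute(n) whose compiled methods do not depend on the problem size n:

compute(n) = sum(x -> x^2, 1:n)   # hypothetical scale-invariant workload

compute(10)                # small run: triggers compilation of the same methods
@time compute(10_000_000)  # large run: the timing now excludes JIT overhead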


Basically you want to compile your function before calling it for the first time. Calling @code_native seems to compile your function, and you can call code_native with a dummy buffer to avoid printing the output. Maybe that doesn’t always work, or there’s a better way to trigger compilation.

julia> function test()
           A = [x for x=0.0:2]
       end
test (generic function with 1 method)

julia> code_native(IOBuffer(),test,())

julia> @time test()
  0.000030 seconds (6 allocations: 352 bytes)
3-element Array{Float64,1}:
 0.0
 1.0
 2.0

You could use precompile for this.

https://docs.julialang.org/en/v1/base/base/#Base.precompile
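
For the example above, something like this should work (precompile takes the function and a tuple of argument types, and compiles without running the code):

function test()
    A = [x for x = 0.0:2]
end

precompile(test, ())   # compile test() for zero arguments without executing it
@time test()           # the timing should now exclude most of the JIT overhead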


Does it precompile recursively (i.e. also the functions it calls)?

I think that’s impossible in full generality without actually running the code. It may be possible to recursively precompile the calls that are resolved with static dispatch, but that might also be problematic (because it might precompile too much).

Can you test how much time it takes to run everything twice? Is it 20% longer or 99% longer? Depending on the result you will need a different strategy: in the 20% case I would just live with it, while in the second case I would look into precompiling.

I found some more stuff that might be of interest to you:

https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis-1

Especially Coverage.jl, since memory allocation can be analyzed per function and per line. This could be integrated into a workflow where changed code gets tested automatically for memory allocation regressions.
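
If I understand the docs correctly, the workflow would roughly be: run the tests with julia --track-allocation=user, then analyze the generated *.mem files with Coverage.jl’s analyze_malloc (the "src" path is only a placeholder):

using Coverage

# After a test run with `julia --track-allocation=user`,
# *.mem files appear next to the source files they belong to.
mallocs = analyze_malloc("src")   # collect per-line allocation data from the .mem files
foreach(println, mallocs)         # entries are sorted by allocated bytes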

Thanks for all the advice! Several good suggestions here, which we’ll look further into.

I am a bit skeptical about relying on code coverage for any part of this, since I find that it does not work very well in Julia 0.7. I just started a new topic about this.

julia -O, --optimize={0,1,2,3} Set the optimization level (default 2 if unspecified or 3 if specified as -O)

I don’t know why Julia got slower, but I assume, since there’s a tradeoff between compilation speed and optimization, that the default level is now more aggressive. Maybe lowering it, e.g. with -O1, helps?

Changing the optimization level seems like a bad idea when you want to test the performance.
