Confusing benchmark time results and memory allocation depending on number of calls for function with zip

stakaz · March 8, 2019, 8:53am

Hello, I recently discovered a very strange behavior which I cannot understand and I would like to ask you for help on this.

I have simplified the original code as much as possible to reproduce this behavior. Consider the following script:

using Statistics 
using BenchmarkTools 

f(X, av = (X,D) -> mean(X)) = av((a for a ∈ X.A), X)
g(X, av = (X,D) -> mean(X)) = av((a * b for (a, b) ∈ zip(X.A, X.B)), X)

N = 100_000 

X = (A = rand(N), B = rand(N))

@inline function reweight(A, W, ω)
  N = 0.0
  D = 0.0
  for (a,w) ∈ zip(A,W)
    P = exp(ω * w)
    N += a * P
    D += P
  end
	return 	N / D
end

println("f")
@btime f($X) ## try to comment out this line, restart julia and run twice
@btime f($X, (X,D) -> reweight(X, D.B, 0.1))
@btime f($X, (X,D) -> reweight(X, D.B, 0.1))

println("g")
@btime g($X) ## try to comment out this line, restart julia and run twice
@btime g($X, (X,D) -> reweight(X, D.B, 0.1))
@btime g($X, (X,D) -> reweight(X, D.B, 0.1))

Running it in a freshly opened julia session give following results on my computer:

f
  148.624 μs (1 allocation: 16 bytes)
  4.417 ms (1 allocation: 16 bytes)
  4.406 ms (1 allocation: 16 bytes)
g
  322.668 μs (3 allocations: 64 bytes)
  4.844 ms (3 allocations: 64 bytes)
  4.839 ms (3 allocations: 64 bytes)

Basic things to note: both, f and g with reweight run fast and allocate nearly no memory. Also, both runs for each of the functions are quite the same.

Now, comment out the marked lines in the code, restart julia and run the script again. The results from the first run are:

f
  4.409 ms (1 allocation: 16 bytes)
  4.405 ms (1 allocation: 16 bytes)
g
  274.319 ms (1299496 allocations: 30.51 MiB)
  270.832 ms (1299496 allocations: 30.51 MiB)

but from the second run I get:

f
  4.407 ms (1 allocation: 16 bytes)
  4.402 ms (1 allocation: 16 bytes)
g
  4.839 ms (3 allocations: 64 bytes)
  4.843 ms (3 allocations: 64 bytes)

Things to note: the first two runs for g with reweight are slow and consume a lot on memory. But both calls from the second run are fast again. However, for f every run is fast.

So the question is: how the call g(X) affects the speed of the call with the reweight function and why does the same not happen when calling the function twice but only when rerun the code a second time?

As you see, the difference in f and g is the usage of zip. Does this explain the behavior somehow?

You can also use @time, the picture is the same, so it is not due to BenchmarkTools.

Thanks.

Tamas_Papp · March 8, 2019, 2:33pm

I can’t replicate the problem (the discrepancy on the second run) on master. Also note that the relative timings improve a lot.

julia> println("f")
f

julia> @btime f($X, (X,D) -> reweight(X, D.B, 0.1))
@btime f($X, (X,D) -> reweight(X, D.B, 0.1))
  648.059 μs (1 allocation: 16 bytes)
0.49993090371947524

julia> @btime f($X, (X,D) -> reweight(X, D.B, 0.1))
  643.599 μs (1 allocation: 16 bytes)
0.49993090371947524

julia> println("g")
g

julia> # @btime g($X) ## try to comment out this line, restart julia and run twice
       @btime g($X, (X,D) -> reweight(X, D.B, 0.1))
  76.858 ms (1299496 allocations: 30.51 MiB)
0.2546995589959171

julia> @btime g($X, (X,D) -> reweight(X, D.B, 0.1))
  76.494 ms (1299496 allocations: 30.51 MiB)
0.2546995589959171

julia> println("f")
f

julia> # @btime f($X) ## try to comment out this line, restart julia and run twice
       @btime f($X, (X,D) -> reweight(X, D.B, 0.1))
  648.604 μs (1 allocation: 16 bytes)
0.49993090371947524

julia> @btime f($X, (X,D) -> reweight(X, D.B, 0.1))
  648.633 μs (1 allocation: 16 bytes)
0.49993090371947524

julia> println("g")
g

julia> # @btime g($X) ## try to comment out this line, restart julia and run twice
       @btime g($X, (X,D) -> reweight(X, D.B, 0.1))
  77.313 ms (1299496 allocations: 30.51 MiB)
0.2546995589959171

julia> @btime g($X, (X,D) -> reweight(X, D.B, 0.1))
  77.880 ms (1299496 allocations: 30.51 MiB)
0.2546995589959171

julia> VERSION
v"1.2.0-DEV.445"

stakaz · March 8, 2019, 3:28pm

So in you example you always have the “bad” result for g? What about not commenting out the marked lines?

stakaz · March 8, 2019, 4:57pm

I expect that the faster time without much memory should be the right one. By the way, I am on Julia 1.1.0.

c42f · March 9, 2019, 4:18am

I can reproduce this with 1.1 on linux.

This kind of thing can happen when changing the order in which functions are type inferred. Type inference reuses inference results from a cache whenever possible. When inferring the result of a call tree, some subtrees can already be in the cache which makes the job easier for type inference. Conversely, if the subtree isn’t present, inference may sometimes decide things are getting “too hard” and start approximating.

When you redefine f, g and reweight they need to be re-inferred so more cached results may be used the second time around and it’s possible for inference to return a more precise result.

Here is a further reduction:

using Statistics 
using BenchmarkTools 

f(X) = reweight((a for a ∈ X.A), 0.1)
g(X) = reweight((a * b for (a, b) ∈ zip(X.A, X.B)), 0.1)

@inline function reweight(A, ω)
    N = 0.0
    for (a,) ∈ zip(A)
        N += a
    end
    return N
end

N = 100
X = (A = rand(N), B = rand(N))

println("f")
@btime f($X)

println("g")
@btime g($X)
@code_warntype g(X)

Note that the output of @code_warntype is different when this chunk of code is included twice.

stakaz · March 9, 2019, 8:32pm

Oh, this is a good point. I could understand it completely when it would be an additional overhead in time and memory due to inference. But here the problem scales with N. So, when it is large I have a huge amount of memory and a very slow evaluation in the first evaluation but the second one is fast again. So my impression is, it has something to do with the zip function. Which somehow forces a allocation in memory for the first run but not for the second.

The most important question here is, how can I force the fast version of the code and whether this can be avoided, because usually you would not notice such things in you code.

c42f · March 10, 2019, 11:09pm

I agree this is probably an issue with inferring expressions involving zip, possibly something to do with it being used in two places causing inference to give up prematurely. I don’t know though.

I think at this point it might be best to submit an issue on github. Hopefully it can be fixed by a tweak to the inference heuristics or it might even be an inference bug exposed by a corner case.

stakaz · March 11, 2019, 12:01pm

So here is the link to the issue on github.

c42f · March 11, 2019, 12:17pm

Thanks, improving inference is probably the only long term and general way to fix these kind of problems.

If you need a fix in the short term there’s no doubt many ways this can be rewritten into a more ugly but faster version without the generators.

stakaz · March 11, 2019, 12:27pm

could you show me how? This version was kind of the fastest I have came up with after thinking about it for a while. My use case is N \approx 10^9 so it is a hot loop and I really want to avoid any copy processes.

c42f · March 11, 2019, 12:52pm

How about using this version of reweight to explicitly write out the iteration which would be normally wrapped up by the zip:

@inline function reweight(A, W, ω)
  N = 0.0
  D = 0.0
  ai = iterate(A)
  wi = iterate(W)
  while ai !== nothing && wi !== nothing
      P = exp(ω * wi[1])
      N += ai[1] * P
      D += P
      ai = iterate(A,ai[2])
      wi = iterate(W,wi[2])
  end
  return N/D
end

A cursory test seems to show this works around the problem.

stakaz · March 11, 2019, 12:56pm

You*ve made my day. I am programming on the part of my code where I need this function and yes, from the first test it seams like the generated code is fast.

c42f · March 12, 2019, 12:37am

Here’s the truly minimal reproduction:

julia> @code_warntype sum(x->x[1][1], zip(zip([1,2,3])))
Body::Any
1 ─ %1 = Base.add_sum::Core.Compiler.Const(Base.add_sum, false)
│   %2 = invoke Base.mapfoldl_impl(_2::Function, %1::Function, $(QuoteNode(NamedTuple()))::NamedTuple{(),Tuple{}}, _3::Base.Iterators.Zip{Tuple{Base.Iterators.Zip{Tuple{Array{Int64,1}}}}})::Any
└──      return %2

Topic		Replies	Views
Allocations in function timing Performance	2	554	October 9, 2018
@benchmark reporting (probably) wrong memory usage (BenchmarkTools.jl) Tooling	4	885	June 2, 2018
Allocations (again...) Performance question , allocations	15	671	November 25, 2022
Performance: tuple unpacking General Usage	8	1964	October 5, 2017
Surprising runtime behaviour when wrapping functions Performance question	4	415	September 14, 2021

Confusing benchmark time results and memory allocation depending on number of calls for function with zip

Related topics