Optim.optimize() uses excessive memory with autodiff?

I’m using Optim.optimize() to estimate a likelihood-type model. Having read that autodiff is a great feature, I’ve implemented it. I have three questions that I hope someone can shed some light on.

For reference, the function I’m working with (and the code I’m using) is posted in this gist.

  1. I get the impression that taking advantage of autodiff uses more memory in the optimize() call than would an equivalent user-provided gradient. Is this true?

  2. The reason I ask the first question is because I’m worried my code is using too much memory. Basically, when I run it with @time, I get the following summary:

11.534809 seconds (907.69 k allocations: 7.605 GiB, 12.97% gc time)

This seems like a large amount of memory, but I’m not sure. One of the data arrays that enters the objective function is xg; when I pass it to Base.summarysize(), I get 4395600 back, which I believe means 4.3956 GB. If that’s true, is it still the case that Optim.optimize() is using too much memory?

  3. I noticed that the memory allocations were much larger if I let the optimizer run for a longer amount of time (e.g. by tightening g_tol or f_tol). Does anyone know why that might be?

I thank you in advance for any illumination you can provide.

One thing I notice is that your objective function is allocating a lot of new arrays. Every time you do b[1:end-2] you’re copying that data (which allocates memory), and your final objective calculation allocates a new array and then sums over that array. In addition, you’re not getting perfect broadcast fusion because things like ((sector.==0)*1) will stop fusion because of the non-dotted *.
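
For example, here’s a minimal sketch with made-up data (not the gist’s variables):

sector = rand(0:2, 10)
y = rand(10)

# The non-dotted * materializes (sector .== 0) * 1 as a separate
# integer array before the outer broadcast runs:
v1 = y .* ((sector .== 0) * 1)

# Fully dotted (here, simply dropping the * 1) fuses into a single
# pass with one output allocation:
v2 = y .* (sector .== 0)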

This also explains why allocation goes up as you run the optimization longer. The reported allocation is the total allocated during the entire run (you likely never actually consume that much memory at any one point because garbage is freed frequently), and since every call to your function allocates new temporary arrays, your total allocation goes up with every call.
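
You can see the effect with a toy function (illustrative only):

# Each call allocates a fresh temporary array, so @time's reported
# allocation total grows with the number of calls, even though peak
# memory use stays roughly constant:
g(n) = sum(sum(rand(1000)) for _ in 1:n)
g(1)           # warm up so compilation isn't timed
@time g(10)
@time g(1000)  # ~100x the reported allocations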

I would suggest rewriting your objective function to, if possible, not allocate any new arrays at all. It should be possible to recreate your current approach by simply looping over the data. Vectorized operations in Julia are fast, but simple loops can often be the very fastest thing.

Also, I’m pretty sure summarysize is returning bytes, so your xg is 4.3956 MB, not GB.
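
A quick check (exact sizes are approximate and version-dependent):

# Base.summarysize reports bytes: ~8 bytes per Float64 plus a small
# header for the array object itself.
Base.summarysize(zeros(1_000))   # ≈ 8_040 on a 64-bit build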


Could you make it so we can run the code via a simple copy and paste?
E.g., you could create the data via rand or randn.

As elaborated above, it’d be a lot more productive to focus on optimizing your objective function.

In regards to Optim’s memory allocations, many result from the objective and gradient calls not being type stable:
https://github.com/JuliaNLSolvers/NLSolversBase.jl/blob/master/src/objective_types/oncedifferentiable.jl
The differentiable objects don’t specialize on any of these:

mutable struct OnceDifferentiable{TF, TDF, TX} <: AbstractObjective
    f # objective
    df # (partial) derivative of objective
    fdf # objective and (partial) derivative of objective
    F::TF # cache for f output
    DF::TDF # cache for df output
    x_f::TX # x used to evaluate f (stored in F)
    x_df::TX # x used to evaluate df (stored in DF)
    f_calls::Vector{Int}
    df_calls::Vector{Int}
end

This is to prevent anything from having to recompile when you change the objective. Perhaps this can someday be replaced with something like FunctionWrappers.jl: https://github.com/yuyichao/FunctionWrappers.jl ?
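
To illustrate the tradeoff being described, here is a hedged sketch (not Optim’s actual code):

# Untyped field: any function can be stored without triggering
# recompilation, but every call through it is a dynamic dispatch.
mutable struct UntypedObjective
    f
end

# Typed field: calls through f specialize and can be inlined, but each
# new objective type forces fresh compilation of everything using it.
mutable struct TypedObjective{F}
    f::F
end

UntypedObjective(sin).f(1.0)   # dynamic call
TypedObjective(sin).f(1.0)     # specialized call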

Alternatively, you can forget about compile times and integrate with ForwardDiff more directly, as I did here:
https://github.com/chriselrod/AsymptoticPosteriors.jl/blob/master/src/differentiable_objects.jl
But the price I paid is >100 seconds of compilation on Julia 0.6 and ~10 seconds on Julia 0.7 (perhaps one of the few examples of 0.7 dramatically improving compile times). I’m sure I can improve that a lot when I get around to it. It may also be due to a bunch of the surrounding machinery in that library.
I’ll split the “ForwardDiffDifferentiable” object out into its own library.

Runtime benefits of this specialization aren’t particularly impressive; for anything but extremely fast objective functions, they’re negligible compared to the difference in compile times.

I’d definitely recommend focusing on your objective first, as suggested by rdeits.


Hi!

Optim contributor here, so I’m obviously interested in reducing unnecessary allocations wherever we can! However, I really need a full MWE to say anything specific about your case. Please provide data (and complete import statements), or simply a function that generates appropriate data. Right now it all comes down to guessing, and that means fewer people will take a serious look at your (otherwise pretty complete) example.

P

Edit: Though I am seeing one huge warning sign here. You’re closing over arrays, so maybe you can verify whether there are Box’s all over the output of @code_warntype f(x). This is a bug that I had really hoped would be fixed in Julia v1.0, but it seems like that won’t be the case.
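
A minimal sketch of the pattern (hypothetical names, not the gist’s code; the boxing shows up when a captured variable is reassigned in the enclosing scope):

function make_objective()
    data = randn(100)
    data = data .- sum(data) / length(data)   # reassignment boxes `data`
    return b -> sum(abs2, b .- data)          # closure captures the boxed variable
end

f = make_objective()
# @code_warntype f(randn(100))   # look for Core.Box / red Any annotations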

Edit 2: Also, remember that the 7 GiB doesn’t necessarily mean that 7 GiB was allocated at any one point during the optimization attempt. Since you don’t reuse memory where you could (as mentioned above), you will have a lot of temporaries created.

Thanks, everyone. I’ve updated the originally linked gist to include a data generation function. It’s not the true data generating process, but it will suffice for testing my setup.
I’ll reply to each of you here to make this thread more readable.

@rdeits:
I didn’t know about the fusion stopping when scalar multiplication isn’t dotted; thanks for the tip. Also, that’s very helpful knowledge about memory allocation and garbage collection. And, yes, my math for summarysize was way off; thanks for pointing that out!

If I understand you correctly, you suggest re-writing the obj fun to loop over the data rather than use vectors?

I don’t see a way to not do β = b[1:end-2], but there’s probably some trick I’m missing.

@pkofod:
Regarding your first edit, are you suggesting that I enter @code_warntype wolslambda(x0,wageg,xg,abilg,vabilg,gflg,qg) at the REPL? I wasn’t sure if f(x) here refers to my objective or my TwiceDifferentiable().

Thanks also for the tip about garbage collection.

@Elrod:
Thanks for all of these helpful tips. It does indeed sound like the first-order concern is optimizing the objective function.

OK, I’ve played around some more with this and have the following summary. I ran four different versions of the function:

  1. original version
  2. stop allocating β = b[1:end-2] (by instead directly referencing b[1:end-2])
  3. stop allocating and also use correct loop fusion syntax
  4. stop allocating and also loop over observations

I’ve updated the gist to include these other functions, so you should be able to replicate the following code by simply typing include("optimexample.jl") at the REPL. (Note: I am presenting the 3rd or 4th go at these so as to discard any time spent in precompile)


 original
  1.290890 seconds (916.70 k allocations: 161.455 MiB, 2.64% gc time)
[0.098551, 0.0578882, 0.0738883, 0.0135176, 0.0416898, -0.00567907, 0.0158906, 0.0313328, 0.0661054, 0.0612557, 0.0387951, 0.0527893, 0.0307348, 0.319229, 0.107187]

 get rid of multiple copying
  1.315693 seconds (917.40 k allocations: 169.704 MiB, 2.79% gc time)
[0.0984739, 0.0578457, 0.0738251, 0.0135264, 0.0416369, -0.00570354, 0.0158532, 0.0312826, 0.066048, 0.0611983, 0.0387441, 0.0527249, 0.0306972, 0.336222, 0.101665]

 get rid of multiple copying and use correct fusion operators
  0.803086 seconds (468.01 k allocations: 69.367 MiB, 2.64% gc time)
[0.0984739, 0.0578457, 0.0738251, 0.0135264, 0.0416369, -0.00570354, 0.0158532, 0.0312826, 0.066048, 0.0611983, 0.0387441, 0.0527249, 0.0306972, 0.336222, 0.101665]

 loop over observations
  0.539502 seconds (2.57 M allocations: 527.348 MiB, 11.93% gc time)
[0.0984739, 0.0578457, 0.0738251, 0.0135264, 0.0416369, -0.00570354, 0.0158532, 0.0312826, 0.066048, 0.0611983, 0.0387441, 0.0527249, 0.0306972, 0.336222, 0.101665]

I’m surprised that #4 runs the fastest, but also has the highest memory allocation. I’m more interested in speed than in memory conservation, but does that alarm any of you? My experience with Julia is that performant code tends to run faster and take up fewer resources.

Thanks again for your helpful comments, @rdeits, @Elrod, and @pkofod.

Sorry, my original advice wasn’t super clear. Whether you assign b[1:end-2] to a variable or not makes no difference; the issue is that any time you index with a slice or range you’re making a copy. In fact, your wolslambdaNoAssignLooping is actually allocating much more memory unnecessarily, because now you’re doing b[1:end-2] (and thus making a copy) inside every loop iteration.

The core issue that’s causing your memory allocation in that version is:

dot(x[i,:],b[1:end-2])

which is creating a copy of both the slice of x and the range of b. Additionally, you’re slicing x along its last axis, which is very unfriendly to your cache and can cause a substantial performance loss (see https://docs.julialang.org/en/stable/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1 ). Here are a few things you can do:

  • Transpose your representation of x (not by transposing inside the loop but just by changing the way you generate your data) so that you can do x[:, i] instead. This will improve cache-friendliness but won’t save you any copies
  • If you’re really just treating x as a collection of rows, rather than a matrix, then you can just store a Vector of Vectors instead. Then you can do x[i] instead of x[:, i], which will improve cache-friendliness and eliminate that allocated copy of the slice (sketched at the end of this post).
  • If you really want things to be super fast, I’d suggest getting rid of that dot product entirely and just writing out the loop. That way you can avoid slicing b[1:end-2] by just keeping track of the indices. This is slightly inconvenient, of course, and there are probably fancier ways to avoid having to write out the loop (for example, we could make a custom AbstractVector type that knows how to iterate over just its first N-1 elements), but loops are crazy fast in Julia and pretty easy to write.

That last suggestion would look something like (assuming you’ve transposed your representation of x as well):

for i in 1:size(y, 1)
  result = zero(promote_type(eltype(x), eltype(b)))
  for j in 1:size(x, 1)  # assuming size(x, 1) == length(b) - 2
    result += x[j, i] * b[j]
  end
  other_computations_with_i_and_result()
end

Finally, once you’re absolutely sure that your code is correct and never accessing indices out of bounds, put an @inbounds in front of your for loop.
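
For the second bullet above, a Vector-of-Vectors version might look like this (a sketch with made-up sizes, not the gist’s data):

using LinearAlgebra

xrows = [randn(3) for _ in 1:100]   # 100 observations, 3 regressors each
bshort = randn(3)                   # stand-in for the first end-2 coefficients

function ssq(xrows, b)
    s = zero(promote_type(eltype(first(xrows)), eltype(b)))
    for i in eachindex(xrows)
        s += dot(xrows[i], b)^2     # xrows[i] is a reference, not a copy
    end
    return s
end

ssq(xrows, bshort)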

Right, so I think what we see is that in all but the most trivial cases, the work inside Optim itself (unless, perhaps, you have a very high-dimensional BFGS() or Newton()) will be dominated by the objective and gradient calculations.

Do you know how to use @profile?

https://gist.github.com/pkofod/bf3efe415825b8c660e0f484a6185641

It’s basically telling you what others have said above: if you really want to improve performance, you have to fix how you access your data, and the best of all worlds will probably be nested loops with @inbounds annotations.
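
For reference, basic @profile usage looks something like this (a generic sketch, not the gist’s code):

using Profile

h(x) = sum(abs2, x)   # stand-in for the objective
x = randn(10^6)
h(x)                  # run once so compilation isn't profiled

Profile.clear()
@profile for _ in 1:100
    h(x)
end
Profile.print()       # tree of samples; hot lines dominate the counts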

Also, I just checked, and your results are almost entirely timing compilation. See: Measuring performance

Running your first function a second time, for example, takes 0.05 seconds compared to the 1.29 you report above.

Ok, I did the following:

  • In your looping code, you were initializing SSE to 0.0, which is not the correct type (when using autodiff, b will be a ForwardDiff.Dual), causing a type instability. I changed it to zero(promote_type(eltype(x), eltype(b))), which looks complicated but gets entirely compiled away into the right kind of zero (see the sketch after this list).
  • I added the nested loop version.
  • I put the test itself inside a function and ran everything twice to eliminate compilation time.
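
Here’s an illustration of the problem with toy values (requires the ForwardDiff package; not the gist’s code):

using ForwardDiff

d = ForwardDiff.Dual(2.0, 1.0)   # what an element of b looks like under autodiff
SSE = 0.0                        # Float64 accumulator...
SSE += d^2                       # ...now a Dual: the variable changed type

# Type-stable alternative: build a zero of the promoted element type.
x = randn(3)
SSE2 = zero(promote_type(eltype(x), typeof(d)))   # a Dual-typed zero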

Code is here: https://gist.github.com/rdeits/ad0cb703e86afc26bdea90e71fbe00a2

Results are:

 original
  0.061302 seconds (4.17 k allocations: 114.632 MiB, 12.34% gc time)
[0.0208532, -0.0108786, 0.0212402, 0.0455574, 0.01722, 0.0444535, 0.023353, 0.0279052, 0.0106062, 0.0237208, 0.0422415, 0.00624842, 0.00563225, 0.697402, 0.0778012]

 get rid of multiple copying
  0.060999 seconds (4.17 k allocations: 114.632 MiB, 12.60% gc time)
[0.0208703, -0.0108993, 0.0211875, 0.045544, 0.0171647, 0.0444521, 0.023389, 0.0278263, 0.0106049, 0.0237185, 0.0421798, 0.00622643, 0.00557982, 0.751189, 0.0723304]

 get rid of multiple copying and use correct fusion operators
  0.055456 seconds (1.04 k allocations: 42.388 MiB, 8.85% gc time)
[0.0208703, -0.0108993, 0.0211875, 0.045544, 0.0171647, 0.0444521, 0.023389, 0.0278263, 0.0106049, 0.0237185, 0.0421798, 0.00622643, 0.00557982, 0.751189, 0.0723304]

 loop over observations
  0.176941 seconds (1.20 M allocations: 384.960 MiB, 18.23% gc time)
[0.0208703, -0.0108993, 0.0211875, 0.045544, 0.0171647, 0.0444521, 0.023389, 0.0278263, 0.0106049, 0.0237185, 0.0421798, 0.00622643, 0.00557982, 0.751189, 0.0723304]

 no-alloc
  0.022569 seconds (759 allocations: 20.797 KiB)
[0.0208703, -0.0108993, 0.0211875, 0.045544, 0.0171647, 0.0444521, 0.023389, 0.0278263, 0.0106049, 0.0237185, 0.0421798, 0.00622643, 0.00557982, 0.751189, 0.0723304]

You guys are seriously amazing. Thank you for taking valuable time out of your day to help me with this.

@rdeits I was just grappling with why your suggested looped version wasn’t performing well. I never would have figured out zero(promote_type(eltype(x), eltype(b))). The code I had just written matches your update exactly, except for the type promotion you suggested (on both the SSE and b_dot_x variables). One other question: does it matter if @inbounds is on the for loop over x? I put it on both for loops, but noticed you only had it on the main one.

@pkofod I know of @profile, but I have no idea how to use it. I’m guessing it’s a red flag to see all of those instances of _unsafe_getindex?

As always, I’m reminded of the first rule of Julia performance: “Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability.”


I’m like 90% sure you only need a single @inbounds around the outermost loop, as it will apply that flag to all the indexing operations within the entire expression (i.e. both loops).
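
For what it’s worth, a quick sketch of that (illustrative names):

function dotcols(A, b)
    s = zero(promote_type(eltype(A), eltype(b)))
    # A single @inbounds here disables bounds checks for all indexing
    # inside the annotated block, inner loop included:
    @inbounds for i in 1:size(A, 2)
        for j in 1:size(A, 1)
            s += A[j, i] * b[j]
        end
    end
    return s
end

dotcols(randn(3, 4), randn(3))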
