Efficient linear algebra for large-scale problems in JuMP



Hello all. I’ve finally had a chance to sit down and look carefully at some of the code in JuMP 0.19. In my work I very frequently have to solve problems with ~10^6 variables. Implementing these obviously involves a lot of linear algebra, and I frequently find that, while I’m very grateful for the specialized code that already exists within JuMP, it doesn’t quite cut it for problems of this scale. So I wanted to open this thread to get some feedback from the JuMP devs on current thinking and future plans, and to find out where I might be able to be of assistance.

It seems to me that the fundamental problem here is that all of the AbstractJuMPScalar objects are mutable objects which typically wind up using heap-allocated memory. The experimenting I’ve done so far suggests that it would probably be sufficient if we could build ways to allocate these all at once during linear algebra operations. As a simple example of what I’m talking about, see

julia> using BenchmarkTools

julia> @benchmark begin
           A = Int[]
           for i ∈ 1:100
               push!(A, 1)
           end
       end
  memory estimate:  2.27 KiB
  allocs estimate:  7
  minimum time:     683.370 ns (0.00% GC)
  median time:      713.286 ns (0.00% GC)
  mean time:        799.731 ns (8.64% GC)
  maximum time:     304.204 μs (99.63% GC)
  samples:          10000
  evals/sample:     154

julia> @benchmark ones(Int, 100)
  memory estimate:  896 bytes
  allocs estimate:  1
  minimum time:     75.177 ns (0.00% GC)
  median time:      80.265 ns (0.00% GC)
  mean time:        99.438 ns (16.71% GC)
  maximum time:     49.149 μs (99.41% GC)
  samples:          10000
  evals/sample:     972

The simplest example of this is the inner product here. What this does is create an AbstractJuMPScalar “zero” and then append to it one term at a time. Instead, it would be more efficient to allocate everything at once. Essentially the same thing happens in matrix multiplication here. I think implementing a more efficient version should be relatively straightforward, but I would be interested to hear if there are any pitfalls that make this much more complicated than I think it is.
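To be concrete about what I mean by “allocate it all at once,” here is a rough sketch. The types and names are purely illustrative (a tuple of vectors stands in for an affine expression), not JuMP internals:

```julia
# Hypothetical sketch of two ways to build the inner product a'x as an
# affine expression. Illustrative types only, not JuMP's actual code.

# Current style: start from "zero" and append one term at a time; the
# backing vectors grow via push!, potentially reallocating several times.
function dot_append(a::Vector{Float64}, xvars::Vector{Int})
    vars, coeffs = Int[], Float64[]
    for i in eachindex(a)
        push!(vars, xvars[i])
        push!(coeffs, a[i])
    end
    return vars, coeffs
end

# Proposed style: the result has exactly length(a) terms, so the storage
# can be allocated in a single step.
function dot_prealloc(a::Vector{Float64}, xvars::Vector{Int})
    return copy(xvars), copy(a)
end
```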

Another issue is keeping around lots of useless 0’s, i.e. AffExpr([x], [0.0], 0.0). You can wind up in some situations where there’s simply a huge number of these lying around, and you are spending a huge amount of computation time adding them to some expression. I remember @miles.lubin mentioning that he was considering turning AbstractJuMPScalars into Dicts rather than Vectors, but I haven’t seen an issue in the JuMP repo. Are there any concrete plans for this? If effort is put into improving the linear algebra, it would probably be better to do both of these things at the same time.
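A Dict-backed representation would let zero coefficients be dropped on insertion, so something like AffExpr([x], [0.0], 0.0) would collapse to an empty expression. A minimal sketch of the idea, assuming a plain Dict from variable index to coefficient (not JuMP’s actual design):

```julia
# Sketch of Dict-backed affine-expression storage. Zero coefficients are
# never stored, and terms that cancel to zero are removed.
function add_term!(terms::Dict{Int,Float64}, var::Int, coeff::Float64)
    coeff == 0.0 && return terms          # skip useless zeros entirely
    newcoeff = get(terms, var, 0.0) + coeff
    if newcoeff == 0.0
        delete!(terms, var)               # cancellations drop out too
    else
        terms[var] = newcoeff
    end
    return terms
end
```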

Anyway, thanks all, looking forward to JuMP being in a state where it really flies on big problems, and I hope I can be helpful in getting it there. If I’m just way behind and much of this is being worked on already, all the better!


The fundamental problem is that the sum of two affine expressions can’t be stored in a constant amount of memory. That’s very different from floating-point numbers. I discuss this issue and how JuMP deals with it using macros at https://youtu.be/JaA302TfI7I?t=7m15s.
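To spell out why this matters asymptotically, here is a hypothetical sketch (a plain vector of coefficients stands in for an affine expression; not JuMP’s actual types) of what naive repeated addition does:

```julia
# `a + b` for affine expressions must allocate a fresh expression holding
# the terms of both operands -- there is no constant-size representation,
# unlike adding two Float64s. Illustrative sketch only.
naive_add(a::Vector{Float64}, b::Vector{Float64}) = vcat(a, b)

# Summing N single-term expressions this way copies 1 + 2 + ... + N terms:
# O(N^2) work and allocation, which is what the macros are built to avoid.
function naive_sum(terms::Vector{Vector{Float64}})
    acc = Float64[]
    for t in terms
        acc = naive_add(acc, t)   # allocates a fresh, longer expression each time
    end
    return acc
end
```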

You seem to be on the right track.

This is my #2 coding priority after https://github.com/JuliaOpt/MathOptInterface.jl/issues/211.


Ah, that explains a lot thanks. In particular, I was missing two huge pieces of the picture:

  • I was missing the significance of sizehint!; it seems it can drastically improve performance in many of the worst cases. By the way, is there some reason there is no size hint in _dot? I kept looking at that function and thinking nothing else had one either.
  • Having seen all of the operator overloading, I didn’t think the macros were really doing anything most of the time.

I know this is a bit of a loaded question, but could you perhaps give a very brief description of what extra steps are happening in the macros that don’t just come from the operator overloading (your YouTube talk actually made it seem like most of what you’d do in the macro is implemented in the operator overloading)? As a concrete example, I often have problems where I need to do something like


where A<:AbstractMatrix{<:Number}, B<:AbstractMatrix{<:Number}, x<:AbstractSparseMatrix{JuMP.Variable}. I usually wind up writing an explicit function to make these tractable, but I’ve always timed it against operations I do outside of macros. Would something different happen in an expression like this within a macro? (It’s just so much harder for me to understand the macro code than the operator code, so I don’t think I understand what’s going on there at all.)

Julia is a very different and interesting language

Not that I’m aware of. PRs are welcome!

sum(a[i]*x[i] for i in 1:N) is translated within the macros to an efficient for loop that appends to an expression. When a[i] and x[i] are scalars, this is asymptotically faster than if you write the same sum outside of a macro. All vector operations essentially fall back on the operator overloads, however.
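A rough sketch of the kind of code this lowers to, with illustrative names standing in for the real AffExpr machinery (this is not JuMP’s actual generated code):

```julia
# Instead of evaluating sum(a[i]*x[i] for i in 1:N) by repeatedly adding
# expressions (which re-copies all earlier terms), the macro builds a single
# accumulator that is appended to in place: O(N) total work.
function macro_style_sum(a::Vector{Float64}, x::Vector{Int}, N::Int)
    vars   = Int[]
    coeffs = Float64[]
    sizehint!(vars, N)
    sizehint!(coeffs, N)
    for i in 1:N
        push!(vars, x[i])      # append one term: O(1) amortized
        push!(coeffs, a[i])
    end
    return vars, coeffs        # stands in for the resulting AffExpr
end
```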


If N=1e6, does this result in an expression of length 10^6? What’s the memory footprint of doing it this way?


Yes, it does. That’s the best you could possibly hope for if all the x[i] are unique and all the a[i] are nonzero. The other two cases (repeated variables and zero coefficients) will be better addressed when we change the underlying storage for AffExpr to a Dict mapping variables to coefficients.


Ah, ok that makes sense thanks.

I’m still wondering whether there is anything to be gained from using properly allocated array constructors, as opposed to sizehint!, within the operator code. Have you considered both possibilities? These results have me wondering:

@benchmark begin
    A = Float64[]
    sizehint!(A, N)
    for i ∈ 1:N
        push!(A, 1.0)
    end
end
  memory estimate:  8.14 KiB
  allocs estimate:  2
  minimum time:     4.952 μs (0.00% GC)
  median time:      5.209 μs (0.00% GC)
  mean time:        6.722 μs (16.05% GC)
  maximum time:     7.592 ms (99.86% GC)
  samples:          10000
  evals/sample:     6

@benchmark ones(Float64, N)
  memory estimate:  8.13 KiB
  allocs estimate:  1
  minimum time:     904.118 ns (0.00% GC)
  median time:      1.081 μs (0.00% GC)
  mean time:        1.581 μs (28.23% GC)
  maximum time:     2.402 ms (99.89% GC)
  samples:          10000
  evals/sample:     17

Granted, I don’t know whether this is close enough to a realistic case to mean anything. I’ve often written “hacks” of special-case matrix operations that allocate entire vectors (which I think can be generalized), and anecdotally they usually seem noticeably faster, which is what prompted me to open this thread. I feel like I’m still missing something.
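For context, the kind of “preallocating hack” I mean looks roughly like this. Everything here is illustrative (a (variables, coefficients) tuple stands in for an expression; none of these names are JuMP internals):

```julia
# Hypothetical sketch of a preallocating matvec: for y = A*x with numeric A
# and symbolic x, each output expression has exactly size(A, 2) terms, so
# its storage can be allocated up front rather than grown with push!.
function matvec_prealloc(A::AbstractMatrix{Float64}, xvars::Vector{Int})
    m, n = size(A)
    out = Vector{Tuple{Vector{Int},Vector{Float64}}}(undef, m)
    for i in 1:m
        coeffs = Vector{Float64}(undef, n)   # exactly one allocation per row
        for j in 1:n
            coeffs[j] = A[i, j]
        end
        out[i] = (copy(xvars), coeffs)       # (variables, coefficients) pair
    end
    return out
end
```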