Matrix multiplication time with JuMP variables

I am using JuMP to solve some large problems. A bottleneck right now is computing a matrix-vector product.
M is a short and fat matrix of floats, and x is a variable vector. Mx will be a vector of moderate length, but forming this vector is prohibitively slow. Is there anything I can do to accelerate this?
For example, let M be 10^4 x 10^4 and x be of length 10^4. If everything is a plain Float64, computing Mx in Julia takes less than 0.1 s for me. But if x is a vector of JuMP variables, it takes more than 30 s.
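Roughly what I'm running (a sketch; rand stands in for my actual data, and the timings in the comments are what I observe):

using JuMP
n = 10^4
M = rand(n, n)
v = rand(n)
@time M * v   # plain Float64 matrix-vector product: well under 0.1 s

model = Model()
@variable(model, x[1:n])
@time M * x   # same product with JuMP variables: tens of seconds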
Thanks for any advice!

@blegat

Hi @mhar, I’ve moved your question to the “Optimization (Mathematical)” section, which is where I try to keep all the JuMP-related questions. There’s no need to specifically tag @blegat (I tend to read most questions).

Is M dense or sparse?

If dense: JuMP is not really designed for dense problems. Your M * x is creating 10^4 linear expressions, each with 10^4 affine terms. That’s a lot of stuff! JuMP is mainly designed with the assumption that you have many constraints (10^4 is not large), each with a small number of terms.

The algorithms and data structures are very different, so there’s little point in comparing with Float64 multiplication.
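A small example of what M * x actually builds (illustrative, not your exact code):

using JuMP
model = Model()
@variable(model, x[1:3])
M = [1.0 2.0 3.0; 4.0 5.0 6.0]
y = M * x   # a Vector{AffExpr}, not a Vector{Float64}
y[1]        # x[1] + 2 x[2] + 3 x[3], one affine term per column of M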

As the backing data structure for each affine expression, JuMP uses an OrderedCollections.OrderedDict. You can measure that the bottleneck is creating these dictionaries:

using JuMP
using OrderedCollections: OrderedDict
n = 10^4
model = Model();
@variable(model, x[1:n]);
M = rand(n, n);
# Build one variable => coefficient dictionary per row, as JuMP does internally:
@time [OrderedDict(x[j] => M[i, j] for j in 1:n) for i in 1:n];

What is the real problem that you are trying to solve? Perhaps there is a better way.


Thanks for the reply!

I don’t completely follow the connection with the code you’ve written. Why does it need to allocate a new entry for each j?
I modified the code slightly to get some timings:

using JuMP
using OrderedCollections: OrderedDict
n = 10^4
model = Model();
@variable(model, x[1:n]);
M = rand(n, n);
function formdict(x, M, n1, n2)
    # One dictionary per row, with n1 terms each
    [OrderedDict(x[j] => M[i, j] for j in 1:n1) for i in 1:n2]
end
for n1 in 1:10
    @time formdict(x, M, n1, n)
end
  0.013523 seconds (110.00 k allocations: 7.095 MiB)
  0.009109 seconds (130.00 k allocations: 7.553 MiB)
  0.009890 seconds (150.00 k allocations: 8.011 MiB)
  0.604213 seconds (170.00 k allocations: 8.469 MiB, 98.54% gc time)
  0.006124 seconds (210.00 k allocations: 11.063 MiB)
  0.006867 seconds (230.00 k allocations: 11.520 MiB)
  0.007768 seconds (250.00 k allocations: 11.978 MiB)
  1.671560 seconds (270.00 k allocations: 12.436 MiB, 99.48% gc time)
  0.009563 seconds (310.00 k allocations: 16.861 MiB)
  0.010252 seconds (330.00 k allocations: 17.319 MiB)

This shows a similar pattern in which garbage collection takes a big toll: it hits only some iterations, but accounts for nearly all of the time when it does, and the pauses grow as n1 grows. It also seems weird that even for n1 = 1 the number of allocations is so high (roughly 10·n2).
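As an aside, since @time at top level also picks up garbage left behind by earlier calls, a GC pause can land on an arbitrary iteration. Something like BenchmarkTools.jl should give steadier numbers:

using BenchmarkTools
# @btime runs several samples and reports the minimum,
# which filters out one-off GC pauses; $ interpolates globals.
@btime formdict($x, $M, 5, $n);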

I see many possible ways to model the problem I'm working on, but I'm not sure how to avoid having a lot of terms in each expression. I can write up a simple version of it here if you think that will help determine whether this is feasible in JuMP.

Why does it need to allocate a new entry for each j?

Because we don't store M * x as a matrix times a vector of values; each row i becomes the scalar summation equivalent, sum_j M[i, j] * x[j], with one dictionary entry allocated per term.
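Concretely, each row is one scalar AffExpr whose terms live in that dictionary; something like:

using JuMP
model = Model()
@variable(model, x[1:2])
row = 2.0 * x[1] + 3.0 * x[2]   # one scalar AffExpr
row.terms   # OrderedDict(x[1] => 2.0, x[2] => 3.0)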

I’ve been playing with the original problem, and I think there is some sparsity structure. Do I need to manually exploit it? The following experiment suggests that the current inner product doesn’t leverage the sparsity.

using JuMP
using SparseArrays
using SumOfSquares   # provides SOSModel
N = 10^5
n = 10^2
model = SOSModel()
longvar = @variable(model, [1:N])
shortvar = @variable(model, [1:n])
longdensevec = zeros(N);
longdensevec[1:n] .= 1.0;         # only the first n entries are nonzero
longsparsevec = sparse(longdensevec);
shortvec = ones(n);
@time longdensevec' * longvar;    # length-N dense vector, mostly zeros
@time longsparsevec' * longvar;   # same vector stored sparsely (n nonzeros)
@time shortvec' * shortvar;       # length-n dense vector, no zeros

The output is
julia> @time longdensevec' * longvar;
  0.000207 seconds (531 allocations: 20.234 KiB)

julia> @time longsparsevec' * longvar;
  0.000540 seconds (831 allocations: 28.062 KiB)

julia> @time shortvec' * shortvar;
  0.000043 seconds (531 allocations: 20.234 KiB)

Why is the sparse vector actually the slowest of these? (It seems like this pattern holds even for larger N, when I’d think that sparsity should help even more…)

Thanks for your advice!