Speeding up JuMP model creation with sets that depend on other indexes

I am looking for tips on building a model with variables, expressions, and constraints over sets that depend on indexes from other sets. For instance:

@expression(model, [i = 1:5, j = blah[i]], ...)

where blah is a dictionary.
Using add_to_expression! with a Dict gives some speed-up, but it also makes the model harder to read, which we would like to avoid.
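For reference, the add_to_expression! pattern mentioned above looks roughly like this (a minimal sketch with made-up names, not the actual model):

```julia
using JuMP

model = Model()
@variable(model, x[1:5] >= 0)

# Build the expressions incrementally in a Dict instead of via @expression.
# This skips JuMP's container machinery, but hides the model structure.
incoming = Dict{Int,AffExpr}()
for i in 1:5
    expr = get!(incoming, i, AffExpr(0.0))
    add_to_expression!(expr, 2.0, x[i])  # in-place: expr += 2.0 * x[i]
end
```

The speed-up comes from mutating each AffExpr in place rather than rebuilding it, at the cost of the declarative @expression syntax.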

I have an MWE below, based on the larger model in GitHub - TulipaEnergy/TulipaEnergyModel.jl: Tulipa Energy Model.
The context here is, in a few words, flow in a graph over time blocks. These time blocks depend on the flow itself.
Furthermore, there are balancing constraints that happen over possibly different time blocks.

using JuMP

function create_model(num_vertices)
  model = Model()

  V = 1:num_vertices
  E = vcat([[(i, j) for j = i+1:num_vertices] for i = 1:num_vertices]...)

  T1 = [1:3, 4:6, 7:9, 10:12]
  T2 = [1:4, 5:8, 9:12]
  T3 = [1:6, 7:12]
  T_options = [T1, T2, T3]

  TV = Dict(
    v => T_options[v%3+1]
    for v in V
  )
  TE = Dict(
    (a, b) => T_options[(a+b)%3+1]
    for (a, b) in E
  )

  @variable(model, flow[e ∈ E, TE[e]] ≥ 0)

  @objective(model, Min,
    sum(flow[e, T] for e in E, T ∈ TE[e])
  )

  @expression(model,
    incoming[v ∈ V, Tᵥ ∈ TV[v]],
    sum(
      length(Tₑ ∩ Tᵥ) * flow[(src, v), Tₑ]
      for src in [src for (src, dst) in E if dst == v], Tₑ ∈ TE[(src, v)]
    )
  )
  @expression(model,
    outgoing[v ∈ V, Tᵥ ∈ TV[v]],
    sum(
      length(Tₑ ∩ Tᵥ) * flow[(v, dst), Tₑ]
      for dst in [dst for (src, dst) in E if src == v], Tₑ ∈ TE[(v, dst)]
    )
  )
  @constraint(model,
    balance[v ∈ V, Tᵥ ∈ TV[v]],
    incoming[v, Tᵥ] == outgoing[v, Tᵥ]
  )

  return model
end

function main(; num_vertices=1000)
  create_model(num_vertices)
end

main()
@time main()
@profview main()

Thanks in advance.

[src for (src, dst) in E if dst == v]

The issue is almost certainly lines like this, because they require iterating over all the edges many times.

You can improve with a data structure like:

E_by_dst = Dict{Int,Vector{Int}}()
for (src, dst) in E
    if !haskey(E_by_dst, dst)
        E_by_dst[dst] = Int[]
    end
    push!(E_by_dst[dst], src)
end

Then your expression becomes:

  @expression(model,
    incoming[v in V, Tᵥ in TV[v]],
    sum(
      length(Tₑ ∩ Tᵥ) * flow[(src, v), Tₑ]
      for src in E_by_dst[v], Tₑ in TE[(src, v)]
    )
  )
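The outgoing expression needs the symmetric lookup. A sketch with a toy edge list (same pattern; get! collapses the haskey check into one call):

```julia
# Map each source vertex to the list of its destination vertices,
# mirroring E_by_dst for the `outgoing` expression.
E = [(1, 2), (1, 3), (2, 3)]  # toy edge list for illustration
E_by_src = Dict{Int,Vector{Int}}()
for (src, dst) in E
    push!(get!(E_by_src, src, Int[]), dst)
end
# E_by_src == Dict(1 => [2, 3], 2 => [3])
```

With that, for dst in [dst for (src, dst) in E if src == v] becomes for dst in E_by_src[v].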

It’s basically this issue: JuMP, GAMS, and the IJKLM model | JuMP

You could also consider a radically different data structure. Something like:

import DataFrames
using JuMP

function create_model3(num_vertices)
  V = 1:num_vertices
  E = reduce(vcat, [[(i, j) for j in i+1:num_vertices] for i in 1:num_vertices])
  T1 = [1:3, 4:6, 7:9, 10:12]
  T2 = [1:4, 5:8, 9:12]
  T3 = [1:6, 7:12]
  T_options = [T1, T2, T3]
  TV = Dict(v => T_options[v%3+1] for v in V)
  TE = Dict((a, b) => T_options[(a+b)%3+1] for (a, b) in E)

  df_e = DataFrames.DataFrame(
    [(src = e[1], dst = e[2], t = t) for e in E for t in TE[e]]
  )
  df_v = DataFrames.DataFrame([(v = v, t = t) for v in V for t in TV[v]])

  model = Model()
  df_e.flow = @variable(model, flow[1:size(df_e, 1)] >= 0)
  @objective(model, Min, sum(df_e.flow))
  df = DataFrames.outerjoin(
    DataFrames.combine(
      DataFrames.groupby(df_e, [:dst, :t]),
      :flow => sum => :incoming,
    ),
    DataFrames.combine(
      DataFrames.groupby(df_e, [:src, :t]),
      :flow => sum => :outgoing,
    );
    on = [:t, (:dst => :src)],
  )
  DataFrames.rename!(df, :dst => :v, :t => :t_e)
  df.incoming .= coalesce.(df.incoming, AffExpr(0.0))
  df.outgoing .= coalesce.(df.outgoing, AffExpr(0.0))
  df_join = DataFrames.outerjoin(df, df_v; on = [:v])
  df_join.t_len = [length(t_e ∩ t) for (t_e, t) in zip(df_join.t_e, df_join.t)]
  filter!(r -> r.t_len > 0, df_join)
  for d in DataFrames.groupby(df_join, [:v, :t])
    @constraint(model, sum(d.t_len .* d.incoming) == sum(d.t_len .* d.outgoing))
  end
  return model
end

The issue is almost certainly lines like this, because it requires iterating over all the edges lots of times.

Great tip. We’re actually using MetaGraphsNext, so this step is for src in inneighbor_labels(graph, v) in our code, but there are other things like that which I am still working on changing (e.g., skipping terms where length(...) == 0).
I had to leave it simpler for the benefit of the MWE.

You could also consider a radically different data structure. Something like:

This looks interesting. I haven’t digested the details yet, but it looks like the main ideas are to avoid the internal sparse structure and to do the math and filtering beforehand, right?
I will try to apply this idea to our model and compare with the other simpler idea of using Dict, thanks.

Our main fear here is compromising the readability of the code. With both your idea and mine, we lose the immediate explanation of what is going on, so it will be harder for contributors with less Julia experience to JuMP in (pun intended).
Here is our code, if you want to take a peek: https://github.com/TulipaEnergy/TulipaEnergyModel.jl/blob/895220c9c0e06fac941f72047d2ddbfb3afa07e0/src/model.jl#L84

The data frame design probably needs some careful thought, but with care it would be readable for someone coming from a data/analytics background. It’s not the usual math-programming formulation, though.

If you go with the sets-and-indexing formulation, then you need to avoid O(n^2) operations like @expression(model, [a in A], sum(x[b] for (a_, b) in B if a_ == a)).

The general tip would be: lift the filtering and subsets outside the JuMP code. Don’t write nested summations with if-statements.
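Concretely, the lifting looks like this: one O(|B|) grouping pass up front, after which each sum only touches its own subset (a toy sketch; B_by_a is a made-up name):

```julia
# One pass over B groups the b's by their a-index ...
B = [(1, 10), (1, 20), (2, 30)]
B_by_a = Dict{Int,Vector{Int}}()
for (a, b) in B
    push!(get!(B_by_a, a, Int[]), b)
end
# ... so the JuMP code can index directly instead of filtering:
#   @expression(model, [a in keys(B_by_a)], sum(x[b] for b in B_by_a[a]))
```

The nested if-statement version rescans all of B for every a, turning model construction quadratic; the grouped version is linear in |B|.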

@abelsiqueira I hope you don’t mind me offering yet another “radically different data structure” 🙂

You can consider using “ACSets” exported by the Catlab.jl package. They are a generalization of C-Sets, a structure from category theory, but basically end up being a lot like an in-memory relational database with (optional) equations asserting equivalence (aka commutativity) between paths.

While there are multiple ways to build a database schema for your data, one is as a “decorated graph”. In the schema diagram below, E and V will be sent to sets (in the mathematical sense), and TimeType will be sent to a Julia type. The arrows will be functions. The labeled ovals in the diagram are combinatorial data, and the dot (TimeType) is atomic data.

With that, you can build your model in the following manner. I personally think it is very readable this way, but your opinion may differ. Code like NetData[e,:src] walks forwards along arrows (foreign keys/functions), and code like incident(NetData, 1, :src) finds preimages, so for the example, it returns the set of edges whose source is vertex 1.

using Catlab, JuMP

@present NetSch(FreeSchema) begin
    (V,E)::Ob
    src::Hom(E,V)
    tgt::Hom(E,V)
    TimeType::AttrType
    te::Attr(E,TimeType)
    tv::Attr(V,TimeType)
end

to_graphviz(NetSch, graph_attrs=Dict(:dpi=>"72",:size=>"3.5",:ratio=>"expand"))

@acset_type NetType(NetSch, index=[:src,:tgt])

num_vertices = 100

T1 = [1:3, 4:6, 7:9, 10:12]
T2 = [1:4, 5:8, 9:12]
T3 = [1:6, 7:12]
T_options = [T1, T2, T3]

# make the acset
NetData = NetType{typeof(T1)}()

# add the vertices
V = 1:num_vertices
add_parts!(
    NetData, :V, length(V),
    tv = [T_options[v%3+1] for v in V]
)

# add the edges
E_list = [(i,j) for i in 1:num_vertices for j in i+1:num_vertices]
add_parts!(
    NetData, :E, length(E_list),
    src = first.(E_list), tgt = last.(E_list),
    te = [T_options[(a+b)%3+1] for (a,b) in E_list]
)

E = parts(NetData,:E)

# build the JuMP model 
model = JuMP.Model()
@variable(
    model, 
    flow[e=E, NetData[e,:te]] ≥ 0
)
@objective(
    model, 
    Min,
    sum(flow[e,t] for e=E, t=NetData[e,:te])
)
@expression(
    model,
    incoming[v=V,Tᵥ=NetData[v,:tv]],
    sum(
        length(Tₑ ∩ Tᵥ) * flow[e,Tₑ]
        for e in incident(NetData,v,:tgt), Tₑ in NetData[e,:te]
    )
)
@expression(
    model,
    outgoing[v=V,Tᵥ=NetData[v,:tv]],
    sum(
        length(Tₑ ∩ Tᵥ) * flow[e,Tₑ]
        for e in incident(NetData,v,:src), Tₑ in NetData[e,:te]
    )
)
@constraint(
    model,
    balance[v=V,Tᵥ=NetData[v,:tv]],
    incoming[v,Tᵥ] == outgoing[v,Tᵥ]
)

I’ll also state that this was just the most direct translation of your original code into an ACSet formulation. It may not be the most efficient, especially if all the times could reasonably be represented as a finite set rather than an arbitrary Julia type. Also, for larger models, I second @odow’s suggestion to do subsetting and filtering as much as possible outside of the JuMP code, for which various tools exist in Catlab (e.g. conjunctive queries).


Hi Sean, thanks for the suggestion. In my actual code I am using MetaGraphs. Maybe you can comment on the advantages of using Catlab?
Also, it doesn’t seem to avoid the problem in the model of having a set that depends on the index of another set (NetData[e,:te] depends on e in E).

Hi @abelsiqueira

Yes, C-Sets generalize graphs and data frames (as well as other data structures, such as Petri nets). The code for interacting with the specific C-Set you make (based on your definition of the schema C) is automatically generated from the type information, and is generally on par with “hand-coded” implementations in terms of efficiency. There is also the ability to use various operations from category theory, such as functorial data migration, to easily move data between different schemas. If you’re interested, I’d recommend Shaped Data with Acsets | Owen Lynch | JuliaCon2021 - YouTube! There is also a research paper, Categorical Data Structures for Technical Computing (arxiv.org), which is more mathematical.

Yes, it doesn’t avoid that “problem”, but that seems to be a rather essential part of your data, and part of the intuitive understanding of how the model works. I admit I took only a very literal translation of your model into a C-Set, but the operation of taking a vertex v, finding the set of edges which map into it, and then grabbing the times associated with those edges is very naturally expressed this way. Perhaps some kind of conjunctive query might be slightly faster, or a data migration to a smaller schema that only deals with the construction of those @expression terms. I’ll think about it if I have time.

x-ref: Help improving the speed of a DataFrames operation - #3 by abelsiqueira