Slow model generation in JuMP

I need to decide between JuMP and Python-MIP for a large MILP. Python-MIP has some impressive benchmarks on its website and provides more features, such as lazy constraints and callbacks with the CBC solver. But here I am more interested in speed than features.

So far I have only tried JuMP, and I wonder why model generation is slow on a particular MILP: it takes longer to generate the model than for the solver to solve it. I have not put together a full example, only a simple model containing the types of constraints that cause the performance bottleneck. On the actual problem, where the dataset is much larger, the slowdown is noticeable.

The input data is contained in various CSV files and some user inputs, so the data() function generates data and saves it in dataframes.

The constraints should be easy to follow from the code and comments; otherwise, I can provide more explanation.

The simple_model(…) function already takes quite a few arguments, and on the actual problem the number of arguments for this function and for main() is much higher (running over 4 lines!). This is largely due to my inexperience with Julia.

Any help with speeding up the code and tidying the function arguments will be much appreciated.

using DataFrames, JuMP, Random

function main()
    df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec = data() # All these will be data files as arguments to the main function
    model = simple_model(df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec)
    print(model) # or optimize model
end

function simple_model(df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec)
    model = Model()
    @variable(model,x[i=1:nrow(df1)]>=0)   #most variables are binary in the actual problem

    # Constraint for minimum expense limit: sum(x * expenses where PARENT is P1 or P5) / sum(total expenses where PARENT is P1 or P5) >= minimum expenses in controls_min_expenses.LIMIT
    for i in 1:nrow(controls_min_expenses)
        index_parent_limit = findall(x -> x == controls_min_expenses.PARENT[i] , df1.PARENT)
        id_parent_limt_index_df1 = df1.IDENTITY[index_parent_limit]
        parent_index = findall(x -> x ∈ id_parent_limt_index_df1 , df1.IDENTITY)
        @constraint(model,sum(x[j] *df1.EXPENSES[j] for j ∈ parent_index) >= sum(df1[!,:EXPENSES][j] for j ∈ parent_index) * controls_min_expenses.LIMIT[i])
    end

    # Constraint ensuring that all elements in a group are equal
    get_GROUP = unique(df2.GROUP)
    @variable(model,group_restrict[1:length(get_GROUP)])
    for i in 1:length(get_GROUP)
        index = findall(x -> x == get_GROUP[i] , df2.GROUP)
        id = df2.IDENTITY[index]
        id_index_df1 = findall(x -> x ∈ id , df1.IDENTITY)
        @constraint(model,[j ∈ id_index_df1],x[j] - group_restrict[i] == 0)
    end

    # Constraint to exclude if CHILD is C1 or C5 (as given by the vector exclude_vec in data()), and it is not part of a GROUP
    for i in 1:length(exclude_vec)
        id_temp2 = filter(x -> x.CHILD == exclude_vec[i],df1).IDENTITY
        for j in eachindex(id_temp2)
            if isempty(findall(x -> x == id_temp2[j] , df2.IDENTITY))
                index_df1 = findall(x -> x == id_temp2[j] , df1.IDENTITY)
                fix.(x[index_df1], 0; force = true)
            end
        end
    end

    return model
end

function data()
    Random.seed!( 0 )
    df1 = DataFrame(INDEX = collect(1:10), IDENTITY = string.("ID",collect(1:10)), NAME = randstring.(rand(5:10,10)),
    PARENT = ["P1","P1","P1","P2","P3","P4","P3","P5","P5","P6"],CHILD = ["C1","C1","C2","C2","C3","C3","C4","C5","C5","C5"],
    GRAND_CHILD = ["GC1","GC1","GC1","GC2","GC3","GC3","GC4","GC4","GC5","GC6"], EXPENSES = rand(10), ALLOWANCE = rand(10))
    # Data for groupings
    df2 = DataFrame(IDENTITY = [df1.IDENTITY[1],df1.IDENTITY[3],df1.IDENTITY[4],df1.IDENTITY[5],df1.IDENTITY[7],df1.IDENTITY[8],df1.IDENTITY[10]],
    GROUP = ["Privileged", "Privileged","Upper","Upper","Privileged","Working","Working"])
    # Data for constraints setting relative limits
    controls_min_expenses = DataFrame(PARENT = ["P1","P5"], LIMIT = [0.2,0.7])
    controls_max_allowance = DataFrame(GROUP = "Privileged", MAX_ALLOW = 0.8)
    # Data for excluding if CHILD is C1 or C5, and they are not part of a GROUP
    exclude_vec = ["C1", "C5"]
    return df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec
end

It’s worth identifying whether the time is spent in JuMP or DataFrames.

Use something like TimerOutputs.jl (KristofferC/TimerOutputs.jl on GitHub, formatted output of timed sections in Julia), or comment out the JuMP-related lines and time how long it takes.

I tried to run the code you provided, but got the following error:

ERROR: UndefVarError: binary_var not defined
Stacktrace:
 [1] main() at /home/mtanneau/sandbox/slow_jump.jl:10
 [2] top-level scope at ./timing.jl:174 [inlined]
 [3] top-level scope at ./REPL[3]:100

You’re passing binary_var when calling simple_model in main, but it’s not defined anywhere. You can remove the binary_var argument, since it isn’t used in the body of the simple_model function.

That being said, if I remove the print(model) and just return the model, running main() takes <150 microseconds on my laptop (I’m running Julia 1.5.4).

julia> using BenchmarkTools
julia> @benchmark main()
BenchmarkTools.Trial: 
  memory estimate:  137.78 KiB
  allocs estimate:  1962
  --------------
  minimum time:     123.800 μs (0.00% GC)
  median time:      138.900 μs (0.00% GC)
  mean time:        176.367 μs (11.58% GC)
  maximum time:     12.641 ms (96.74% GC)
  --------------
  samples:          10000
  evals/sample:     1

Not exactly slow…


For filtering and transforming with dataframes, the package DataFramesMeta is very useful.
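
As a rough illustration of the kind of filtering DataFramesMeta makes easier, the sketch below selects the rows whose CHILD appears in exclude_vec. It assumes a recent DataFramesMeta release that provides @rsubset (older releases used @where instead), and the small data frame is made up for the example.

```julia
using DataFrames, DataFramesMeta

# Made-up stand-in for df1 (assumption: only the columns needed here)
df1 = DataFrame(IDENTITY = ["ID1", "ID2", "ID3"],
                CHILD    = ["C1", "C2", "C5"])
exclude_vec = ["C1", "C5"]

# Row-wise filter: keep rows whose CHILD value is listed in exclude_vec
excluded = @rsubset(df1, :CHILD in exclude_vec)

println(excluded.IDENTITY)
```

This replaces one filter call per entry of exclude_vec with a single pass over df1.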

I find composite types quite useful for structuring arguments and encapsulating optimization problems. I would structure the code along these lines:

using DataFrames, JuMP, Random
using Cbc

## This is our made-up custom "composite type":
mutable struct MyProblemSpecType
    model::JuMP.AbstractModel
    df1::DataFrame
    df2::DataFrame
    controls_min_expenses::DataFrame
    controls_max_allowance::DataFrame
    exclude_vec::Vector{String}
    ## Possibly add something for index sets too?
end

function main()
    ## The data() function will initialise a MyProblemSpecType instance:
    this_instance = data()  
    # Now our custom "composite type" can be passed around between functions:
    simple_model(this_instance)
    print(this_instance.model)
    set_optimizer(this_instance.model, Cbc.Optimizer)
    optimize!(this_instance.model)

    # Print the solution
    x = this_instance.model[:x]
    for i in 1:nrow(this_instance.df1)
        println("x[$(i)] = ", value(x[i]))
    end

end

function simple_model(instance::MyProblemSpecType)
    model = instance.model ## get the model from the composite type
    df1 = instance.df1 
    df2 = instance.df2
    controls_min_expenses = instance.controls_min_expenses 
    controls_max_allowance = instance.controls_max_allowance
    exclude_vec = instance.exclude_vec
    ## ... rest of your code here ..
    return nothing ## no requirement to return if not needed, we can access `instance` from the scope of main()
end

function data()
    ## Set-up the data as before
    ## [... your data set-up code here...]
    ## but rather than this:
    # `return df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec`
    ## *instead* construct and initialise the composite type with the data
    instance = MyProblemSpecType(JuMP.Model(), df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec)
    ## Now everything moves together with your model.
    return instance
end



I’ll also add a link to my recent addition to the JuMP documentation:

https://jump.dev/JuMP.jl/dev/tutorials/Getting%20started/performance_tips/#The-"time-to-first-solve"-issue


I have found print(model) to be a bottleneck for me, also.

@Popeye: do you really need that?


Good spot! I will check that package on the actual problem and post back what I get.

I have edited the code to remove binary_var. Initially I had it in the code to specify which variables should be set as binary, but I removed it just before posting to simplify the example.

The code runs fast in this example, but with larger data, say 5k rows in df1 and 2k rows in df2, it becomes slow.
Would you define these constraints in a similar way when you have to compare across different dataframes or other data structures (as in this example)? I don’t know whether findall and filter are the best functions to use in this case.
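
One possible alternative to the repeated findall scans, not taken from the replies in this thread, is to build lookup tables once and reuse them. The sketch below uses only Base Julia; rows_by_key is a hypothetical helper name, the vectors mirror the columns of df1 and df2 in the example, and it assumes IDENTITY values are unique within df1.

```julia
# Hypothetical helper: map each distinct key to the row positions where it occurs (one pass).
function rows_by_key(keys::AbstractVector)
    index = Dict{eltype(keys), Vector{Int}}()
    for (row, key) in enumerate(keys)
        push!(get!(() -> Int[], index, key), row)
    end
    return index
end

# Columns mirroring the example data
df1_identity = string.("ID", 1:10)
df1_parent   = ["P1","P1","P1","P2","P3","P4","P3","P5","P5","P6"]
df2_identity = ["ID1","ID3","ID4","ID5","ID7","ID8","ID10"]
df2_group    = ["Privileged","Privileged","Upper","Upper","Privileged","Working","Working"]

# Built once, before any constraint loop
parent_rows = rows_by_key(df1_parent)                                  # PARENT => rows of df1
id_to_row   = Dict(id => row for (row, id) in enumerate(df1_identity)) # assumes unique IDENTITY
group_rows  = rows_by_key(df2_group)                                   # GROUP => rows of df2
grouped_ids = Set(df2_identity)                                        # fast membership test

# Rows of df1 belonging to one group, without scanning df1
privileged_df1_rows = [id_to_row[df2_identity[r]] for r in group_rows["Privileged"]]

println(parent_rows["P1"])
println(privileged_df1_rows)
println("ID2" in grouped_ids)
```

Each lookup inside the constraint loops then costs roughly constant time instead of a full scan of the column, which is where the difference should show up at 5k and 2k rows. Whether this is the actual bottleneck still needs to be confirmed by timing.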

Thank you! Composite types are exactly what I need to tidy the code. If I may clarify one point:
How can I put data in the mutable struct when reading it from CSV files? This is what the code looks like (before tidying up with composite types) when reading data from CSV files rather than generating random data.

function data(file_df1, file_df2,file_controls_min_expenses,file_controls_max_allowance)
    df1 = DataFrame(CSV.File(file_df1))  # read from the path passed in as an argument
    ....
end

function main(file_df1, file_df2,file_controls_min_expenses,file_controls_max_allowance;time_limit=300,n_threads=4)
...
end

function simple_model(file_df1, file_df2,file_controls_min_expenses,file_controls_max_allowance,timelimit,nthreads)
    model = Model()
    # time limit and nthreads are set using set_optimizer_attribute
...
end
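
For the solver options specifically, one way to keep the argument lists short is to bundle them into a small settings type with defaults. This is only a sketch: SolverSettings and its field names are hypothetical, and main here is a stand-in that just returns the settings it received rather than building a model.

```julia
# Hypothetical settings type; the defaults match the keyword defaults used above
Base.@kwdef struct SolverSettings
    time_limit::Float64 = 300.0   # seconds
    n_threads::Int = 4
end

# Stand-in for the real main(): one settings argument instead of several keywords
function main(; settings::SolverSettings = SolverSettings())
    # data(...) and simple_model(..., settings) would be called here
    return settings
end

default_run = main()
short_run   = main(settings = SolverSettings(time_limit = 60.0))

println(default_run)
println(short_run)
```

New options can then be added as fields with defaults without touching the signatures of main or simple_model.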

No, I don’t need it. I was only printing the model to check that it is built correctly.
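
If the goal is only a sanity check, a cheaper option than print(model) may be to look at counts of variables and constraints. The sketch below uses a tiny made-up model and assumes a JuMP version that provides num_variables, list_of_constraint_types and num_constraints.

```julia
using JuMP

# Tiny made-up model, standing in for the real one
model = Model()
@variable(model, x[1:3] >= 0)
@constraint(model, sum(x) >= 1)

# Summary counts instead of printing every constraint
println("variables: ", num_variables(model))
for (F, S) in list_of_constraint_types(model)
    println(F, " in ", S, ": ", num_constraints(model, F, S))
end
```

On a large model this prints a handful of lines rather than thousands, which is usually enough to spot a missing or duplicated block of constraints.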

I would usually use a Dict type, such as

filepathDict = Dict{Symbol,String}()
filepathDict[:basepath] = "path/to/your/data/file/instance/" ## directory
filepathDict[:df1] = "df1_filename.csv"
...
filepathDict[:controls_max_allowance] = "filename_controls_max_allowance_filename.csv"

then pass this as the argument to data() e.g.

function data(file_path_dict::Dict{Symbol,String})
     df1 = CSV.read(joinpath(file_path_dict[:basepath],file_path_dict[:df1]), DataFrame)
....
     instance = MyProblemSpecType(JuMP.Model(), file_path_dict, df1, df2, controls_min_expenses, controls_max_allowance, exclude_vec)
     return instance
end

I also augment the composite type so that it records the file paths:

mutable struct MyProblemSpecType
    model::JuMP.AbstractModel
    filepathDict::Dict{Symbol,String}  ## added
    df1::DataFrame
    df2::DataFrame
    controls_min_expenses::DataFrame
    controls_max_allowance::DataFrame
    exclude_vec::Vector{String}
end

Like @odow previously said, you need to check where most of the time is spent.


This is how I would proceed:

  1. Build an example that is slow enough (so that timing is relevant) but still runs in a few seconds (so that you can run it multiple times).
  2. See where most of the time is spent. My preferred tool for doing that is TimerOutputs, linked above, which tracks run time and memory allocations. This means annotating your code to measure each component’s runtime (and memory).
  3. Run your code with timing, and look at the results. From there, find the bottleneck, and make it more efficient.
  4. Repeat
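
The annotation step above could look roughly like the sketch below. It is a simplified stand-in for simple_model, not the original code: timed_build and its arguments are hypothetical, the data is made up, and only the minimum-expense constraint is included. The data-side lookup and the JuMP-side constraint construction are timed under separate labels.

```julia
using JuMP, TimerOutputs

# Hypothetical, simplified model builder with timed sections
function timed_build(to::TimerOutput, parent::Vector{String},
                     expenses::Vector{Float64}, limits::Dict{String,Float64})
    model = Model()
    @variable(model, x[1:length(parent)] >= 0)
    for (p, limit) in limits
        # Data-side work: find the rows that belong to parent p
        rows = @timeit to "data lookup" findall(isequal(p), parent)
        # JuMP-side work: build and add the constraint
        @timeit to "JuMP constraint" begin
            @constraint(model,
                sum(x[j] * expenses[j] for j in rows) >= limit * sum(expenses[rows]))
        end
    end
    return model
end

to = TimerOutput()
model = timed_build(to, ["P1", "P1", "P2", "P5"], [1.0, 2.0, 3.0, 4.0],
                    Dict("P1" => 0.2, "P5" => 0.7))

# Table of time and allocations per labelled section
print_timer(to)
```

Comparing the two rows of the table on a realistic data size shows whether the lookups or the JuMP macros dominate.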

Thank you! It makes sense


Thanks again.
