Slow performance with Julia on optimization problems

I am getting slow performance with JuliaDB when I run optimization problems. I store the data for all simulations in one CSV file per data set, then, inside a loop, filter out the rows for the relevant simulation and call the optimization function.

If I only include the data for one simulation in the CSV files, the run-time is about 4 seconds. However, when the CSV file contains data for six simulations and I do a run for all the scenarios, the total run-time is 55 seconds, even though I use the same set of data for each simulation.

I suspect the reason for this is that the program spends a lot of time filtering the data for each scenario inside the loop. I have given a simple example below to explain how my code works; the actual program obviously has a lot more data.
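
One way to check this would be to time the filtering step on its own, e.g. with BenchmarkTools on the example tables defined further below:

using BenchmarkTools
# Time just the row selection for one simulation ($ interpolates the global into the benchmark):
@btime filter(i -> i.Simulation == 1, $prices)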

I have tried to parallelize it using Distributed (with addprocs(4)), but I did not have any success with it. Even if the code were parallelized, the maximum reduction in time would be a factor of 4 on a 4-core processor, so the total run-time would still be 55/4 ≈ 13.75 seconds for the six scenarios, which is high compared to the 4-second run time when the CSV files contain only one scenario's data.

To improve the run-time, I reorganised the data for one particular simulation so that it can be read as a matrix using Matrix{Float64}(CSV.read("…")). This reduced the run-time to 0.4 seconds, but I am not sure how to set up a loop to run the optimization if the data for each simulation is stored in a separate CSV file.
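
What I did for the single-simulation test looks roughly like this (the file name is just a placeholder for my actual file):

using CSV, DataFrames
# Read one simulation's cash flows directly into a matrix, no filtering needed:
Assets_CF = Matrix{Float64}(CSV.read("cashflows_sim1.csv", DataFrame))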

I would appreciate it if someone could guide me on:
1. How to improve the run-time, even if this means reorganising the input data, when I have to run the program for all simulations? For example, if separate CSV files for each scenario would be best, how can I make the program read the relevant files and store the values in a matrix?
2. Is it possible to parallelize the operation in question 1?

The code is given below. Data similar to that in the CSV files is constructed using the table function.

using JuliaDB, JuMP, Clp, LinearAlgebra, NPFinancial, Tables, DataFrames, CSV

# Finds the cheapest portfolio of the 5 bonds whose cash flows cover the liability cash flows.
function Optimizer(Prices, Liability_CF, Assets_CF)
    model = Model(Clp.Optimizer)
    @variable(model, x[1:5] >= 0)                       # units of each bond to buy
    @objective(model, Min, sum(Prices[i] * x[i] for i = 1:5))
    @constraint(model, Assets_CF * x .>= Liability_CF)  # cash inflows must cover outflows
    optimize!(model)
    return JuMP.value.(x)
end

# Internal rate of return of the purchased portfolio: the initial outlay (negative)
# followed by the asset cash flows. irr comes from NPFinancial.
function Yield(Bonds, Assets_CF, Prices)
    Assets = Float64.(vcat(-dot(Prices, Bonds), vec(Assets_CF * Bonds)))
    return irr(Assets)
end
# Data is loaded using loadtable function of JuliaDB.
# Each of these three are loaded as separate CSV files
prices = table([1,1,1,1,1,2,2,2,2,2],[99.74, 91.22, 98.71, 103.75, 97.15,100,93,94,105,101];names = [:Simulation, :Prices])
cashFlowLiab = table([1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],[5000, 7000, 7000, 6000, 8000, 7000, 2000, 1000, 555.5, 655.5, 100, 555.5, 655.5,1000,500,1500];names = [:Simulation, :Liability_CFs])
cashFlows = table([1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],[4,4,4,4,4,4,4,104,3.8,3.8,3.8,3.8,3.8,3.8,3.8,98.8],[5,5,5,5,5,5,105,0,4.75,4.75,4.75,4.75,4.75,4.75,99.75,0],[2.5,2.5,2.5,2.5,102.5,0,0,0,2.375,2.375,2.375,2.375,97.375,0,0,0],[5,5,5,5,5,105,0,0,4.75,4.75,4.75,4.75,4.75,99.75,0,0],[4,4,4,4,4,104,0,0,3.8,3.8,3.8,3.8,3.8,98.8,0,0];names = [:Simulation, :T1,:T2,:T3,:T4,:T5])

for nsim = 1:2
    # Filter each table down to the current simulation, then extract the needed columns.
    Prices = (JuliaDB.select(filter(i -> i.Simulation == nsim, prices), :Prices))'
    ACF = JuliaDB.select(filter(i -> i.Simulation == nsim, cashFlows), All(Between(:T1, :T5)))
    Assets_CF = Tables.matrix(ACF)
    Liability_CF = JuliaDB.select(filter(i -> i.Simulation == nsim, cashFlowLiab), :Liability_CFs)
    Bonds = Optimizer(Prices, Liability_CF, Assets_CF)
    CSV.write("noOfBonds.csv", Tables.table(Bonds'), append = nsim != 1, writeheader = false)
    PortfolioYield = Yield(Bonds, Assets_CF, Prices)
    CSV.write("PortfolioReturn.csv", DataFrame(Simulation = nsim, Yield = PortfolioYield), append = nsim != 1)
end

This seems a little strange, as JuliaDB boasts being very fast, especially with CSV files. Before trying to parallelize, I would advise what you already considered: separate the data by simulation. That way you do not execute an operation over the whole dataset (i.e., selecting the rows with the right value in the Simulation field) every time you need to work on a subset of it. It seems to me that you have O(n^2) performance for something that could be O(n) or O(n log n).
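
For example, a rough sketch (using DataFrames here just for illustration; the file name is hypothetical): split the table by simulation once, so each scenario becomes a cheap lookup instead of a full scan:

using CSV, DataFrames
df = CSV.read("all_simulations.csv", DataFrame)
# One pass over the data to split it by :Simulation:
by_sim = Dict(first(g.Simulation) => g for g in groupby(df, :Simulation))
# by_sim[nsim] now returns the rows for simulation nsim without re-filtering the whole table.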

Thanks @Henrique_Becker

I have now tested it by separating the data for each simulation into separate CSV files. This reduced the run time for six scenarios on the whole data set to 1.2 seconds (median, using BenchmarkTools).

I have to use a specific naming convention for the CSV files in order to read them sequentially in a loop.
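
For example (these file names are just the convention I chose):

for nsim = 1:6
    # Each scenario's data lives in its own files, so no filtering is needed:
    Prices = vec(Matrix{Float64}(CSV.read("prices_sim$(nsim).csv", DataFrame)))
    Assets_CF = Matrix{Float64}(CSV.read("cashflows_sim$(nsim).csv", DataFrame))
    Liability_CF = vec(Matrix{Float64}(CSV.read("liabilities_sim$(nsim).csv", DataFrame)))
    Bonds = Optimizer(Prices, Liability_CF, Assets_CF)
    # ... write results as in the original loop ...
end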

I am pretty sure there are better ways of managing the data, and it should be possible to parallelize the processing, especially the loop in the code shared above, as all the runs are independent. What I have difficulty understanding is how the parallel processing will work: the Clp solver I am using is multi-threaded, so wouldn't parallelizing any other operation in the loop conflict with it?

I have tried to use Distributed but couldn't make it work for this example code.
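
Roughly the structure I was trying to get working (simplified; solve_scenario is just a name I made up for the per-scenario work):

using Distributed
addprocs(4)
# Everything the workers call must be defined on every process:
@everywhere using CSV, DataFrames
@everywhere function solve_scenario(nsim)
    Assets_CF = Matrix{Float64}(CSV.read("cashflows_sim$(nsim).csv", DataFrame))
    # ... build Prices and Liability_CF, call Optimizer and Yield here ...
    return nsim
end
# The scenarios are independent, so pmap can distribute them across the workers:
results = pmap(solve_scenario, 1:6)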