Help improving the speed of a DataFrames operation

@pdeffebach: As in, these ranges are vcat-ed together?

No, it's a Vector{UnitRange{Int}}, i.e., time blocks are ranges, and the cons_time_blocks and flow_time_blocks variables are vectors of time blocks.
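
To make the distinction concrete, here is a minimal sketch. The values are made up for illustration; only the variable names and the element type come from the description above.

```julia
# Hypothetical values, not taken from the actual data set.
cons_time_blocks = [1:3, 4:6]         # two time blocks, each one a range
flow_time_blocks = [1:2, 3:4, 5:6]    # three time blocks

# Both are vectors whose elements are ranges; nothing is concatenated.
typeof(cons_time_blocks)              # Vector{UnitRange{Int64}} on a 64-bit machine

# vcat-ing the ranges would instead flatten them into individual time steps.
flattened = vcat(cons_time_blocks...)
typeof(flattened)                     # Vector{Int64} on a 64-bit machine
```

So each element of cons_time_blocks or flow_time_blocks is still a whole range, which is what the strategies below iterate over.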

Please produce an MWE so I can help debug the join.

Done. I have separated 4 strategies mentioned before:

  • “Current best”, which uses Tables.rows and idx = findall
  • “Also decent”, which uses Tables.rows and the more traditional @view
  • “Older strategy”, which uses a variation of the early strategy that I mentioned
  • “Leftjoin strategy”, which tries leftjoin with @chain

The data is in the GitHub repository abelsiqueira/TulipaEnergyModel.jl, branch mwe-discourse, file mwe.jl. It might be easier to simply clone the repository.
Here are the cloning steps on Linux:

cd $(mktemp -d)
git clone https://github.com/abelsiqueira/TulipaEnergyModel.jl .
git checkout mwe-discourse
julia --project
pkg> instantiate
julia> include("mwe.jl")

This will print instructions and the timings for the “Tiny” data:

Current best:
  0.001040 seconds (13.03 k allocations: 1005.875 KiB)
Also decent:
  0.001191 seconds (16.63 k allocations: 1.159 MiB)
Older strategy
  0.003551 seconds (282.17 k allocations: 12.629 MiB)
Leftjoin strategy
  0.023693 seconds (99.80 k allocations: 8.522 MiB, 84.62% gc time)

You can search for input_dir in the file and comment out the line with the “EU” path. For me, the output is:

Current best:
 99.465536 seconds (4.59 M allocations: 2.132 GiB, 0.17% gc time)
Also decent:
100.750630 seconds (5.93 M allocations: 2.181 GiB, 0.44% gc time)
Older strategy
# Gave up after maybe 15 minutes

The leftjoin strategy simply kills my VSCode or my terminal after ~1 minute.