cd $(mktemp -d)
git clone https://github.com/abelsiqueira/TulipaEnergyModel.jl .
git checkout mwe-discourse
julia --project
pkg> instantiate
julia> include("mwe.jl")
This will print instructions and the timings on the "Tiny" data:
Current best:
0.001040 seconds (13.03 k allocations: 1005.875 KiB)
Also decent:
0.001191 seconds (16.63 k allocations: 1.159 MiB)
Older strategy
0.003551 seconds (282.17 k allocations: 12.629 MiB)
Leftjoin strategy
0.023693 seconds (99.80 k allocations: 8.522 MiB, 84.62% gc time)
To change the data, search for input_dir in the file and comment out the line with the "EU" path. The output for me is:
Current best:
99.465536 seconds (4.59 M allocations: 2.132 GiB, 0.17% gc time)
Also decent:
100.750630 seconds (5.93 M allocations: 2.181 GiB, 0.44% gc time)
Older strategy
# Gave up after maybe 15 minutes
The leftjoin strategy simply kills my VSCode or my terminal after ~1 minute.
function add_incoming_term!(df_cons, df_flows)
    # time horizon spanned by both tables
    Tmin = min(minimum(first.(df_flows.time_block)), minimum(first.(df_cons.time_block)))
    Tmax = max(maximum(last.(df_flows.time_block)), maximum(last.(df_cons.time_block)))
    Tspan = Tmin:Tmax
    nT = length(Tspan)
    pre_incoming = Vector{AffExpr}(undef, nT)  # reused buffer: incoming flow per time step
    df_cons.incoming_term .= AffExpr(0.0)
    g_flows = groupby(df_flows, [:rp, :to])
    g_cons = groupby(df_cons, [:rp, :asset])
    for ((rp, to), sdf) in pairs(g_cons)
        haskey(g_flows, (; rp, to)) || continue
        pre_incoming .= AffExpr(0.0)
        # accumulate the flows arriving at this (rp, to) group, per time step
        for row in eachrow(g_flows[(; rp, to)]), t in row.time_block
            pre_incoming[t-Tmin+1] += row.flow * 3.14 * rp
        end
        # sum the accumulated terms over each constraint's time block
        for row in eachrow(sdf)
            row.incoming_term = sum(pre_incoming[t-Tmin+1] for t in row.time_block; init = AffExpr(0.0))
        end
    end
end
After playing around with the data a little, I think this zip(gd1, gd2) method will be more effective than a leftjoin.
One thing I didn't get about OP's data before looking at it is that the leftjoin is many-to-many, and therefore the output of leftjoin is huge. So zip-ing the two grouped data frames together is a very good idea; a toy illustration follows.
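A toy illustration of the blow-up (made-up sizes, not the OP's data): with m matching rows on one side and n on the other for a single key, the join materializes m*n rows.

using DataFrames
a = DataFrame(k = fill(1, 1_000), x = 1:1_000)
b = DataFrame(k = fill(1, 1_000), y = 1:1_000)
nrow(leftjoin(a, b, on = :k))  # 1_000_000 rows from two 1_000-row tables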
As a small piece of advice,
for row in eachrow(sdf)
is going to be slow (due to the type instability of data frames mentioned above), and it's better to replace this with a function taking in vectors directly.
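For illustration, a minimal sketch of such a function barrier (fill_incoming! is a made-up name): the columns are extracted once, so the loop body works on concretely typed vectors.

function fill_incoming!(incoming, time_blocks, pre_incoming, Tmin)
    for i in eachindex(incoming, time_blocks)
        incoming[i] = sum(pre_incoming[t-Tmin+1] for t in time_blocks[i]; init = AffExpr(0.0))
    end
end
# inside the loop over groups, instead of the eachrow loop:
# fill_incoming!(sdf.incoming_term, sdf.time_block, pre_incoming, Tmin)
# (sdf.incoming_term is a view into df_cons, so writes propagate back)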
Note that groupby(df, ...) is very optimized. You can improve this by making df_flows.rp etc. a PooledArray from PooledArrays.jl. This should make the groupby faster, but I don't think that's the bottleneck.
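A sketch of that conversion, assuming the key columns are plain vectors:

using PooledArrays
df_flows.rp = PooledArray(df_flows.rp)
df_flows.to = PooledArray(df_flows.to)
df_cons.rp = PooledArray(df_cons.rp)
df_cons.asset = PooledArray(df_cons.asset)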
df_cons.incoming_term .= AffExpr(0)
cons = sort(df_cons, [:rp, :asset])
flows = sort(df_flows, [:rp, :to])

function ss(frp, fto, crp, cass, ftb, ctb, flow, it)
    idr = [searchsorted(collect(zip(frp, fto)), (a, r)) for (a, r) in zip(crp, cass)]
    id = findall(!isempty, idr)
    for i in id
        len = [length(∩(ctb[i], fl)) * π for fl in ftb[idr[i]]]
        it[i] = len' * flow[idr[i]]
    end
end

ss(flows.rp, flows.to, cons.rp, cons.asset, flows.time_block, cons.time_block, flows.flow, cons.incoming_term)
@Dan, the function has an error because it overwrites the pre_incoming entries for different values of from. I changed it to the code below, and it appears to be right; I will have to check later. Overall this was blazingly fast: it went down from the previous best of 90 s to 0.9 s, a 100x speedup. Thanks a lot for the help.
Here is the modified version:
function add_incoming_term!(df_cons, df_flows)
    Tmin = min(minimum(first.(df_flows.time_block)), minimum(first.(df_cons.time_block)))
    Tmax = max(maximum(last.(df_flows.time_block)), maximum(last.(df_cons.time_block)))
    Tspan = Tmin:Tmax
    nT = length(Tspan)
    from_list = unique(df_flows.from)
    from_lookup = Dict(a => i for (i, a) in enumerate(from_list))
    nA = length(from_list)
    pre_incoming = Matrix{AffExpr}(undef, nT, nA)
    df_cons.incoming_term .= AffExpr(0.0)
    g_flows = groupby(df_flows, [:rp, :to])
    g_cons = groupby(df_cons, [:rp, :asset])
    for ((rp, to), sdf) in pairs(g_cons)
        haskey(g_flows, (; rp, to)) || continue
        pre_incoming .= AffExpr(0.0)
        for row in eachrow(g_flows[(; rp, to)]), t in row.time_block
            j = from_lookup[row.from]
            pre_incoming[t-Tmin+1, j] = row.flow * 3.14 * rp
        end
        for row in eachrow(sdf)
            row.incoming_term =
                sum(sum(pre_incoming[t-Tmin+1, :]) for t in row.time_block; init = AffExpr(0.0))
        end
    end
end
@rocco_sprmnt21, thanks for the suggestion. Very interesting, and very fast for the smaller cases. But for the really big case it slows down to 9 s, vs. 0.9 s for the solution supplied by @Dan. It also allocates 35 GiB according to @time.
I made a minor modification to the call itself. Here is the code:
function rocco_strategy(df_cons, df_flows)
    df_cons.incoming_term_rocco .= AffExpr(0)
    # sort!(df_cons, [:rp, :asset])
    # sort!(df_flows, [:rp, :to])
    function cons_inc_term!(frp, fto, crp, cass, ftb, ctb, flow, it)
        rp_to = collect(zip(frp, fto))
        idr = [searchsorted(rp_to, (a, r)) for (a, r) in zip(crp, cass)]
        id = findall(!isempty, idr)
        for i in id
            len = [length(∩(ctb[i], fl)) * 3.14 * crp[i] for fl in ftb[idr[i]]]
            it[i] = len' * flow[idr[i]]
        end
    end
    cons_inc_term!(
        df_flows.rp,
        df_flows.to,
        df_cons.rp,
        df_cons.asset,
        df_flows.time_block,
        df_cons.time_block,
        df_flows.flow,
        df_cons.incoming_term_rocco,
    )
end
The comparison:
Dan strategy
0.386765 seconds (6.13 M allocations: 379.483 MiB, 24.98% gc time)
Rocco strategy
7.545171 seconds (551.91 k allocations: 35.770 GiB, 3.86% gc time)
Since the performance evaluation depends on the size of the data, could you provide dummy data of the appropriate size?
I'll add a couple of observations on my proposal:
It would be useful to have a searchsorted() method that works on an iterator like zip() without having to do collect(). Does anyone have an idea how the collect could be avoided? (One possibility is sketched below.)
As regards the sorting of the two dataframes, only one sort is necessary for searchsorted: the one on the table in which the searches are carried out.
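One possible way to avoid the collect, sketched here with a hypothetical PairVec type (not an existing library API): a lazy AbstractVector of tuples that searchsorted can binary-search without materializing anything.

struct PairVec{A,B} <: AbstractVector{Tuple{A,B}}
    a::Vector{A}
    b::Vector{B}
end
Base.size(v::PairVec) = size(v.a)
Base.getindex(v::PairVec, i::Int) = (v.a[i], v.b[i])

# assuming flows is already sorted by [:rp, :to] and the columns are plain Vectors:
rp_to = PairVec(flows.rp, flows.to)              # no tuples allocated up front
searchsorted(rp_to, (cons.rp[1], cons.asset[1]))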
This version requires and exploits (I don't know how usefully, in the end) the ordering of both dataframes:
function sorted_cons_inc_term!(frp, fto, crp, cass, ftb, ctb, flow, it)
    rp_to = tuple.(frp, fto)
    base = 0  # flows before this index can no longer match (both tables sorted)
    for i in eachindex(crp)
        key = (crp[i], cass[i])
        # searchsorted* return indices relative to the view, so shift by `base`
        f = base + searchsortedfirst(@view(rp_to[base+1:end]), key)
        l = base + searchsortedlast(@view(rp_to[base+1:end]), key)
        if f <= l
            len = [length(∩(ctb[i], fl)) * π * crp[i] for fl in @view ftb[f:l]]
            it[i] = len' * @view flow[f:l]
            base = f - 1  # keys are non-decreasing, so later searches can start here
        end
    end
end
Unfortunately I can't easily use these instructions.
cd $(mktemp -d)
git clone https://github.com/abelsiqueira/TulipaEnergyModel.jl .
git checkout mwe-discourse
julia --project
pkg> instantiate
julia> include("mwe.jl")
Should they be used in the shell or in the REPL?
I have a PC with a Windows 10 (11?) system; is that OK?
Indeed, the function had an error for a while with overwriting in pre_incoming; I then changed pre_incoming = to pre_incoming += in the code, which seems to achieve the same as the change you suggested. Maybe you tested my code before it had the +=. If so, the += method should be a little faster than the matrix change you suggested, so you can check it out.
(The change is in my post; I just edited it in at some point yesterday.)
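For reference, the one-line change in the inner loop:

pre_incoming[t-Tmin+1] = row.flow * 3.14 * rp   # before: overwrote other `from` contributions
pre_incoming[t-Tmin+1] += row.flow * 3.14 * rp  # after: accumulates them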
No, they assume Linux. But you can also clone directly from VSCode: https://github.com/abelsiqueira/TulipaEnergyModel.jl, and then switch to the mwe-discourse branch.
mwe.jl is the relevant file, and input_dir defines which data is being used.
Indeed, I have the old one - I check the e-mail first, so I might miss the edits.
I will test this change after I implement the problem in the bigger picture. Summing these AffExpr terms individually is slow, so the allocation might be warranted. Let's see.
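If the individual sums do become the bottleneck, a sketch of in-place accumulation with JuMP's add_to_expression! (assuming the entries are JuMP AffExprs, as in the MWE):

acc = AffExpr(0.0)
for t in row.time_block
    add_to_expression!(acc, pre_incoming[t-Tmin+1])  # mutates acc; no allocation per `+`
end
row.incoming_term = acc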
This expression allocates because of slicing pre_incoming instead of using a view. Also, the initial value is probably spurious as the whole array is zeroed. So this should do it:
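# slice replaced by a view; init dropped, since the whole matrix is zeroed anyway
row.incoming_term =
    sum(sum(view(pre_incoming, t-Tmin+1, :)) for t in row.time_block)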
And a last optimization to check would be to switch the axes of the matrix, as summing along rows vs. columns can vary in speed (but this might be negligible here).
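In the context of add_incoming_term! above, a sketch of that axis switch (Julia arrays are column-major, so the summed dimension becomes the first axis):

pre_incoming = Matrix{AffExpr}(undef, nA, nT)  # axes swapped
# fill with pre_incoming[j, t-Tmin+1] = ... instead of pre_incoming[t-Tmin+1, j] = ...
row.incoming_term = sum(sum(view(pre_incoming, :, t-Tmin+1)) for t in row.time_block)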
I hadn't read the details of the problem carefully enough and didn't take advantage of the specifics.
I finally managed to create some tables from the repository clone and did some testing on these.
The scripts I used require the tables to be sorted by the :tb column as well as :rp and :asset.
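For concreteness, the sort I mean (a sketch, assuming the key columns used earlier in the thread):

sort!(df_cons, [:rp, :asset, :tb])
sort!(df_flows, [:rp, :to, :tb])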
I'm not 100% sure I built the same tables you used in the benchmark.
This is the measurement I found, anyway:
julia> @btime inc_term!(dfj,gc,gf)
209.512 ms (3394343 allocations: 218.71 MiB)
PS
I saved the tables created in the repository as CSV. When rereading them, using a DataFrame as a sink, the ranges come back as String15, so I had to rebuild the UnitRange type by hand.
I'm wondering if anyone knows a way to do this using built-in functions.
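For what it's worth, the by-hand rebuild can be a one-liner, assuming the ranges round-trip as strings like "5:12" (parse_range is a made-up helper):

parse_range(s::AbstractString) = UnitRange(parse.(Int, split(s, ':'))...)
df.time_block = parse_range.(df.time_block)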
I haven't checked your latest solution, @rocco_sprmnt21, so I don't know how it compares. It is not easy to test it with the current code, so I will have to park it for now.
With the improvements that we made, we decided that it is good enough for now, and moved on to finish some other more pressing objectives. We will come back to performance later, so I might revisit this in the future.