Adding a NamedTuple to DataFrame row

I have a simulation model that iteratively produces NamedTuples which I want to aggregate (summing) to a DataFrame. Only a portion of the columns are updated, as I keep statistics for different simulation periods (years).
I read that I may want to avoid creating intermediate DataFrames, thus I work with a NamedTuple.

The MWE below is a bit lengthy, but my question is simple, how can I improve aggregate_yearly_metrics. It does not seem type stable right now…

using DataFrames 

aggregationTuple = (p_volume=0, s_volume=0, floatvar=0.0, intvar=0)
aggDf = DataFrames.DataFrame()

yearlist = collect(1:2)
#define a dict for colnames lookup
colNamesOfAggDf = Dict{Int,Dict{Symbol,Symbol}}()
    for yr in yearlist
        hv = Dict{Symbol,Symbol}()
        for (symb, val) in pairs(aggregationTuple)
            nm = Symbol(string(symb, "_y", yr))
            hv[symb] = nm            
        end
        colNamesOfAggDf[yr] = deepcopy(hv)
    end
    
n=100 
aggDf[!, :iteration] = collect(1:n)
#define dataframe
    for yr in yearlist
        for (tplsymb,dfsymb) in colNamesOfAggDf[yr]
            aggDf[!, dfsymb] = zero(one(getindex(aggregationTuple, tplsymb)):n)
        end
    end

function aggregate_yearly_metrics!(aggDfRowNumber::Int, colNamesOfAggDf, yr::Int, aggDf::DataFrames.DataFrame, aggregationTuple::NamedTuple)
    for k in keys(aggregationTuple)
        v = getindex(aggregationTuple,k)
        dfcolname = colNamesOfAggDf[yr][k]
        aggregate_yearly_metrics_internal!(aggDf[!,dfcolname],aggDfRowNumber,v)
    end
    return nothing
end

function aggregate_yearly_metrics_internal!(dfcol,aggDfRowNumber,v)
    dfcol[aggDfRowNumber] = dfcol[aggDfRowNumber] + v
    return nothing
end

function sim(n_sims,iterations,colNamesOfAggDf,aggDf)
    for i = 1:iterations
        for j = 1:n_sims
            yr = ifelse(rand()<.5,1,2)
            #generate some data (originally stemming form a simulation model)
                aggregationTuple = (p_volume=1, s_volume=rand(Int), floatvar=rand(), intvar=0)
            #aggregate data 
            aggregate_yearly_metrics!(i, colNamesOfAggDf, yr, aggDf, aggregationTuple)
        end
    end 
    return aggDf 
end

aggDf
sim(20_000,n,colNamesOfAggDf,aggDf)

@code_warntype aggregate_yearly_metrics!(3, colNamesOfAggDf, 1, aggDf, aggregationTuple)

using BenchmarkTools
@benchmark sim(20_000,n,colNamesOfAggDf,aggDf)



You cannot make this code type-stable easily even if you switched from DataFrame to NamedTuple as a container.

The reason is that dfcolname = colNamesOfAggDf[yr][k] cannot be resolved by the compiler at compile time.

You potentially could do it with a generated function but then you would have to pass yr as Val and also keep colNamesOfAggDf as a structure which has column names embedded in its type. Then you would need to unroll the loop within a generated function. All this would be quite complex to do correctly.

1 Like

Thanks @bkamins .The aggregation tuple will always have the same structure though.

What if I create an aggDfYr1 and aggDfYr2. Then during the simulation I would need to efficiently add data to one of the rows of aggDfYr1 or aggDfYr2.

How can I add two DataFrames rows: ?

df = DataFrame(a=[1,2],b=[1.1,3.1])
data_to_add_to_rowX = df[1,:] # this would be something else than the first row.
df[1,:] = df[1,:] .+ data_to_add_to_rowX

Put differently: I want an in-place aggregation for my DataFrame, iteratively adding data to a certain row.
I guess combine/groupby would do this, but calling these functions (in an inner loop) for my use case seems overkill.

With DataFrame it will always be type unstable. Use a NamedTuple with STATIC reference to column names (i.e. column names must be explicitly hard-coded in your source code).

1 Like

Thanks. I will try and see if I can make this work.
The namedtuple version is about 10 times faster for me.

using DataFrames 

aggregationTuple = (p_volume=0, s_volume=0, floatvar=0.0, intvar=0)
#aggNt = DataFrames.DataFrame()

yearlist = collect(1:2)
#define a dict for colnames lookup
colNamesOfAggDf = Dict{Int,Dict{Symbol,Symbol}}()
    for yr in yearlist
        hv = Dict{Symbol,Symbol}()
        for (symb, val) in pairs(aggregationTuple)
            nm = Symbol(string(symb, "_y", yr))
            hv[symb] = nm            
        end
        colNamesOfAggDf[yr] = deepcopy(hv)
    end
    
n=100 
aggDf = (iteration = collect(1:n), p_volume_y1 = zeros(Int,n),s_volume_y1 = zeros(Int,n),floatvar_y1 = zeros(Float64,n),intvar_y1 = zeros(Int,n), p_volume_y2 = zeros(Int,n),s_volume_y2 = zeros(Int,n),floatvar_y2 = zeros(Float64,n),intvar_y2 = zeros(Int,n))

function aggregate_yearly_metrics!(aggDfRowNumber::Int, colNamesOfAggDf, yr::Int, aggDf, p_volume,s_volume,floatvar,intvar)
    if yr ==1
        aggDf.p_volume_y1[aggDfRowNumber] = aggDf.p_volume_y1[aggDfRowNumber] + p_volume
        aggDf.s_volume_y1[aggDfRowNumber] = aggDf.s_volume_y1[aggDfRowNumber] + s_volume
        aggDf.floatvar_y1[aggDfRowNumber] = aggDf.floatvar_y1[aggDfRowNumber] + floatvar
        aggDf.intvar_y1[aggDfRowNumber] = aggDf.intvar_y1[aggDfRowNumber] + intvar
    else 
        @assert yr ==2
        aggDf.p_volume_y2[aggDfRowNumber] = aggDf.p_volume_y2[aggDfRowNumber] + p_volume
        aggDf.s_volume_y2[aggDfRowNumber] = aggDf.s_volume_y2[aggDfRowNumber] + s_volume
        aggDf.floatvar_y2[aggDfRowNumber] = aggDf.floatvar_y2[aggDfRowNumber] + floatvar
        aggDf.intvar_y2[aggDfRowNumber] = aggDf.intvar_y2[aggDfRowNumber] + intvar
    end
    return nothing
end

function sim(n_sims,iterations,colNamesOfAggDf,aggDf)
    for i = 1:iterations
        for j = 1:n_sims
            yr = ifelse(rand()<.5,1,2)
            #generate some data (originally stemming form a simulation model)
                aggregationTuple = (p_volume=1, s_volume=rand(Int), floatvar=rand(), intvar=0)
            #aggregate data
            aggregate_yearly_metrics!(i, colNamesOfAggDf, yr, aggDf, rand(Int),rand(Int),rand(),rand(Int))
        end
    end 
    return aggDf 
end

aggDf
sim(20_000,n,colNamesOfAggDf,aggDf)

@code_warntype aggregate_yearly_metrics!(3, colNamesOfAggDf, 1, aggDf,  rand(Int),rand(Int),rand(),rand(Int))

using BenchmarkTools
@benchmark sim(20_000,n,colNamesOfAggDf,aggDf)
#36ms for 20k

This is expected. DataFrames.jl are efficient if you work on whole columns.

1 Like

DataFramesMeta’s @eachrow macro may be helpful here. It works as though you are working with a row, but actually is a for-loop with a function barrier acting on columns, so it’s fast.

1 Like

I always keep forgetting about it :grinning_face_with_smiling_eyes:. Still the column names need to be known in advance to make it work - right?

I’m not sure I follow the code in the example, but you can do interpolation with $

julia> using DataFramesMeta

julia> df = DataFrame(a = rand(10), b = rand(10));

julia> c_str = "c";

julia> @eachrow df begin 
           @newcol $c_str::Vector{Float64}
           $c_str = :a + :b
       end
10Γ—3 DataFrame
 Row β”‚ a         b          c         
     β”‚ Float64   Float64    Float64   
─────┼────────────────────────────────
   1 β”‚ 0.47983   0.554476   1.03431
   2 β”‚ 0.682568  0.532267   1.21484
   3 β”‚ 0.712391  0.869989   1.58238
   4 β”‚ 0.020835  0.0237198  0.0445548
   5 β”‚ 0.89675   0.596729   1.49348
   6 β”‚ 0.784122  0.337654   1.12178
   7 β”‚ 0.149992  0.401756   0.551748
   8 β”‚ 0.823635  0.03158    0.855215
   9 β”‚ 0.107912  0.845048   0.95296
  10 β”‚ 0.139757  0.147285   0.287042

In the OP @eachrow would be within a function and c_str would be from a vector holding many column identifiers.

Ah. You cannot do the following, then

cs = ["c1", "c2", "c3"]
@eachrow df begin 
    for c in cs
        @newcol $c::Vector{Float64} = :a + :b
    end
end

In general that approach is impossible given metaprogramming.

You can do the following

julia> function make_cols(df, cs)
           for c in cs
               @eachrow! df begin 
                   @newcol $c::Vector{Float64}
                   $c = :a + :b + rand()
               end
           end
           df
       end;

julia> make_cols(df, ["c1", "c2"])
10Γ—4 DataFrame
 Row β”‚ a         b          c1        c2       
     β”‚ Float64   Float64    Float64   Float64  
─────┼─────────────────────────────────────────
   1 β”‚ 0.47983   0.554476   1.41844   1.70928
   2 β”‚ 0.682568  0.532267   2.19252   1.59863
   3 β”‚ 0.712391  0.869989   1.96144   2.35732
   4 β”‚ 0.020835  0.0237198  0.23945   0.122738
   5 β”‚ 0.89675   0.596729   2.39555   2.35703
   6 β”‚ 0.784122  0.337654   2.06556   1.38156
   7 β”‚ 0.149992  0.401756   0.876493  1.08352
   8 β”‚ 0.823635  0.03158    1.62591   0.991938
   9 β”‚ 0.107912  0.845048   1.51362   1.16172
  10 β”‚ 0.139757  0.147285   0.505221  0.326369

Though this code is hard to reason about I suppose.

Creating new columns inside @eachrow is something I might want to turn my attention to soon.

1 Like