Adding a NamedTuple to DataFrame row

bernhard · September 20, 2021, 9:22am

I have a simulation model that iteratively produces NamedTuples which I want to aggregate (summing) to a DataFrame. Only a portion of the columns are updated, as I keep statistics for different simulation periods (years).
I read that I may want to avoid creating intermediate DataFrames, thus I work with a NamedTuple.

The MWE below is a bit lengthy, but my question is simple, how can I improve aggregate_yearly_metrics. It does not seem type stable right now…

using DataFrames 

aggregationTuple = (p_volume=0, s_volume=0, floatvar=0.0, intvar=0)
aggDf = DataFrames.DataFrame()

yearlist = collect(1:2)
#define a dict for colnames lookup
colNamesOfAggDf = Dict{Int,Dict{Symbol,Symbol}}()
    for yr in yearlist
        hv = Dict{Symbol,Symbol}()
        for (symb, val) in pairs(aggregationTuple)
            nm = Symbol(string(symb, "_y", yr))
            hv[symb] = nm            
        end
        colNamesOfAggDf[yr] = deepcopy(hv)
    end
    
n=100 
aggDf[!, :iteration] = collect(1:n)
#define dataframe
    for yr in yearlist
        for (tplsymb,dfsymb) in colNamesOfAggDf[yr]
            aggDf[!, dfsymb] = zero(one(getindex(aggregationTuple, tplsymb)):n)
        end
    end

function aggregate_yearly_metrics!(aggDfRowNumber::Int, colNamesOfAggDf, yr::Int, aggDf::DataFrames.DataFrame, aggregationTuple::NamedTuple)
    for k in keys(aggregationTuple)
        v = getindex(aggregationTuple,k)
        dfcolname = colNamesOfAggDf[yr][k]
        aggregate_yearly_metrics_internal!(aggDf[!,dfcolname],aggDfRowNumber,v)
    end
    return nothing
end

function aggregate_yearly_metrics_internal!(dfcol,aggDfRowNumber,v)
    dfcol[aggDfRowNumber] = dfcol[aggDfRowNumber] + v
    return nothing
end

function sim(n_sims,iterations,colNamesOfAggDf,aggDf)
    for i = 1:iterations
        for j = 1:n_sims
            yr = ifelse(rand()<.5,1,2)
            #generate some data (originally stemming form a simulation model)
                aggregationTuple = (p_volume=1, s_volume=rand(Int), floatvar=rand(), intvar=0)
            #aggregate data 
            aggregate_yearly_metrics!(i, colNamesOfAggDf, yr, aggDf, aggregationTuple)
        end
    end 
    return aggDf 
end

aggDf
sim(20_000,n,colNamesOfAggDf,aggDf)

@code_warntype aggregate_yearly_metrics!(3, colNamesOfAggDf, 1, aggDf, aggregationTuple)

using BenchmarkTools
@benchmark sim(20_000,n,colNamesOfAggDf,aggDf)

bkamins · September 20, 2021, 9:54am

You cannot make this code type-stable easily even if you switched from DataFrame to NamedTuple as a container.

The reason is that dfcolname = colNamesOfAggDf[yr][k] cannot be resolved by the compiler at compile time.

You potentially could do it with a generated function but then you would have to pass yr as Val and also keep colNamesOfAggDf as a structure which has column names embedded in its type. Then you would need to unroll the loop within a generated function. All this would be quite complex to do correctly.

bernhard · September 20, 2021, 9:58am

Thanks @bkamins .The aggregation tuple will always have the same structure though.

What if I create an aggDfYr1 and aggDfYr2. Then during the simulation I would need to efficiently add data to one of the rows of aggDfYr1 or aggDfYr2.

How can I add two DataFrames rows: ?

df = DataFrame(a=[1,2],b=[1.1,3.1])
data_to_add_to_rowX = df[1,:] # this would be something else than the first row.
df[1,:] = df[1,:] .+ data_to_add_to_rowX

bernhard · September 20, 2021, 10:02am

Put differently: I want an in-place aggregation for my DataFrame, iteratively adding data to a certain row.
I guess combine/groupby would do this, but calling these functions (in an inner loop) for my use case seems overkill.

bkamins · September 20, 2021, 10:10am

With DataFrame it will always be type unstable. Use a NamedTuple with STATIC reference to column names (i.e. column names must be explicitly hard-coded in your source code).

bernhard · September 20, 2021, 11:31am

Thanks. I will try and see if I can make this work.
The namedtuple version is about 10 times faster for me.

using DataFrames 

aggregationTuple = (p_volume=0, s_volume=0, floatvar=0.0, intvar=0)
#aggNt = DataFrames.DataFrame()

yearlist = collect(1:2)
#define a dict for colnames lookup
colNamesOfAggDf = Dict{Int,Dict{Symbol,Symbol}}()
    for yr in yearlist
        hv = Dict{Symbol,Symbol}()
        for (symb, val) in pairs(aggregationTuple)
            nm = Symbol(string(symb, "_y", yr))
            hv[symb] = nm            
        end
        colNamesOfAggDf[yr] = deepcopy(hv)
    end
    
n=100 
aggDf = (iteration = collect(1:n), p_volume_y1 = zeros(Int,n),s_volume_y1 = zeros(Int,n),floatvar_y1 = zeros(Float64,n),intvar_y1 = zeros(Int,n), p_volume_y2 = zeros(Int,n),s_volume_y2 = zeros(Int,n),floatvar_y2 = zeros(Float64,n),intvar_y2 = zeros(Int,n))

function aggregate_yearly_metrics!(aggDfRowNumber::Int, colNamesOfAggDf, yr::Int, aggDf, p_volume,s_volume,floatvar,intvar)
    if yr ==1
        aggDf.p_volume_y1[aggDfRowNumber] = aggDf.p_volume_y1[aggDfRowNumber] + p_volume
        aggDf.s_volume_y1[aggDfRowNumber] = aggDf.s_volume_y1[aggDfRowNumber] + s_volume
        aggDf.floatvar_y1[aggDfRowNumber] = aggDf.floatvar_y1[aggDfRowNumber] + floatvar
        aggDf.intvar_y1[aggDfRowNumber] = aggDf.intvar_y1[aggDfRowNumber] + intvar
    else 
        @assert yr ==2
        aggDf.p_volume_y2[aggDfRowNumber] = aggDf.p_volume_y2[aggDfRowNumber] + p_volume
        aggDf.s_volume_y2[aggDfRowNumber] = aggDf.s_volume_y2[aggDfRowNumber] + s_volume
        aggDf.floatvar_y2[aggDfRowNumber] = aggDf.floatvar_y2[aggDfRowNumber] + floatvar
        aggDf.intvar_y2[aggDfRowNumber] = aggDf.intvar_y2[aggDfRowNumber] + intvar
    end
    return nothing
end

function sim(n_sims,iterations,colNamesOfAggDf,aggDf)
    for i = 1:iterations
        for j = 1:n_sims
            yr = ifelse(rand()<.5,1,2)
            #generate some data (originally stemming form a simulation model)
                aggregationTuple = (p_volume=1, s_volume=rand(Int), floatvar=rand(), intvar=0)
            #aggregate data
            aggregate_yearly_metrics!(i, colNamesOfAggDf, yr, aggDf, rand(Int),rand(Int),rand(),rand(Int))
        end
    end 
    return aggDf 
end

aggDf
sim(20_000,n,colNamesOfAggDf,aggDf)

@code_warntype aggregate_yearly_metrics!(3, colNamesOfAggDf, 1, aggDf,  rand(Int),rand(Int),rand(),rand(Int))

using BenchmarkTools
@benchmark sim(20_000,n,colNamesOfAggDf,aggDf)
#36ms for 20k

bkamins · September 20, 2021, 12:04pm

This is expected. DataFrames.jl are efficient if you work on whole columns.

pdeffebach · September 20, 2021, 12:24pm

DataFramesMeta’s @eachrow macro may be helpful here. It works as though you are working with a row, but actually is a for-loop with a function barrier acting on columns, so it’s fast.

bkamins · September 20, 2021, 1:03pm

I always keep forgetting about it . Still the column names need to be known in advance to make it work - right?

pdeffebach · September 20, 2021, 1:47pm

I’m not sure I follow the code in the example, but you can do interpolation with $

julia> using DataFramesMeta

julia> df = DataFrame(a = rand(10), b = rand(10));

julia> c_str = "c";

julia> @eachrow df begin 
           @newcol $c_str::Vector{Float64}
           $c_str = :a + :b
       end
10×3 DataFrame
 Row │ a         b          c         
     │ Float64   Float64    Float64   
─────┼────────────────────────────────
   1 │ 0.47983   0.554476   1.03431
   2 │ 0.682568  0.532267   1.21484
   3 │ 0.712391  0.869989   1.58238
   4 │ 0.020835  0.0237198  0.0445548
   5 │ 0.89675   0.596729   1.49348
   6 │ 0.784122  0.337654   1.12178
   7 │ 0.149992  0.401756   0.551748
   8 │ 0.823635  0.03158    0.855215
   9 │ 0.107912  0.845048   0.95296
  10 │ 0.139757  0.147285   0.287042

bkamins · September 20, 2021, 1:54pm

In the OP @eachrow would be within a function and c_str would be from a vector holding many column identifiers.

pdeffebach · September 20, 2021, 1:59pm

Ah. You cannot do the following, then

cs = ["c1", "c2", "c3"]
@eachrow df begin 
    for c in cs
        @newcol $c::Vector{Float64} = :a + :b
    end
end

In general that approach is impossible given metaprogramming.

You can do the following

julia> function make_cols(df, cs)
           for c in cs
               @eachrow! df begin 
                   @newcol $c::Vector{Float64}
                   $c = :a + :b + rand()
               end
           end
           df
       end;

julia> make_cols(df, ["c1", "c2"])
10×4 DataFrame
 Row │ a         b          c1        c2       
     │ Float64   Float64    Float64   Float64  
─────┼─────────────────────────────────────────
   1 │ 0.47983   0.554476   1.41844   1.70928
   2 │ 0.682568  0.532267   2.19252   1.59863
   3 │ 0.712391  0.869989   1.96144   2.35732
   4 │ 0.020835  0.0237198  0.23945   0.122738
   5 │ 0.89675   0.596729   2.39555   2.35703
   6 │ 0.784122  0.337654   2.06556   1.38156
   7 │ 0.149992  0.401756   0.876493  1.08352
   8 │ 0.823635  0.03158    1.62591   0.991938
   9 │ 0.107912  0.845048   1.51362   1.16172
  10 │ 0.139757  0.147285   0.505221  0.326369

Though this code is hard to reason about I suppose.

Creating new columns inside @eachrow is something I might want to turn my attention to soon.

Topic		Replies	Views
Construct DataFrame From Uneven Named Tuples General Usage dataframes	18	1088	August 20, 2023
Convert NamedTuple with matrix to DataFrame General Usage question , dataframes	7	325	March 9, 2024
Accessing a column value from DataFrameRow allocates Performance dataframes	10	847	March 7, 2022
Initializing a dataframe New to Julia	23	10875	March 15, 2020
How to create `DataFrame` from using NamedTuple keys as column names Data	4	2755	August 11, 2019

Adding a NamedTuple to DataFrame row

Related topics