Saving data tables at different timepoints

RobertGregg · October 26, 2022, 4:18am

I have a data-frame like this:

julia> df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
3×3 DataFrame
 Row │ A      B       C       
     │ Int64  Symbol  Tuple…  
─────┼────────────────────────
   1 │     1  x       (1, 3)
   2 │     2  y       (-2, 1)
   3 │     3  z       (4, 4)

that keeps track of important properties in my model. Now say every time my model advances in time this table is modified, either a value is updated or a row is removed/added.

For now the number of columns remain the same as well as their types.

What would be a good way to save this table as it updates over time? In theory I want the ability to retrieve the table at any time step.

I could just save the table at each step but that would lead to huge save files with a lot of redundant data. I think a better way would be to save a list of changes, maybe as a new table with 4 columns (time step, column name, row number, new value)? If anyone’s encountered a similar problem feel free to let me know.

jling · October 26, 2022, 4:49am

RobertGregg · October 26, 2022, 5:01pm

I’ve never heard of dolt, looks interesting! However, I don’t think its quite what I’m looking for (I guess I want something less offline?). Ideally I would have an output from a custom simulate function that results a solution I can query for different time steps like solution(10) or solution[10]. This package looks promising:

But it might not be actively maintained anymore and I would have to figure out how to convert it’s saved data back into a dataframe.

rocco_sprmnt21 · October 29, 2022, 5:53pm

If you don’t have space problems, an easy way could be the following.
The effectiveness of this compared to other modes also depends on the use you have to make of it later

using DataFrames

df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])


mod=DataFrame(t=0,upd=df)

df1=copy(df)
df1[1,2]=:a
push!(mod,(t=1,upd=df1))


df2=copy(df1)
delete!(df2,2)
push!(mod,(t=2,upd=df2))


df3=copy(df2)
push!(df3,(A=4,B=:w,C=(9,9)))
push!(mod,(t=3,upd=df3))

rocco_sprmnt21 · November 3, 2022, 11:27pm


df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
dfh=DataFrame(s=0,id=0,a=missings(NamedTuple,1),r=missings(NamedTuple,1), m=missings(NamedTuple,1))


push!(dfh,(s=1,id=1,a=copy(df[1,:]),r=missing, m=missing))
push!(dfh,(s=2,id=2,a=copy(df[2,:]),r=missing, m=missing))
push!(dfh,(s=3,id=3,a=copy(df[3,:]),r=missing, m=missing))

#modify
function mh(dfh,row,nt)
    step=maximum(dfh.s)+1
    push!(dfh,(s=step,id=row,a=merge(only(dfh[dfh.id.==row,:a]),nt),r=missing, m=only(dfh[dfh.id.==row,:a])))
end

mh(dfh,2,(B=:Y,))

#remove

function rh(dfh,row)
    step=maximum(dfh.s)+1
    push!(dfh,(s=step,id=row,a=missing,r=only(dfh[dfh.id.==row,:a]), m=missing))
end

rh(dfh,1)

#add
function ah(dfh,nt)
    step=maximum(dfh.s)+1
    mid=maximum(dfh.id)+1
    push!(dfh,(s=step,id=mid,a=nt,r=missing, m=missing))
end

ah(dfh,(A=4,B=:w,C=(9,9)))



#-----
step(n)=combine(groupby(dfh[1:n+1,:],:id),:a=>x->[last(x)])

julia> step(6)
5×2 DataFrame
 Row │ id     a_function
     │ Int64  NamedTupl…?
─────┼─────────────────────────────────────
   1 │     0  missing
   2 │     1  missing
   3 │     2  (A = 2, B = :Y, C = (-2, 1))
   4 │     3  (A = 3, B = :z, C = (4, 4))
   5 │     4  (A = 4, B = :w, C = (9, 9))

julia> step(4)
4×2 DataFrame
 Row │ id     a_function
     │ Int64  NamedTupl…?
─────┼─────────────────────────────────────
   1 │     0  missing
   2 │     1  (A = 1, B = :x, C = (1, 3))
   3 │     2  (A = 2, B = :Y, C = (-2, 1))
   4 │     3  (A = 3, B = :z, C = (4, 4))

PS
I ask for clarification on the following error obtained in the first attempt to define the step() function

julia> dfh
7×5 DataFrame
 Row │ s      id     a                             r                            ⋯
     │ Int64  Int64  NamedTup…?                    NamedTup…?                   ⋯
─────┼───────────────────────────────────────────────────────────────────────────
   1 │     0      0  missing                       missing                      ⋯
   2 │     1      1  (A = 1, B = :x, C = (1, 3))   missing
   3 │     2      2  (A = 2, B = :y, C = (-2, 1))  missing
   4 │     3      3  (A = 3, B = :z, C = (4, 4))   missing
   5 │     4      2  (A = 2, B = :Y, C = (-2, 1))  missing                      ⋯
   6 │     5      1  missing                       (A = 1, B = :x, C = (1, 3))   
   7 │     6      4  (A = 4, B = :w, C = (9, 9))   missing
                                                                 1 column omitted

julia> combine(groupby(dfh,:id),:a=>last)
ERROR: ArgumentError: function must return only single-column values, or only multiple-column values

bkamins · November 4, 2022, 8:57am

You get an error, because missing is a single-column value, while named tuple is a multi column value. Mixing single-column and multi-column values is not allowed.

RobertGregg · November 22, 2022, 8:09pm

Unfortunately space will become an issue if the models are large and run for millions of time-steps. My current ad-hoc solution is to use a dictionary of dataframes. Each dictionary key is a column name from the original dataframe (i.e. “A”, “B” and “C”) and the values are new dataframes that look like this:

Row  │ time   position  value
     │ Int64  Int64     Int64
─────┼────────────────────────
   1 │    21        12    500
   2 │    27         3    400
   3 │    52        12    500
   4 │    55         9      0
   5 │    59         9    500
   6 │    71         5    400
   7 │    93        11    500

where “time” is when the change occurred, “position” is where in the column the change occured, and “value” is the new value matching the type of the orginal column.

To get the dataframe back at any given time step, I use this function:

function getState(t, columnDict, initialState)
    
    out = deepcopy(initialState)
    
    #loop through columns and rows in initialState
    for (property, df) in pairs(columnDict)
        for row in eachrow(df)
            if row.time > t
                break
            end
            out[property][row.position] = row.value
        end
    end

    return out
end

It’s a little messy but it works

Topic		Replies	Views
Simulation Framework with Logging to Tabular Data General Usage question	25	2884	June 19, 2018
Questions about saving data to a DataFrame General Usage	1	443	October 11, 2020
Way to store dict of DataFrame tables in a single file? General Usage question	2	1120	September 21, 2018
What is the future of time-indexed dataframes? Data question , dataframes , time-series	3	1348	December 14, 2017
Data structure for convenient access to tabular data General Usage dataframes	5	389	February 13, 2023

Saving data tables at different timepoints

Related topics