Saving data tables at different timepoints

I have a data-frame like this:

julia> df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
3×3 DataFrame
 Row │ A      B       C       
     │ Int64  Symbol  Tuple…  
─────┼────────────────────────
   1 │     1  x       (1, 3)
   2 │     2  y       (-2, 1)
   3 │     3  z       (4, 4)

that keeps track of important properties in my model. Now say every time my model advances in time this table is modified, either a value is updated or a row is removed/added.

For now the number of columns remain the same as well as their types.

What would be a good way to save this table as it updates over time? In theory I want the ability to retrieve the table at any time step.

I could just save the table at each step but that would lead to huge save files with a lot of redundant data. I think a better way would be to save a list of changes, maybe as a new table with 4 columns (time step, column name, row number, new value)? If anyone’s encountered a similar problem feel free to let me know.

I’ve never heard of dolt, looks interesting! However, I don’t think its quite what I’m looking for (I guess I want something less offline?). Ideally I would have an output from a custom simulate function that results a solution I can query for different time steps like solution(10) or solution[10]. This package looks promising:

But it might not be actively maintained anymore and I would have to figure out how to convert it’s saved data back into a dataframe.

If you don’t have space problems, an easy way could be the following.
The effectiveness of this compared to other modes also depends on the use you have to make of it later

using DataFrames

df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])


mod=DataFrame(t=0,upd=df)

df1=copy(df)
df1[1,2]=:a
push!(mod,(t=1,upd=df1))


df2=copy(df1)
delete!(df2,2)
push!(mod,(t=2,upd=df2))


df3=copy(df2)
push!(df3,(A=4,B=:w,C=(9,9)))
push!(mod,(t=3,upd=df3))

df = DataFrame(A=[1,2,3], B=[:x,:y,:z], C=[(1,3), (-2,1),(4,4)])
dfh=DataFrame(s=0,id=0,a=missings(NamedTuple,1),r=missings(NamedTuple,1), m=missings(NamedTuple,1))


push!(dfh,(s=1,id=1,a=copy(df[1,:]),r=missing, m=missing))
push!(dfh,(s=2,id=2,a=copy(df[2,:]),r=missing, m=missing))
push!(dfh,(s=3,id=3,a=copy(df[3,:]),r=missing, m=missing))

#modify
function mh(dfh,row,nt)
    step=maximum(dfh.s)+1
    push!(dfh,(s=step,id=row,a=merge(only(dfh[dfh.id.==row,:a]),nt),r=missing, m=only(dfh[dfh.id.==row,:a])))
end

mh(dfh,2,(B=:Y,))

#remove

function rh(dfh,row)
    step=maximum(dfh.s)+1
    push!(dfh,(s=step,id=row,a=missing,r=only(dfh[dfh.id.==row,:a]), m=missing))
end

rh(dfh,1)

#add
function ah(dfh,nt)
    step=maximum(dfh.s)+1
    mid=maximum(dfh.id)+1
    push!(dfh,(s=step,id=mid,a=nt,r=missing, m=missing))
end

ah(dfh,(A=4,B=:w,C=(9,9)))



#-----
step(n)=combine(groupby(dfh[1:n+1,:],:id),:a=>x->[last(x)])

julia> step(6)
5×2 DataFrame
 Row │ id     a_function
     │ Int64  NamedTupl…?
─────┼─────────────────────────────────────
   1 │     0  missing
   2 │     1  missing
   3 │     2  (A = 2, B = :Y, C = (-2, 1))
   4 │     3  (A = 3, B = :z, C = (4, 4))
   5 │     4  (A = 4, B = :w, C = (9, 9))

julia> step(4)
4×2 DataFrame
 Row │ id     a_function
     │ Int64  NamedTupl…?
─────┼─────────────────────────────────────
   1 │     0  missing
   2 │     1  (A = 1, B = :x, C = (1, 3))
   3 │     2  (A = 2, B = :Y, C = (-2, 1))
   4 │     3  (A = 3, B = :z, C = (4, 4))

PS
I ask for clarification on the following error obtained in the first attempt to define the step() function

julia> dfh
7×5 DataFrame
 Row │ s      id     a                             r                            ⋯
     │ Int64  Int64  NamedTup…?                    NamedTup…?                   ⋯
─────┼───────────────────────────────────────────────────────────────────────────
   1 │     0      0  missing                       missing                      ⋯
   2 │     1      1  (A = 1, B = :x, C = (1, 3))   missing
   3 │     2      2  (A = 2, B = :y, C = (-2, 1))  missing
   4 │     3      3  (A = 3, B = :z, C = (4, 4))   missing
   5 │     4      2  (A = 2, B = :Y, C = (-2, 1))  missing                      ⋯
   6 │     5      1  missing                       (A = 1, B = :x, C = (1, 3))   
   7 │     6      4  (A = 4, B = :w, C = (9, 9))   missing
                                                                 1 column omitted

julia> combine(groupby(dfh,:id),:a=>last)
ERROR: ArgumentError: function must return only single-column values, or only multiple-column values

You get an error, because missing is a single-column value, while named tuple is a multi column value. Mixing single-column and multi-column values is not allowed.

Unfortunately space will become an issue if the models are large and run for millions of time-steps. My current ad-hoc solution is to use a dictionary of dataframes. Each dictionary key is a column name from the original dataframe (i.e. “A”, “B” and “C”) and the values are new dataframes that look like this:

Row  │ time   position  value
     │ Int64  Int64     Int64
─────┼────────────────────────
   1 │    21        12    500
   2 │    27         3    400
   3 │    52        12    500
   4 │    55         9      0
   5 │    59         9    500
   6 │    71         5    400
   7 │    93        11    500

where “time” is when the change occurred, “position” is where in the column the change occured, and “value” is the new value matching the type of the orginal column.

To get the dataframe back at any given time step, I use this function:

function getState(t, columnDict, initialState)
    
    out = deepcopy(initialState)
    
    #loop through columns and rows in initialState
    for (property, df) in pairs(columnDict)
        for row in eachrow(df)
            if row.time > t
                break
            end
            out[property][row.position] = row.value
        end
    end

    return out
end

It’s a little messy but it works :man_shrugging: