Simulation Framework with Logging to Tabular Data

I think this technique is interesting. With the to_named_tuple piece fixed up, this could work:


using DataFrames   # NamedTuples are built into Julia from 0.7 on, so no package is needed

# NamedTuple type with the same field names and field types as struct_T
gentup(struct_T) = NamedTuple{fieldnames(struct_T),
                              Tuple{(fieldtype(struct_T, i) for i = 1:fieldcount(struct_T))...}}

# convert any struct to a NamedTuple; generated once per concrete type,
# so the conversion compiles down to a plain field-by-field copy
@generated function to_named_tuple(x)
    nt = gentup(x)
    tup = Expr(:tuple)
    for i = 1:fieldcount(x)
        push!(tup.args, :(getfield(x, $i)))
    end
    return :($nt($tup))
end

# naive (runtime) equivalent, if the @generated version seems too magical:
# to_named_tuple(p) = (; (v => getfield(p, v) for v in fieldnames(typeof(p)))...)

mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight::Float64
    blood_type::String
    average_resting_heart_rate::Int64
    IQ::Int64
end

function update!(person::Person)
    person.age += 1
    person.weight += rand() - 0.5
end


function logging_df(a_struct)
    # one column per field of the struct, plus a leading :timestep column;
    # start with zero rows and push! into it as the simulation runs
    names = (:timestep, fieldnames(a_struct)...)
    types = (Int64, (fieldtype(a_struct, i) for i = 1:fieldcount(a_struct))...)
    DataFrame([n => T[] for (n, T) in zip(names, types)])
end

p = Person(18, 123, "foo", 72.5, "AB", 66, 100);

function log!(simresults, t, people)
    # NamedTuples are immutable, so build each row as a fresh NamedTuple
    # with the timestep spliced in front of the struct's fields
    for person in people
        push!(simresults, (; timestep = t, to_named_tuple(person)...))
    end
end

all_people = [p]

function run_simulation!(timesteps::Int64, all_people, simresults)
    for t in 1:timesteps
        foreach(update!, all_people)      # advance every agent one step
        log!(simresults, t, all_people)   # append one row per agent
    end
end

simresults = logging_df(Person)
run_simulation!(10,all_people,simresults)
simresults

I’m definitely not the best person to talk about Tabulars, but I think the main difference from DataFrames is that it is unopinionated about the underlying format of your data: it tries to ingest whatever you throw at it and turn it into a lightweight Table type that provides a consistent and efficient interface to your data. Ferris presents it nicely in his short JuliaCon presentation.

Note that it is a work in progress. I don’t think there is an easy way to add columns or build it up row by row (yet).

I think the proper way would be to skip the named tuples and directly write a dataframe.
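
For what it’s worth, a minimal sketch of that (log_row! is a made-up name; this assumes the logging_df column layout above, with timestep first): DataFrames can push! a plain tuple as a row, so the NamedTuple step drops out entirely.

    # skip the NamedTuple: push one row as a plain tuple (timestep, fields...)
    log_row!(df, t, p) = push!(df, (t, ntuple(i -> getfield(p, i), fieldcount(typeof(p)))...))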

I think I would go for a @generated push!(some_df, foostruct, fieldmap), where the fieldmap is a value type (a Val) wrapping a tuple of pairs (i, j), with i the number of the field in foostruct and j the number of the corresponding column in the dataframe. Note that this is still slow, because DataFrames stores its set of columns as a Vector{Any}; one can maybe get rid of the type instability with a type-assert, so that the extra cost of the convenience is only a lookup of the type tag (afaik there is no type-assume that does Vector{Any}[i]::type_t without checking the type).

Also note that you would trigger an extra compilation for each dataframe layout you use (but this is probably cheaper than looking up the column index in a dict for each field of each object you push).
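
A rough sketch of what I mean (mapped_push!, personmap, and log_direct! are made-up names; this assumes the Person and logging_df definitions above):

    using DataFrames

    @generated function mapped_push!(df::DataFrame, obj, ::Val{fieldmap}) where {fieldmap}
        block = Expr(:block)
        for (i, j) in fieldmap
            T = fieldtype(obj, i)   # inside the generator, obj is the struct's type
            # df[!, j] aliases the underlying column; the assert removes the
            # Vector{Any} instability at the cost of one type-tag check
            push!(block.args, :(push!(df[!, $j]::Vector{$T}, getfield(obj, $i))))
        end
        push!(block.args, :df)
        return block
    end

    # field k of Person goes to column k + 1 (column 1 is :timestep)
    const personmap = Val(ntuple(k -> (k, k + 1), fieldcount(Person)))

    function log_direct!(simresults, t, p)
        push!(simresults[!, 1]::Vector{Int64}, t)   # timestep column by hand
        mapped_push!(simresults, p, personmap)      # every other column via the map
    end

Pushing into the raw columns like this bypasses the DataFrame’s own row bookkeeping, so every column has to be pushed exactly once per row or the frame ends up in an inconsistent state.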

Yeah, I don’t actually expect this to be performant at the moment. But I figure that if the logging problem can be solved generically, at least with some slow proof of concept, then whenever someone finally gets fast DataFrames, and however one deals with cutting out the NamedTuple middle-man, some performant solution will appear on the horizon.

I’ll try to toy around with this more after I get 0.7 running.

If I get some time this week I might take a whack at writing a quick library with the method foobar posted in the other thread, plus some other convenience functions to go from a structured format to a number of unstructured formats and back again.

I never did get 0.7 running locally, so I couldn’t try this out. But I have some more thoughts on this and on related questions, which I expect to post in the next week or so.

In the meantime, I wonder if anyone is familiar with other work that has been done in this vein. In general I’m finding that I spend a lot of time thinking through high-level simulation issues when I just want to focus on designing and interpreting my actual simulations. I have to think that somebody wiser has already tackled the generic versions of these issues.

That is, the issue of logging is one high-level simulation issue. Another is result interpretation with consistent matching from simulation parameters to results. Generically, what is a way to keep track of the “match” so that whenever I “view” results (in the form of a statistic, DataFrame, plot, whatever) the parameters that generated that result are “carried along”, i.e., displayed along with it? In some sense, what I want is a simple way to say: if I did the work of defining the function

$SimulateOnce : Parameters \to Results$,

where Parameters is literally any space and Results is also literally any space, and a class of functions

$InterpretResults = \{\, f : Results \to Interpretations \,\}$,

where Interpretations is again any space (e.g., if I’m simulating population growth, this might be $\mathbb{N}$ for the number of people alive at the end of the simulation, or it might be a visual image of a map of all the living entities in the 7th period; the point is that it’s literally anything),

then I would like

  1. a metafunction

    $SimulateRepeatedly : P \subseteq Parameters \mapsto \{\, (p, r) : p \in P \text{ and } r = SimulateOnce(p) \,\}$

  2. some “magical” function that interacts with the metafunction and with all the functions in InterpretResults to nicely give me the set of meta-results

    $\{\, (p, f(r)) : (p, r) \in SimulateRepeatedly(P) \,\}$

in some sort of neat, easily interpretable, but also cross-comparable way, i.e., so that I can compare the effects of the changing $p$s on the $r$s.
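
In code, I imagine the skeleton looking something like this (a sketch; simulate_repeatedly and interpret are names I’m inventing, reusing the Person machinery from above):

    # SimulateRepeatedly: carry each parameter point along with its result
    simulate_repeatedly(simulate_once, P) = [(p, simulate_once(p)) for p in P]

    # the "magical" part: apply an interpretation f without losing the match
    interpret(f, runs) = [(p, f(r)) for (p, r) in runs]

    # e.g., sweep the number of timesteps and pull out one interpretation
    runs = simulate_repeatedly([(n_steps = 10,), (n_steps = 50,)]) do params
        people = [Person(18, 123, "foo", 72.5, "AB", 66, 100)]
        results = logging_df(Person)
        run_simulation!(params.n_steps, people, results)
        results
    end
    interpret(df -> df.age[end], runs)   # (parameters, final age) pairs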

As a concrete example, in the population simulation case, if I’m able to generate a map of living creatures as an interpretative function of a simulation result, then I’d like to be able to quickly see a table with parametrizations along the top and the maps along the bottom of a 2×N grid of charts, so I can interpret the effects of parameter changes on this feature of the simulation.

What I’m finding is that I spend more time writing these parameter-interpretation mappings than I do working on the simulation itself, and I find that annoying and wonder if there’s a better way. I’d also accept an informal proof / argument that “there’s no way around this because the generality of things one might want to ask is too high” or something along those lines.

(Does this make any sense? If not, I’ll try to revise it when I’m more lucid.)

Thanks!
