Simulation Framework with Logging to Tabular Data

DiffEq is set up with the iterable tables interface, which allows its solutions to be converted automatically to any of the tabular formats:

http://docs.juliadiffeq.org/latest/features/io.html#Tabular-Data:-IterableTables-1
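
For concreteness, the pattern on that page looks roughly like this (a sketch; it assumes a recent DifferentialEquations with the f(u, p, t) argument order):

using DifferentialEquations, DataFrames, IterableTables

#With IterableTables loaded, the solution object converts straight to a table
prob = ODEProblem((u, p, t) -> 1.01u, 0.5, (0.0, 1.0))
sol = solve(prob)
df = DataFrame(sol)   #one timestamp column plus the solution values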

You may want something similar.


I’ve read a bit through the IndexedTables docs, and I don’t think this really solves the problem, though I may be mistaken. My understanding is that it just shifts the burden from [ building a selection statement that correctly identifies the row to update ] to [ making sure that rows are inserted in the right order, the one that would match that selection statement ].

For some applications there may be a natural order that makes the correctly-ordered-insertion condition easier to ensure than building the slicing expression, but in general it seems to be six of one, half a dozen of the other.

You know what, scratch that. Stand by for a (hopefully) better attempt to respond to your and @piever’s suggestions.

OK, so @bkamins’s and @piever’s solutions seem to share a conceptual strategy that I’m going to try to outline here.

  1. Treat each instance of Person at each timestep as a distinct object. I’m going to call this a PersonTimestep, although I should note that this is not (necessarily) an actual Julia object. Rather, it is a conceptual object. The key point is that the vector of data that underlies <t, Person> is the information I ultimately care about.

    a. One way to realize a PersonTimestep is to use sorting: ensure that if Person x appears before Person y in a list, then the time of x is <= the time of y.
    b. Another way is to use some sort of containing data structure with a partial order. For example, the DataFrame @bkamins suggests includes a timestamp column, so that the timestamp plus the Person in the associated Vector{Person} at that timestep together constitute a PersonTimestep.
    c. In @piever’s solution, the timestep is handled as part of the computational strategy, but not as part of the data structure, so the final line of his solution constitutes another valid implementation strategy for PersonTimestep.

  2. Whichever realization you choose, collect all those PersonTimestep objects into one giant collection: a sort of giant bucket of PersonTimesteps.

  3. Devise a function (in @piever’s case, the NamedTuple constructor + the use of IterableTables’s support for NamedTuple) that extracts a row’s worth of data from one of the objects.

  4. Map that function onto the giant collection.

  5. Collect the results as a list of rows, i.e., a table.

  6. Sort ex post to get the table into the order you would otherwise have had to construct manually. (In both suggested solutions, the authors use order-preserving collections in step 2, guaranteeing that the results are already in order, hence their lack of explicit instructions to sort. But it occurs to me that so long as each row preserves the correctness of <t, Person(as of t)> when logged, no ex ante sorting is required.)
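
Schematically, steps 2 through 5 might look like this (a sketch; it assumes the Person, update!, and NamedTuple definitions from the MWE further down, with timesteps and all_people in scope):

bucket = Tuple{Int64,Person}[]              #step 2: one giant collection
for t in 1:timesteps, p in all_people
    update!(p)
    push!(bucket, (t, deepcopy(p)))         #snapshot each <t, Person>
end
rows = map(tp -> NamedTuple(tp...), bucket) #steps 3 and 4: extract rows
df = DataFrame(rows)                        #step 5: collect the rows as a table

Because the bucket preserves insertion order, step 6 is a no-op here.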

Now, having outlined that high-level understanding, there’s an obvious problem with implementing it the way I’ve suggested, and to a degree with the way each author suggested: collecting the giant collection in step 2 could be a huge memory burden, especially if (as is often the case) (a) Person has way more data fields, (b) there are many Persons, and/or (c) there are many timesteps.

Not to worry. This problem can obviously be solved by just processing the data in chunks.

I’ve outlined a MWE below.


using DataFrames, IterableTables, NamedTuples
srand(1)

mutable struct Person
    age::Int64
    id::Int64
end

function update!(person::Person)
    person.age+=1
end

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)

#This is essentially redundant but I'm showing it to abstract an idea
function simulation_row(t::Int64,p::Person)
    NamedTuple(t,p)
end

function log!(logstream, t, p)
    #This could be "write this row to a CSV"
    #But it could also be something lighter weight
    push!(logstream, simulation_row(t,p))
end

function run_simulation!(timesteps::Int64, num_people::Int64, logstream)
    all_people = [Person(rand(0:80),i) for i in 1:num_people]
    for t in 1:timesteps
        for n in 1:num_people
            update!(all_people[n])                  
            log!(logstream,t,all_people[n])
            #The key difference here is if Person is some memory-heavy object,
            #I do not need to keep a collection of every <t,Person> ∀ t,
            #but can just extract out the important information to record, 
            #and discard (in this case by mutating) the stale underlying objects.
        end
    end
end

logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])
run_simulation!(5,3,logstream)
julia> logstream
15×3 DataFrames.DataFrame
│ Row │ timestep │ id │ age │
├─────┼──────────┼────┼─────┤
│ 1   │ 1        │ 1  │ 36  │
│ 2   │ 1        │ 2  │ 66  │
│ 3   │ 1        │ 3  │ 78  │
│ 4   │ 2        │ 1  │ 37  │
│ 5   │ 2        │ 2  │ 67  │
│ 6   │ 2        │ 3  │ 79  │
│ 7   │ 3        │ 1  │ 38  │
│ 8   │ 3        │ 2  │ 68  │
│ 9   │ 3        │ 3  │ 80  │
│ 10  │ 4        │ 1  │ 39  │
│ 11  │ 4        │ 2  │ 69  │
│ 12  │ 4        │ 3  │ 81  │
│ 13  │ 5        │ 1  │ 40  │
│ 14  │ 5        │ 2  │ 70  │
│ 15  │ 5        │ 3  │ 82  │

What this has solved is the problem of thinking about concordance of timesteps, ids and the rest of the data. Huzzah!

What it has not solved is the problem of having to manually define this damn logging table and then make sure all the columns are in the right order. That condition is not checked anywhere here; it is represented only by the field order happening to match between:

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)
logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])

This match would be tedious to check manually when, as in most use cases, Person actually has dozens of fields.

Oh well. I think that problem probably is solvable by macro: essentially, unpack fieldnames(Person) and inject them into an @NT call in one place and into something like a @LogDataFrame in another, where that second macro just builds a DataFrame à la logstream above that unpacks the object in the same order. But that will have to wait for another day.
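
For what it’s worth, a plain-function version of the same idea could look like the following sketch (make_logger is a made-up name; it uses the same 0.6-era DataFrames constructor as the logging_df further down this thread):

using DataFrames

#Derive both the empty logging DataFrame and the matching row extractor
#from fieldnames(T), so the field/column order is defined in exactly one place
function make_logger(T::DataType)
    names = fieldnames(T)                        #e.g. [:age, :id]
    types = Type[fieldtype(T, n) for n in names]
    df = DataFrame([Int64; types], [:timestep; names], 0)
    row(t, obj) = Any[t; [getfield(obj, n) for n in names]]
    return df, row
end

#Usage sketch:
#logstream, simulation_row = make_logger(Person)
#push!(logstream, simulation_row(t, person))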

I do something very similar when I work with methods that require simulated data (e.g. nontrivial statistics for indirect inference), but I like to approach this problem differently.

In a nutshell, I organize my code so that I can perform the iteration stepwise and collect after each step, outside that function. AFAICT the DiffEq ecosystem also supports this approach. E.g. (very stylized):

for t in 1:timesteps
    update_states!(all_people)
    save_what_I_need!(results, t, all_people)
end

I generally don’t like to intertwine simulation and summary statistics; it makes unit testing much more difficult. E.g. in your MWE, you can write a test for update! and run_simulation!, but there is a missing level of granularity (a single simulation step for all the data).


I agree. In an example like yours / in more realistic examples, the minimal unit of recorded data (what I called simulation_row, but which should really have a more general name like simulation_recordable_data_unit or something) would be a chunk of the table rather than a single row, corresponding to rows for all the updated observations in a timestep. This seems to generalize well, though, perhaps with some push!ing replaced by some append!ing in places; see the sketch below. (The bigger issue to me is the alignment of columns, which seems like an unnecessary annoyance.)
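
For instance, a chunked variant of the MWE’s logging might look like this sketch (simulation_chunk is a made-up name):

using DataFrames

#Extract one block of rows per timestep and append! it,
#instead of push!ing one row per person
function simulation_chunk(t::Int64, people::Vector{Person})
    DataFrame(timestep = fill(t, length(people)),
              id       = [p.id for p in people],
              age      = [p.age for p in people])
end

#inside the time loop of run_simulation!:
#append!(logstream, simulation_chunk(t, all_people))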

I am not sure I understand: I would just save the objects as is, and worry about processing later. Or, if there are many of them, I would save them in arrays if they are rectangular, or in vectors if ragged. (Perhaps I am missing something important about your problem.)

I just mean that this is annoying:


mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight
    blood_type
    average_resting_heart_rate
    IQ
    ...
    field100
end

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age, name = p.name, ................ field100=p.field100)

logstream = DataFrame(timestep=Int64[], id=Int64[], age=Int64[], name = String[], ................ field100=Field100Type[])

I would allocate a single vector, and push!(results, (t, person)). Or even better, find a way to map or collect.

I know I just popped in, but what is the purpose of having a type in the first place? If you are storing data in DataFrames, I would just use two data frames: one that has all of the information about your person, and another that holds the timeslices / processing results. You can then use join() to merge them together as needed.

The big problem you are trying to solve is an object/relational mapping problem without a framework for doing so. However, Julia does let you get the fields out of a type, so you can use fieldnames() and getfield() to automate a lot of your mapping from a structured to an unstructured object. Here is some example code; with some modification I bet you can make this work for named tuples and appending rows to a data frame.

Main> struct person
       id
       age
       end

Main> ap = [person(1,34), person(2,23), person(3,64)]
3-element Array{person,1}:
 person(1, 34)
 person(2, 23)
 person(3, 64)

Main> df = DataFrame([[getfield(y, x) for y in ap] for x in fieldnames(person)], fieldnames(person))
3×2 DataFrames.DataFrame
│ Row │ id │ age │
├─────┼────┼─────┤
│ 1   │ 1  │ 34  │
│ 2   │ 2  │ 23  │
│ 3   │ 3  │ 64  │

Once 0.7 is out and named tuples are more integrated with the ecosystem, I suspect this sort of thing will be easier. Right now there isn’t really any object that represents a row of a DataFrame, except possibly a user-defined struct. With named tuples you’ll be able to do, e.g., push!(df, nt); see the sketch below. That should make setting up object-relational mapping (ORM) type stuff a bit more straightforward.
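
For instance, a sketch of how this looks with built-in named tuples and a correspondingly updated DataFrames:

using DataFrames

df = DataFrame(timestep = Int[], id = Int[], age = Int[])
nt = (timestep = 1, id = 42, age = 36)
push!(df, nt)   #rows are matched by column name, not position

Because the match is by name, a mismatch between field order and column order stops mattering.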

Not sure if you find it helpful but for deconstructing a custom type into a tabular format, you might want to check out Tabulars.jl.

Pkg.clone("https://github.com/andyferris/Tabulars.jl")
using Tabulars

mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight::Float64
    blood_type::String
    average_resting_heart_rate::Int64
    IQ::Int64
end

function update!(person::Person)
    person.age+=1
end

p = Person(18, 123, "foo", 72.5, "AB", 66, 100);
Main> simresults = Vector{Person}();

Main> for i = 1:10
           update!(p)
           push!(simresults, deepcopy(p))
       end

Main> t = Table(simresults)
7×10 Table:
                              1     2     3     4     5     6     7     8     9     10
                            ┌───────────────────────────────────────────────────────────
 age                        │ 19    20    21    22    23    24    25    26    27    28
 id                         │ 123   123   123   123   123   123   123   123   123   123
 name                       │ foo   foo   foo   foo   foo   foo   foo   foo   foo   foo
 weight                     │ 72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5
 blood_type                 │ AB    AB    AB    AB    AB    AB    AB    AB    AB    AB
 average_resting_heart_rate │ 66    66    66    66    66    66    66    66    66    66
 IQ                         │ 100   100   100   100   100   100   100   100   100   100

Main> t = Table(simresults)'
10×7 Table:
      age  id   name  weight  blood_type  average_resting_heart_rate  IQ
    ┌────────────────────────────────────────────────────────────────────
 1  │ 19   123  foo   72.5    AB          66                          100
 2  │ 20   123  foo   72.5    AB          66                          100
 3  │ 21   123  foo   72.5    AB          66                          100
 4  │ 22   123  foo   72.5    AB          66                          100
 5  │ 23   123  foo   72.5    AB          66                          100
 6  │ 24   123  foo   72.5    AB          66                          100
 7  │ 25   123  foo   72.5    AB          66                          100
 8  │ 26   123  foo   72.5    AB          66                          100
 9  │ 27   123  foo   72.5    AB          66                          100
 10 │ 28   123  foo   72.5    AB          66                          100

If you have your Table, you can easily index into it like this

Main> t[2, l"age"]
20
1 Like

Yes, precisely.

I don’t want to work in DataFrames because this isn’t really the right way to encapsulate the idea. I’m open to revising this view, but my naive understanding is that the best way to design research like this is as follows:

  1. Write your simulation code in the way that best reflects the underlying reality it’s meant to represent, making as few assumptions on structure and implementation as possible. (In my case, this means writing structs that represent my model objects, and not worrying about tabular data at all because the very notion of table is intrinsically irrelevant to the simulation.)
  2. Write your analysis code in the way that most suitably interfaces with packages that implement the statistical techniques that you want to do on the data. (In my case this means structuring data in tabular form because the downstream packages, such as Gadfly, play very nicely with tabular data.)
  3. Rely on translators others have written for going from #1 to #2. In particular, rather than modifying how #1 and #2 are written, write them as purely / correctly for their purpose as possible, and then just translate between them.

If these are correct principles, then doing the simulation on DataFrames is plainly the wrong answer.

However, I think this is a very good idea:

I also imagine a macro could be adapted to mark certain fields as being included / excluded. But you basically solved the column-to-field consistent mapping problem with that line of code, so (a) thanks, and (b) this is encouraging!

This is cool. A couple of questions:

  1. At a high level, how is a Table different from a DataFrame?
  2. Can you inject a column of timesteps in here?
  3. Can you build the Table row-by-row, again so you don’t have to keep a Vector{Person} through thousands of timesteps and people, which just won’t fit in memory?

I think this technique is interesting. Perhaps if someone can edit the to_named_tuple piece this could work:


using DataFrames, NamedTuples

#gentup(struct_T) = NamedTuple{( fieldnames(struct_T)...,),
#                                Tuple{(fieldtype(struct_T,i) for i=1:fieldcount(struct_T))...}}
#
#@generated function to_named_tuple_generated(x)
#    nt = Expr(:quote, gentup(x))
#    tup = Expr(:tuple)
#    for i=1:fieldcount(x)
#        push!(tup.args, :(getfield(x, $i)) )
#    end
#    return :($nt($tup))
#end
#        
#        

#to_named_tuple_naive(p) = (; (v=>getfield(p, v) for v in fieldnames(typeof(p)))...)

        

mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight::Float64
    blood_type::String
    average_resting_heart_rate::Int64
    IQ::Int64
end

function update!(person::Person)
    person.age += 1
    person.weight += rand() - 0.5
end


function logging_df(a_struct)
    #build an empty DataFrame whose column types and names are derived
    #from the struct's fields, with a leading timestep column
    DataFrame(prepend!(map(x -> fieldtype(a_struct, x), fieldnames(a_struct)), [Int64]),
              prepend!(fieldnames(a_struct), [:timestep]),
              0)
end
    
p = Person(18, 123, "foo", 72.5, "AB", 66, 100);
    
function log!(simresults, t, p)
    #prepend! does not work on an (immutable) named tuple,
    #so collect its values into a vector before prepending the timestep
    push!(simresults, vcat(t, collect(to_named_tuple(p))))
end
    
all_people = [p]
    
function run_simulation!(timesteps::Int64, all_people, simresults)
    for t in 1:timesteps
        map(update!, all_people)
        for person in all_people   #log each person, not the whole vector at once
            log!(simresults, t, person)
        end
    end
end

simresults = logging_df(Person)
run_simulation!(10,all_people,simresults)
simresults
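
One possible completion of the missing to_named_tuple piece, assuming Julia 0.7+ where NamedTuple is a builtin (untested against the 0.6 packages above):

#Hypothetical: build a named tuple from any struct's fields, in field order
to_named_tuple(p) = NamedTuple{fieldnames(typeof(p))}(
    ntuple(i -> getfield(p, i), fieldcount(typeof(p))))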

I’m definitely not the best person to talk about Tabulars but I think that the main difference from DataFrames is that it is unopinionated about the underlying format of your data. It tries to ingest whatever you throw at it and turn it into a lightweight Table type that provides a consistent and efficient interface to interact with your data. Ferris presents it nicely in his short JuliaCon presentation.

Note that it is a work in progress. I don’t think there is an easy way to add columns or build it up row by row (yet).

I think the proper way would be to skip the named tuples and directly write a dataframe.

I think I would go for a @generated push!(some_df, foostruct, fieldmap), where the fieldmap is a value-type of a tuple of pairs (i,j) where i is the number of the field in foostruct and j is the number of the corresponding column in df. Note that this is still slow because dataframes store the set of columns as Vector{Any}; one can maybe get rid of the type instability by a type-assert, so your extra cost for convenience is only a lookup of the type-tag (afaik there is no type-assume that does Vector{Any}[i]::type_t without checking the type).

Also note that you would trigger an extra compilation for each dataframe you use (but this is probably cheaper than looking up the col-index in a dict for each field and each object you push).
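
A rough sketch of that idea (hypothetical names, written against a current DataFrames API rather than the 0.6 one used above):

using DataFrames

#Generated push!: fieldorder is a compile-time tuple giving, for each table
#column after the timestep, the index of the corresponding struct field, so
#the row tuple is unrolled with no runtime name lookups
@generated function pushrow!(df::DataFrame, t, obj, ::Val{fieldorder}) where {fieldorder}
    vals = [:(getfield(obj, $i)) for i in fieldorder]
    return :(push!(df, (t, $(vals...))))
end

#Usage sketch, logging (timestep, age, id) from the Person above
#(age is field 1, id is field 2):
#pushrow!(df, t, person, Val((1, 2)))

Each distinct combination of struct type and fieldorder triggers one extra compilation, as noted above.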

Yeah, I don’t actually expect this to be performant at the moment. But I figure that if the column-to-field mapping problem can be solved generically, at least with some slow proof of concept, then once someone finally makes DataFrames fast and figures out how to cut out the NamedTuple middle-man, a performant solution will appear on the horizon.

I’ll try to toy around with this more after I get 0.7 running.

If I get some time this week, I might take a whack at writing a quick library with the method foobar posted in the other thread, plus some other convenience functions to go from a structured format to a number of unstructured formats and back again.

I never did get 0.7 running locally, so couldn’t try this out. But I have some more thoughts on this and on related questions. I expect to post these in the next week or so.

In the meantime, I wonder if anyone is familiar with other work that has been done in this vein. In general, I’m finding that I spend a lot of time thinking through high-level simulation issues when I just want to focus on designing and interpreting my actual simulations. I have to think somebody wiser has already tackled the generic versions of these issues.

That is, the issue of logging is one high-level simulation issue. Another is result interpretation with consistent matching from simulation parameters to results. Generically, what is a way to keep track of the “match” so that whenever I “view” results (in the form of a statistic, DataFrame, plot, whatever), the parameters that generated that result are “carried along”, i.e., displayed along with it? In some sense, what I want is a simple way to say: if I did the work of defining the function

SimulateOnce : Parameters \to Results,

where Parameters is literally any space and Results is also literally any space, and a class of functions

InterpretResults = \{ f : Results \to Interpretations \} ,

where Interpretations again is any space (e.g. if I’m simulating population growth, then this might be \mathbb N for the number of people alive at the end of the simulation, or it might be a visual image of a map of all the living entities in the 7th period. The point is that it’s literally anything),

then I would like

  1. a metafunction

    SimulateRepeatedly : P \subseteq Parameters \mapsto \{ (p,r) : p \in P \text{ and } r = SimulateOnce(p) \}

  2. Some “magical” function that interacts with the metafunction and with all the functions in InterpretResults to nicely give me the set of meta-results

    \{(p,f(r)) : (p,r) \in SimulateRepeatedly(P) \}

in some sort of neat, easily interpretable, and cross-comparable way, i.e., so that I can compare the effects of changing $p$ on the resulting $r$.
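
In code, I imagine these two pieces looking something like this sketch (all names hypothetical):

#1. The metafunction: pair every parameter point with its simulation result
simulate_repeatedly(simulate_once, P) = [(p, simulate_once(p)) for p in P]

#2. The "magical" function: carry the parameters along through any
#   interpretation f : Results -> Interpretations
interpret(f, param_result_pairs) = [(p, f(r)) for (p, r) in param_result_pairs]

#Usage sketch:
#pairs = simulate_repeatedly(SimulateOnce, parameter_grid)
#maps  = interpret(r -> living_entities_map(r, 7), pairs)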

As a concrete example, in the population simulation case, if I’m able to generate a map of living creatures as an interpretative function of a simulation result, then I’d like to be able to quickly see a table with parametrizations on the top and the maps on the bottom of a 2xN grid of charts so I can interpret the effects of parameter changes on this feature of the simulation.

What I’m finding is that I spend more time writing these parameter-interpretation mappings than I do working on the simulation itself, and I find that annoying and wonder if there’s a better way. I’d also accept an informal proof / argument that “there’s no way around this because the generality of things one might want to ask is too high” or something along those lines.

(Does this make any sense? If not, I’ll try to revise it when I’m more lucid.)

Thanks!
