Simulation Framework with Logging to Tabular Data

question

#1

Is there any existing framework or set of macros for building mutable container types that can be easily logged into tabular data? Here’s a dumb example:

using DataFrames
srand(1)

mutable struct Person
    age::Int64
    id::Int64
end

function update!(person::Person)
    person.age+=1
end

function run_simulation(timesteps::Int64, num_people::Int64)
    results = DataFrame(person_id  = repeat(1:num_people, inner=timesteps),
                        timestep   = repeat(1:timesteps,  outer=num_people),
                        person_age = repeat([0],          inner=num_people*timesteps))
    all_people = [Person(rand(0:80),i) for i in 1:num_people]
    for n in 1:num_people
        for t in 1:timesteps
            update!(all_people[n])                  
            results[(n-1) * timesteps + t, :person_age] = all_people[n].age
        end
    end
    results
end
julia> run_simulation(5,3)
15Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ person_id β”‚ timestep β”‚ person_age β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1         β”‚ 1        β”‚ 36         β”‚
β”‚ 2   β”‚ 1         β”‚ 2        β”‚ 37         β”‚
β”‚ 3   β”‚ 1         β”‚ 3        β”‚ 38         β”‚
β”‚ 4   β”‚ 1         β”‚ 4        β”‚ 39         β”‚
β”‚ 5   β”‚ 1         β”‚ 5        β”‚ 40         β”‚
β”‚ 6   β”‚ 2         β”‚ 1        β”‚ 66         β”‚
β”‚ 7   β”‚ 2         β”‚ 2        β”‚ 67         β”‚
β”‚ 8   β”‚ 2         β”‚ 3        β”‚ 68         β”‚
β”‚ 9   β”‚ 2         β”‚ 4        β”‚ 69         β”‚
β”‚ 10  β”‚ 2         β”‚ 5        β”‚ 70         β”‚
β”‚ 11  β”‚ 3         β”‚ 1        β”‚ 78         β”‚
β”‚ 12  β”‚ 3         β”‚ 2        β”‚ 79         β”‚
β”‚ 13  β”‚ 3         β”‚ 3        β”‚ 80         β”‚
β”‚ 14  β”‚ 3         β”‚ 4        β”‚ 81         β”‚
β”‚ 15  β”‚ 3         β”‚ 5        β”‚ 82         β”‚

So this works, but for more complicated examples doing the indexing will be annoying. Using Query.jl, DataFramesMeta, or a similar package, one can slice in more elegantly, but even so it still is a lot of work to express the slicing condition, and has to be constructed manually, something like where table.key_variable_one == this_element.key_variable_one && table.key_variable_two==this_element.key_variable_two etc., set all value variables to this_element.value variables.

I would guess others have already encountered this problem, and there is some package that provides a way to put a macro annotation of a record type regarding which are key variables and which value variables, and then automatically log the variable in tabular form correctly?

A solution might look like

    @loggable mutable struct Person
        @idvar person_id::Int64
        @valuevar person_age::Int64
    end

    ...

    results = @logging DataFrame Person timestep=1:timesteps
    for ...
        @logging for timestep ...
            person = all_people[n]
            update!(person)
            @log results person
        end
    end

That is, the framework would know how to unpack the value variables of the Person object into a row of results matched by the key variables and the timestep, where results is setup to have one additional column to track in this case the time step as defined by the phrase timestep=1:timesteps.

This doesn’t seem like all that original an idea, so I’d be surprised if someone hasn’t already come up with a methodology for it?


#2

I’m doing simulations with Julia which requires logging results of each time steps.

Currently, I’m using JLD package for logging, which doesn’t need to unpack values of Julia type struct.

you can do something like this with JLD

x = Person(age, id)
jldopen("..\\output\\mylog.jld", "w") do file
     write(file, "person_x", x)
end

# You can store any Julia type to JLD like this 
x = Dict(timestep_1=>Person(age, id), timestep_2=>Person(age2, id2) timestep_3=>Person(age3, id3))
jldopen("..\\output\\mylog2.jld", "w") do file
     write(file, "timestep_dict", x)
end

I know It’s not really an elegant system to logging a simulation result. I hope someone has better solution :grinning:


#3

This is fine for saving values, but doesn’t really help with getting them into a table, which isβ€”I believeβ€”the chief goal. It’s a useful step, but the data still need to be parsed and tabularized later. Well, I think they do; the point really is to analyze them, plot them, etc., which seems to be far more convenient when they’re recorded as a table as they then can be fed to many packages that expect tabular data. I suppose if the end of the pipeline were modified to treat a set of parallel sequences of objects as a time series of observations directly, and if all the expectations regarding non-missingness of observations (correct parallel alignment etc.) were there, it could work without using the tabular data format…but that still seems the path of greater resistance!


#4

Wondering if there’s been any movement on this in the past 5 months or so. Doing this sort of simulation with the DataFrames package still appears to be brutally wordy and manual…?


#5

Nice question. I can tell you what I typically do - I do not know if it would suit your needs (I am using 0.6.2 notation):

  1. use IndexedTables setting the variables you want to insert-on as index;
  2. store the whole structure as a column and only unpack it later; in this case it would be something like DataFrame(timestamp=1:timesteps, person=Vector{Vector{Person}}(timesteps)); this will be the fastest way to get data into a DataFrame, but you have to unpack after simulation;
  3. Create DataFrame incrementally either by push! row-by-row or vcat appending a DataFrame representing a single time step.

#6

Hope the following helps:

using IterableTables, NamedTuples, CSVFiles, FileIO

mutable struct Person
    age::Int64
    id::Int64
end

v = Person.([1,2,3], [10, 20, 30])
NamedTuple(p::Person) = @NT(age = p.age, id = p.id)
save("test.csv", NamedTuple.(v))

Note that once you load IterableTables you can easily turn a vector of NamedTuples into any tabular structure (i.e. using DataFrames; DataFrame(NamedTuple.(v))).

Ideally, with NamedTuple in Base, there could be a function (say fieldsandvalues) that automatically does what my method NamedTuple(p::Person) does but in general: it take as input an element of some struct type and returns a NamedTuple with fieldnames as keys and the corresponding values as values, so that you don’t have to define it yourself.

To add extra information, you can do so when building the NamedTuple, for example:

NamedTuple(p::Person, t) = @NT(age = p.age, id = p.id, timestep = p)


#7

DiffEq is setup with the iterable tables interface with allows its solution to automatically convert to any of the tabular formats:

http://docs.juliadiffeq.org/latest/features/io.html#Tabular-Data:-IterableTables-1

You may want something similar.


#8

I’ve read a bit through the IndexedTables docs and I don’t think this really solves the problem. I may be mistaken? If I am not mistaken, my understanding is that this just shifts the burden from [ building a selection statement that correctly identifies the row to update ] to [ making sure that rows are inserted in the right order, the one that would match that selection statement ].

For some applications, there may be a natural order that makes the correctly-ordered-insertion condition easier to ensure than it is to build the slicing expression, but in general, it just seems to be six of one, half-dozen of the other?


#9

You know what, scratch that. Stand by for a (hopefully) better attempt to respond to your and @piever’s suggestions.


#10

OK so @bkamins’s and @piever’s solution seem to share a conceptual strategy that I’m going to try to outline here.

  1. Treat each instance of Person at each timestep as a distinct object. I’m going to call this a PersonTimestep although I should note that this is not (necessarily) an actual Julia object. Rather; this is a conceptual object. The key point is that the vector of data that underlies <t, Person> is the information that I ultimately care about.

    a. One way to realize a PersonTimestep is to use sorting: ensure that if Person x appears before Person y in a list, then the time of x is <= the time of y.
    b. Another way is to use some sort of containing data structure with partial order. For example the DataFrame @bkamins suggests

    includes a timestamp column, so that the combination of timestamp + Person in the associated Vector{Person} at the given timestep together constitute a PersonTimestep.

    b. In @piever’s solution, the timestep is handled as part of the computational strategy, but not as part of the data structure. So the final line

    constitutes another valid implementation strategy for PersonTimestep.

  2. In both cases, the collection you end up with is a sort of giant bucket of PersonTimesteps. Collect all those objects into one giant collection.

  3. Devise a function (in @piever’s case, the NamedTuple constructor + the use of IterableTables's support for NamedTuple) that extracts a row’s worth of data from one of the objects.

  4. Map that function onto the giant collection.

  5. Collect the results as a list of rows, i.e., a table.

  6. Sort ex-post to get the table into the order that you would have had to construct manually. (In both the suggested solutions, the authors use collections that preserve order in step 2, guaranteeing that the results are already in order, hence their lack of explicit instructions to sort. But it occurs to me that so long as each row preserves the correctness of <t,Person(as of t)>, when logged, then no ex ante sorting is required.)

Now having outlined that high-level understanding, there’s an obvious problem with actually implementing it the way I’ve suggested, and to a degree with the way that each author suggested, namely that collecting the giant collection in step 2 could be a huge memory burden, especially if (as is often the case) (a) Person has way more data fields, (b) there are many Persons, and/or Β© there are many timesteps.

Not to worry. This problem can obviously be solved by just processing the data in chunks.

I’ve outlined a MWE below.


using DataFrames, IterableTables, NamedTuples
srand(1)

mutable struct Person
    age::Int64
    id::Int64
end

function update!(person::Person)
    person.age+=1
end

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)

#This is essentially redundant but I'm showing it to abstract an idea
function simulation_row(t::Int64,p::Person)
    NamedTuple(t,p)
end

function log!(logstream, t, p)
    #This could be "write this row to a CSV"
    #But it could also be something lighter weight
    push!(logstream, simulation_row(t,p))
end

function run_simulation!(timesteps::Int64, num_people::Int64, logstream)
    all_people = [Person(rand(0:80),i) for i in 1:num_people]
    for t in 1:timesteps
        for n in 1:num_people
            update!(all_people[n])                  
            log!(logstream,t,all_people[n])
            #The key difference here is if Person is some memory-heavy object,
            #I do not need to keep a collection of every <t,Person> βˆ€ t,
            #but can just extract out the important information to record, 
            #and discard (in this case by mutating) the stale underlying objects.
        end
    end
end

logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])
run_simulation!(5,3,logstream)
julia> logstream
15Γ—3 DataFrames.DataFrame
β”‚ Row β”‚ timestep β”‚ id β”‚ age β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1        β”‚ 1  β”‚ 36  β”‚
β”‚ 2   β”‚ 1        β”‚ 2  β”‚ 66  β”‚
β”‚ 3   β”‚ 1        β”‚ 3  β”‚ 78  β”‚
β”‚ 4   β”‚ 2        β”‚ 1  β”‚ 37  β”‚
β”‚ 5   β”‚ 2        β”‚ 2  β”‚ 67  β”‚
β”‚ 6   β”‚ 2        β”‚ 3  β”‚ 79  β”‚
β”‚ 7   β”‚ 3        β”‚ 1  β”‚ 38  β”‚
β”‚ 8   β”‚ 3        β”‚ 2  β”‚ 68  β”‚
β”‚ 9   β”‚ 3        β”‚ 3  β”‚ 80  β”‚
β”‚ 10  β”‚ 4        β”‚ 1  β”‚ 39  β”‚
β”‚ 11  β”‚ 4        β”‚ 2  β”‚ 69  β”‚
β”‚ 12  β”‚ 4        β”‚ 3  β”‚ 81  β”‚
β”‚ 13  β”‚ 5        β”‚ 1  β”‚ 40  β”‚
β”‚ 14  β”‚ 5        β”‚ 2  β”‚ 70  β”‚
β”‚ 15  β”‚ 5        β”‚ 3  β”‚ 82  β”‚

What this has solved is the problem of thinking about concordance of timesteps, ids and the rest of the data. Huzzah!

What it has not solved is the problem of having to manually define this damn table to log all the things and then making sure all the columns are in the right order, a condition that here is not checked and that is represented by the correct match in order essentially among:

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)
logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])

, which match would be tedious to check manually when, as in most use cases, Person actually has several or dozens of fields.

Oh well. I think that problem probably is solvable by macro, essentially by unpacking fieldnames(Person) and then injecting them into a @NT call in one place and into something like a @LogDataFrame in another, where that second macro just builds a dataframe Γ  la logstream above, that unpacks the object in the same order, but that will have to wait for another day.


Get fieldnames and values of `struct` as NamedTuple
#11

I do something very similar when I work with methods that require simulated data (eg nontrivial statistics for indirect inference), but I like to approach this problem differently.

In a nutshell, I organize my code so that I can perform the iteration stepwise, and collect after each step outside that function. AFAICT the DiffEq ecosystem also supports this approach. Eg (very stylized):

for t in 1:timesteps
    update_states!(all_people)
    save_what_I_need!(results, t, all_people)
end

I generally don’t like to intertwine simulation and summary statistics, it makes unit testing much more difficult. Eg in your MWE, you can write a test for update! and run_simulation!, but there is a missing level of granularity (a single simulation step for all the data).


#12

I agree. In an example like yours / in more realistic examples, the minimal unit of recorded data (what I called simulation_row but which should really have a more general name like simulation_recordable_data_unit or something) would be a chunk of a table, not a row of the table, corresponding to rows for all the updated observations in a timestep. This seems to generalize well though–perhaps with some push!ing replaced by some append!ing in some places. (The bigger issue to me is the alignment of columns, which seems like an unnecessary annoyance.)


#13

I am not sure I understand β€” I would just save the objects as is, and worry about processing later. Or if there are many and they are rectangular, I would save in arrays, if ragged, vectors. (Perhaps I am missing something important about your problem).


#14

I just mean that this is annoying:


mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight
    blood_type
    average_resting_heart_rate
    IQ
    ...
    field100
end

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age, name = p.name, ................ field100=p.field100)

logstream = DataFrame(timestep=Int64[], id=Int64[], age=Int64[], name = String[], ................ field100=Field100Type[])

#15

I would allocate a single vector, and push!(results, (t, person)). Or even better, find a way to map or collect.


#16

I know i just popped in but what is the purpose for having a type in the first place? If you are storing data in DataFrames I would just use two data frames. One that has all of the information about your person. And the other that is the timeslices / processing results. You can then use join() to merge them together as needed.

The big problem you are trying to solve is an object/relational mapping problem without a framework for doing so. However Julia does let you get the fields out of the type so you can use fieldnames() and getfield() to automate a lot of your mapping from a structured to an unstructured object. Here is some example code, with some modification I bet you can make this work for named tuples and appending rows to a data frame.

Main> struct person
       id
       age
       end

Main> ap = [person(1,34), person(2,23), person(3,64)]
3-element Array{person,1}:
 person(1, 34)
 person(2, 23)
 person(3, 64)

Main> df = DataFrame([[getfield(y, x) for y in ap] for x in fieldnames(p)], fieldnames(p))
3Γ—2 DataFrames.DataFrame
β”‚ Row β”‚ id β”‚ age β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 1  β”‚ 34  β”‚
β”‚ 2   β”‚ 2  β”‚ 23  β”‚
β”‚ 3   β”‚ 3  β”‚ 64  β”‚

#17

Once 0.7 is out and named tuples are more integrated with the ecosystem I suspect this sort of thing will be easier. Right now there isn’t really any object that represents a row of a dataframe, except for possibly a user defined struct. With named tuples you’ll be able to do, e.g. push!(df, nt). That should make setting up object relational mapping (ORM) type stuff a bit more straightforward.


#18

Not sure if you find it helpful but for deconstructing a custom type into a tabular format, you might want to check out Tabulars.jl.

Pkg.clone("https://github.com/andyferris/Tabulars.jl")
using Tabulars

mutable struct Person
    age::Int64
    id::Int64
    name::String
    weight::Float64
    blood_type::String
    average_resting_heart_rate::Int64
    IQ::Int64
end

function update!(person::Person)
    person.age+=1
end

p = Person(18, 123, "foo", 72.5, "AB", 66, 100);
Main> simresults = Vector{Person}();

Main> for i = 1:10
           update!(p)
           push!(simresults, deepcopy(p))
       end

Main> t = Table(simresults)
7Γ—10 Table:
                              1     2     3     4     5     6     7     8     9     10
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 age                        β”‚ 19    20    21    22    23    24    25    26    27    28
 id                         β”‚ 123   123   123   123   123   123   123   123   123   123
 name                       β”‚ foo   foo   foo   foo   foo   foo   foo   foo   foo   foo
 weight                     β”‚ 72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5  72.5
 blood_type                 β”‚ AB    AB    AB    AB    AB    AB    AB    AB    AB    AB
 average_resting_heart_rate β”‚ 66    66    66    66    66    66    66    66    66    66
 IQ                         β”‚ 100   100   100   100   100   100   100   100   100   100

Main> t = Table(simresults)'
10Γ—7 Table:
      age  id   name  weight  blood_type  average_resting_heart_rate  IQ
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
 1  β”‚ 19   123  foo   72.5    AB          66                          100
 2  β”‚ 20   123  foo   72.5    AB          66                          100
 3  β”‚ 21   123  foo   72.5    AB          66                          100
 4  β”‚ 22   123  foo   72.5    AB          66                          100
 5  β”‚ 23   123  foo   72.5    AB          66                          100
 6  β”‚ 24   123  foo   72.5    AB          66                          100
 7  β”‚ 25   123  foo   72.5    AB          66                          100
 8  β”‚ 26   123  foo   72.5    AB          66                          100
 9  β”‚ 27   123  foo   72.5    AB          66                          100
 10 β”‚ 28   123  foo   72.5    AB          66                          100

If you have your Table, you can easily index into it like this

Main> t[2, l"age"]
20

#19

Yes, precisely.

I don’t want to work in DataFrames because this isn’t really the right way to encapsulate the idea. I’m open to revising this view, but my naive understanding is that the best way to design research like this is as follows:

  1. Write your simulation code in the way that best reflects the underlying reality it’s meant to represent, making as few assumptions on structure and implementation as possible. (In my case, this means writing structs that represent my model objects, and not worrying about tabular data at all because the very notion of table is intrinsically irrelevant to the simulation.)
  2. Write your analysis code in the way that most suitably interfaces with packages that implement the statistical techniques that you want to do on the data. (In my case this means structuring data in tabular form because the downstream packages, such as Gadfly, play very nicely with tabular data.)
  3. Rely on translators others have written for going from #1 to #2–in particular rather than modifying how #1 and #2 are completed, write them as purely / correctly for their purpose as possible, and then just go between them.

If these are correct principles, then doing the simulation on DataFrames is plainly the wrong answer.

However, I think this is a very good idea:

I also imagine a macro could be adapted to mark certain fields as being included / excluded. But you basically solved the column-to-field consistent mapping problem with that line of code, so (a) thanks, and (b) this is encouraging!


#20

This is cool. A couple of questions:

  1. At a high level, how is a Table different from a DataFrame?
  2. Can you inject a column of timesteps in here?
  3. Can you build the Table row-by-row, again so you don’t have to keep a Vector{Person} through thousands of timesteps and people, which just won’t fit in memory?