Simulation Framework with Logging to Tabular Data

OK so @bkamins’s and @piever’s solutions seem to share a conceptual strategy that I’m going to try to outline here.

  1. Treat each instance of Person at each timestep as a distinct object. I’m going to call this a PersonTimestep, although I should note that this is not (necessarily) an actual Julia object. Rather, it is a conceptual object. The key point is that the vector of data that underlies <t, Person> is the information that I ultimately care about.

    a. One way to realize a PersonTimestep is to use sorting: ensure that if Person x appears before Person y in a list, then the time of x is <= the time of y.
    b. Another way is to use some sort of containing data structure with a partial order. For example, the DataFrame @bkamins suggests includes a timestamp column, so that the combination of timestamp + Person in the associated Vector{Person} at the given timestep together constitutes a PersonTimestep.

    c. In @piever’s solution, the timestep is handled as part of the computational strategy rather than as part of the data structure, so the final line of his example constitutes another valid implementation strategy for PersonTimestep.

  2. Collect all of those objects into one giant collection. In both cases, what you end up with is a sort of giant bucket of PersonTimesteps.

  3. Devise a function (in @piever’s case, the NamedTuple constructor + the use of IterableTables’s support for NamedTuple) that extracts a row’s worth of data from one of the objects.

  4. Map that function onto the giant collection.

  5. Collect the results as a list of rows, i.e., a table.

  6. Sort ex post to get the table into the order that you would otherwise have had to construct manually. (Both suggested solutions use collections that preserve order in step 2, guaranteeing that the results are already in order, which is why neither author gives an explicit instruction to sort. But so long as each logged row correctly captures <t, Person (as of t)>, no sorting is required up front.)
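The whole pipeline of steps 2–5 can be sketched in a few lines. This is only an illustration, not either author’s actual code: the toy Agent type and all names are hypothetical, and it uses modern NamedTuple literal syntax in place of @NT.

```julia
# Steps 1-2: one giant bucket of <t, Agent> pairs (the "PersonTimesteps")
struct Agent
    id::Int
    age::Int
end

bucket = [(t, Agent(i, 20 + t)) for t in 1:3 for i in 1:2]

# Step 3: a function that extracts one row's worth of data from one object
row(t, a::Agent) = (timestep = t, id = a.id, age = a.age)

# Steps 4-5: map that function over the bucket, collecting the rows
table = [row(t, a) for (t, a) in bucket]
# `table` is a Vector of NamedTuples: one element per row, already a
# perfectly serviceable table for most sinks.
```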

Having outlined that high-level understanding, there is an obvious problem with implementing it the way I’ve suggested, and to a degree with the way each author suggested: collecting the giant collection in step 2 could be a huge memory burden, especially if, as is often the case, (a) Person has many more data fields, (b) there are many Persons, and/or (c) there are many timesteps.

Not to worry: this problem can be solved by processing the data in chunks.
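For instance, a bounded-memory logger might buffer rows and flush them whenever the buffer fills. This is only a sketch with hypothetical names; the flushfn callback could append rows to a CSV file, write to a database, or anything else.

```julia
# Buffer rows and flush every `chunksize`, so peak memory is bounded by
# the chunk size rather than by timesteps × people.
mutable struct ChunkedLog
    buffer::Vector{Any}
    chunksize::Int
    flushfn::Function          # rows -> persist them somewhere
end

ChunkedLog(chunksize, flushfn) = ChunkedLog(Any[], chunksize, flushfn)

function log_row!(log::ChunkedLog, row)
    push!(log.buffer, row)
    if length(log.buffer) >= log.chunksize
        log.flushfn(log.buffer)   # e.g. append these rows to disk
        empty!(log.buffer)
    end
end

# don't forget the final partial chunk at the end of the simulation
function finish!(log::ChunkedLog)
    isempty(log.buffer) || log.flushfn(log.buffer)
    empty!(log.buffer)
end
```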

I’ve outlined an MWE below.


using DataFrames, IterableTables, NamedTuples
srand(1)

mutable struct Person
    age::Int64
    id::Int64
end

function update!(person::Person)
    person.age += 1
end

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)

#This is essentially redundant but I'm showing it to abstract an idea
function simulation_row(t::Int64,p::Person)
    NamedTuple(t,p)
end

function log!(logstream, t, p)
    #This could be "write this row to a CSV"
    #But it could also be something lighter weight
    push!(logstream, simulation_row(t,p))
end

function run_simulation!(timesteps::Int64, num_people::Int64, logstream)
    all_people = [Person(rand(0:80),i) for i in 1:num_people]
    for t in 1:timesteps
        for n in 1:num_people
            update!(all_people[n])                  
            log!(logstream,t,all_people[n])
            #The key difference here is if Person is some memory-heavy object,
            #I do not need to keep a collection of every <t,Person> ∀ t,
            #but can just extract out the important information to record, 
            #and discard (in this case by mutating) the stale underlying objects.
        end
    end
end

logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])
run_simulation!(5,3,logstream)
julia> logstream
15×3 DataFrames.DataFrame
│ Row │ timestep │ id │ age │
├─────┼──────────┼────┼─────┤
│ 1   │ 1        │ 1  │ 36  │
│ 2   │ 1        │ 2  │ 66  │
│ 3   │ 1        │ 3  │ 78  │
│ 4   │ 2        │ 1  │ 37  │
│ 5   │ 2        │ 2  │ 67  │
│ 6   │ 2        │ 3  │ 79  │
│ 7   │ 3        │ 1  │ 38  │
│ 8   │ 3        │ 2  │ 68  │
│ 9   │ 3        │ 3  │ 80  │
│ 10  │ 4        │ 1  │ 39  │
│ 11  │ 4        │ 2  │ 69  │
│ 12  │ 4        │ 3  │ 81  │
│ 13  │ 5        │ 1  │ 40  │
│ 14  │ 5        │ 2  │ 70  │
│ 15  │ 5        │ 3  │ 82  │

What this has solved is the problem of thinking about concordance of timesteps, ids and the rest of the data. Huzzah!

What it has not solved is the problem of having to manually define this damn table to log all the things and then make sure all the columns are in the right order. That condition is not checked here, and it amounts to requiring an exact match, in name and order, between:

NamedTuple(t::Int64,p::Person) = @NT(timestep=t, id = p.id, age = p.age)
logstream = DataFrame(timestep=Int64[],id=Int64[],age=Int64[])

That match would be tedious to verify by hand when, as in most use cases, Person actually has several or dozens of fields.

Oh well. I think that problem is probably solvable with a macro: essentially, unpack fieldnames(Person) and inject the field names into a @NT call in one place and into something like a @LogDataFrame in another, where that second macro just builds a DataFrame à la logstream above that unpacks the object in the same order. But that will have to wait for another day.
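As a down payment on that idea, here is a reflection-based sketch that gets the same guarantee without a macro (all names are hypothetical, and it uses modern NamedTuple syntax rather than @NT): both the row extractor and the empty column set are derived from fieldnames, so they cannot drift out of sync.

```julia
struct Walker            # stand-in for Person; any plain struct works
    age::Int
    id::Int
end

# one row: timestep first, then every field in declaration order
auto_row(t, p::T) where {T} =
    NamedTuple{(:timestep, fieldnames(T)...)}(
        (t, (getfield(p, f) for f in fieldnames(T))...))

# matching empty, typed columns, keyed by the same field names
function auto_columns(::Type{T}) where {T}
    cols = Dict{Symbol,Vector}(:timestep => Int[])
    for (i, f) in enumerate(fieldnames(T))
        cols[f] = fieldtype(T, i)[]
    end
    return cols
end
```

I’ve sketched the columns as a Dict of typed vectors rather than a DataFrame so as not to lean on any particular DataFrames constructor; feeding that output into a DataFrame, or generating both pieces at compile time with a macro, is the obvious next step.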