OK, so @bkamins’s and @piever’s solutions seem to share a conceptual strategy that I’m going to try to outline here.
- Treat each instance of `Person` at each timestep as a distinct object. I’m going to call this a `PersonTimestep`, although I should note that this is not (necessarily) an actual Julia object. Rather, this is a conceptual object. The key point is that the vector of data that underlies `<t, Person>` is the information that I ultimately care about.
  a. One way to realize a `PersonTimestep` is to use sorting: ensure that if `Person x` appears before `Person y` in a list, then the time of `x` is `<=` the time of `y`.
  b. Another way is to use some sort of containing data structure with a partial order. For example, the `DataFrame` @bkamins suggests includes a `timestamp` column, so that the combination of `timestamp` + `Person` in the associated `Vector{Person}` at the given timestep together constitute a `PersonTimestep`.
  c. In @piever’s solution, the timestep is handled as part of the computational strategy, but not as part of the data structure, so the final line constitutes another valid implementation strategy for `PersonTimestep`.
- In both cases, the collection you end up with is a sort of giant bucket of `PersonTimestep`s. Collect all those objects into one giant collection.
- Devise a function (in @piever’s case, the `NamedTuple` constructor plus the use of `IterableTables`’ support for `NamedTuple`) that extracts a row’s worth of data from one of the objects.
- Map that function onto the giant collection.
- Collect the results as a list of rows, i.e., a table.
- Sort ex post to get the table into the order that you would have had to construct manually. (In both of the suggested solutions, the authors use collections that preserve order in step 2, guaranteeing that the results are already in order; hence their lack of explicit instructions to sort. But it occurs to me that so long as each row preserves the correctness of `<t, Person (as of t)>` when logged, no ex ante sorting is required.)
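Stripped of any particular package, the steps above can be sketched in a few lines. (This is just a stand-in illustration, using the post-0.7 built-in `NamedTuple` syntax rather than the NamedTuples.jl package; the `Person` and `row` definitions here are placeholders, not either author's code.)

```julia
struct Person
    age::Int
    id::Int
end

# Step 3: a function that pulls one row's worth of data out of a <t, Person> pair.
row(t, p::Person) = (timestep = t, id = p.id, age = p.age)

# Steps 1-2: the "giant bucket" of conceptual PersonTimesteps, realized here
# as plain (t, person) tuples.
people = [Person(30, 1), Person(40, 2)]
bucket = [(t, p) for t in 1:2 for p in people]

# Steps 4-5: map the extractor over the bucket and collect a list of rows.
table = [row(t, p) for (t, p) in bucket]

# Step 6: sort ex post by timestep (a no-op here, since the comprehension
# already visited timesteps in order).
sort!(table, by = r -> r.timestep)
```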
Now, having outlined that high-level understanding, there’s an obvious problem with actually implementing it the way I’ve suggested, and to a degree with the way that each author suggested: collecting the giant collection in step 2 could be a huge memory burden, especially if (as is often the case) (a) `Person` has many more data fields, (b) there are many `Person`s, and/or (c) there are many timesteps.
Not to worry. This problem can obviously be solved by just processing the data in chunks.
I’ve outlined a MWE below.
```julia
using DataFrames, IterableTables, NamedTuples

srand(1)

mutable struct Person
    age::Int64
    id::Int64
end

function update!(person::Person)
    person.age += 1
end

NamedTuple(t::Int64, p::Person) = @NT(timestep = t, id = p.id, age = p.age)

# This is essentially redundant but I'm showing it to abstract an idea
function simulation_row(t::Int64, p::Person)
    NamedTuple(t, p)
end

function log!(logstream, t, p)
    # This could be "write this row to a CSV",
    # but it could also be something lighter weight
    push!(logstream, simulation_row(t, p))
end

function run_simulation!(timesteps::Int64, num_people::Int64, logstream)
    all_people = [Person(rand(0:80), i) for i in 1:num_people]
    for t in 1:timesteps
        for n in 1:num_people
            update!(all_people[n])
            log!(logstream, t, all_people[n])
            # The key difference here is that if Person is some memory-heavy object,
            # I do not need to keep a collection of every <t, Person> ∀ t,
            # but can just extract out the important information to record,
            # and discard (in this case by mutating) the stale underlying objects.
        end
    end
end

logstream = DataFrame(timestep = Int64[], id = Int64[], age = Int64[])
run_simulation!(5, 3, logstream)
```
```julia
julia> logstream
15×3 DataFrames.DataFrame
│ Row │ timestep │ id │ age │
├─────┼──────────┼────┼─────┤
│ 1   │ 1        │ 1  │ 36  │
│ 2   │ 1        │ 2  │ 66  │
│ 3   │ 1        │ 3  │ 78  │
│ 4   │ 2        │ 1  │ 37  │
│ 5   │ 2        │ 2  │ 67  │
│ 6   │ 2        │ 3  │ 79  │
│ 7   │ 3        │ 1  │ 38  │
│ 8   │ 3        │ 2  │ 68  │
│ 9   │ 3        │ 3  │ 80  │
│ 10  │ 4        │ 1  │ 39  │
│ 11  │ 4        │ 2  │ 69  │
│ 12  │ 4        │ 3  │ 81  │
│ 13  │ 5        │ 1  │ 40  │
│ 14  │ 5        │ 2  │ 70  │
│ 15  │ 5        │ 3  │ 82  │
```
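As an aside on the "process in chunks" remark above, here is a hypothetical sketch of what a bounded-memory logstream could look like (the `ChunkedLog` type and `flushchunk!` are my invention, not from either suggested solution): rows are buffered and written to a sink every `chunksize` rows. Because `log!` above only relies on `push!`, it would work with this logstream unchanged.

```julia
mutable struct ChunkedLog{T}
    buffer::Vector{T}
    chunksize::Int
    sink::IO
end

ChunkedLog{T}(chunksize::Int, sink::IO) where {T} =
    ChunkedLog{T}(T[], chunksize, sink)

function Base.push!(log::ChunkedLog, row)
    push!(log.buffer, row)
    # Once the buffer hits chunksize, spill it to the sink and empty it,
    # so memory stays bounded no matter how many timesteps run.
    length(log.buffer) >= log.chunksize && flushchunk!(log)
    return log
end

function flushchunk!(log::ChunkedLog)
    for r in log.buffer
        println(log.sink, join(r, ","))  # naive one-line-per-row CSV
    end
    empty!(log.buffer)
end

io = IOBuffer()
log = ChunkedLog{Tuple{Int,Int,Int}}(2, io)
push!(log, (1, 1, 36))
push!(log, (1, 2, 66))  # hitting chunksize triggers a flush to the sink
```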
What this has solved is the problem of thinking about concordance of timesteps, ids, and the rest of the data. Huzzah!
What it has not solved is the problem of having to manually define this damn table to log all the things and then making sure all the columns are in the right order, a condition that here is not checked and that amounts to keeping these two lines in matching order:

```julia
NamedTuple(t::Int64, p::Person) = @NT(timestep = t, id = p.id, age = p.age)
logstream = DataFrame(timestep = Int64[], id = Int64[], age = Int64[])
```

That match would be tedious to check manually when, as in most use cases, `Person` actually has several or dozens of fields.
Oh well. I think that problem probably is solvable by macro, essentially by unpacking `fieldnames(Person)` and then injecting the result into an `@NT` call in one place and into something like a `@LogDataFrame` in another, where that second macro just builds a DataFrame à la `logstream` above that unpacks the object in the same order. But that will have to wait for another day.
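For what it's worth, on post-0.7 Julia a macro may not even be needed: a generic function can unpack `fieldnames` itself and drive both the row constructor and the column schema from the same source, so the two can never drift out of order. A rough sketch under those assumptions (`log_columns` is a hypothetical name of mine, and this uses built-in `NamedTuple`s rather than `@NT`; `Person` is restated so the snippet is self-contained):

```julia
struct Person
    age::Int64
    id::Int64
end

# One row: the field order is taken from the struct itself.
function simulation_row(t, p::T) where {T}
    names = (:timestep, fieldnames(T)...)
    vals  = (t, (getfield(p, f) for f in fieldnames(T))...)
    NamedTuple{names}(vals)
end

# Matching empty column schema, derived from the same fieldnames call;
# something like DataFrame(log_columns(Person)) would then build the logstream.
function log_columns(::Type{T}) where {T}
    names = (:timestep, fieldnames(T)...)
    types = (Int64, fieldtypes(T)...)
    NamedTuple{names}(map(t -> t[], types))  # one empty Vector per column
end
```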