What is the right way to store/record the results of this nested simulation?

I wrote this after seeing the responses to my last question.

From the responses I can tell my problem was not understood.

What this simulation does is play a rudimentary “game” between pairs of teams once per “week” and record the player results and league standings. Since the outcome is stochastic, this is repeated multiple times. In the real version, the ratings/parameters of the players would change at each simulation step and the output would be compared to other stats, but I left that out for now as it isn’t relevant to storing the results.

It only requires:

using DataFrames, Random

Then these utility functions to generate players/teams/schedule, the “game engine”, etc:

# Generate random player names
function gen_pnames(nplayers)
    return [randstring('A':'Z', 3) for _ in 1:nplayers]
end

# Generate team names ("A", "B", "C", ...)
function gen_tnames(nteams)
    return string.(collect('A':'Z')[1:nteams])
end

# Make schedule where teams play each other once
function makeschedule(teamnames)
    nteams = length(teamnames)

    if isodd(nteams)
        teamnames = [teamnames; "Bye"]
        nteams    = nteams + 1
    end

    df  = DataFrame(reshape(teamnames, Int(nteams/2), 2), :auto)
    tmp = deepcopy(df)
    nw  = nteams - 1
    nr  = size(df, 1)

    sched    = Vector{DataFrame}(undef, nw)
    sched[1] = df

    # Rotate df clockwise around [1, 1]
    for w in 2:nw
        tmp[1, 1]  = df[1, 1]
        tmp[1, 2]  = df[2, 1]
        tmp[nr, 1] = df[nr, 2]

        if(nr > 2)
            for i in reverse(3:nr)
                tmp[i - 1, 1] = df[i, 1]
            end
        end

        for i in 2:nr
            tmp[i, 2] = df[i - 1, 2]
        end

        sched[w] = df = deepcopy(tmp)
    end
    return sched
end

# Generate rosters for each team
function gen_rosters(teamnames, nplayers)
    nteams  = length(teamnames)
    rosters = Vector{DataFrame}(undef, nteams)
    for i in 1:nteams
        rosters[i] = DataFrame(pname = gen_pnames(nplayers),
                               skill = rand(1:10, nplayers))
    end

    rosters = [teamnames, rosters]
    return rosters
end

# Play a game between two teams (simply subtract each player skill)
function playgame(team1, team2, teamnames, rosters)
    idx1 = findfirst(teamnames .== team1)
    idx2 = findfirst(teamnames .== team2)

    gameresult = rosters[2][idx1].skill - rosters[2][idx2].skill
    gameresult = gameresult + rand(-3:3, length(gameresult))

    if sum(gameresult) == 0
        teamresult = [0, 0]
    else
        teamresult = ifelse(sum(gameresult) > 0, [1, 0], [0, 1])
    end

    teams  = DataFrame(teams = [team1, team2], winner = teamresult)
    stats1 = DataFrame(pname = rosters[2][idx1].pname,
                       pts   = gameresult)
    stats2 = DataFrame(pname = rosters[2][idx2].pname,
                       pts   = -gameresult)

    return (teams = teams, stats1 = stats1, stats2 = stats2)
end

# Accumulates wins/losses/etc in the standings
function update_lgstats(lgstats, gamesres)
    for i in eachindex(gamesres)
        idx1 = findfirst(lgstats.Team .== gamesres[i][1].teams[1])
        idx2 = findfirst(lgstats.Team .== gamesres[i][1].teams[2])

        win = gamesres[i][1].winner

        lgstats.Games[[idx1, idx2]] += [1, 1]
        if sum(win) == 0
            lgstats.Draw[[idx1, idx2]] += [1, 1]
        end
        lgstats.Win[[idx1, idx2]]  += win
        lgstats.Loss[[idx1, idx2]] += reverse(win)
    end
    return lgstats
end

# Reset standings to zeros
function reset_lgstats(teamnames)
    nteams  = length(teamnames)
    lgstats = DataFrame(Team  = teamnames,
                        Games = zeros(Int32, nteams),
                        Win   = zeros(Int32, nteams),
                        Loss  = zeros(Int32, nteams),
                        Draw  = zeros(Int32, nteams))
    return lgstats
end

# Placeholder that returns a random number
function calcmetrics(simres)
    return DataFrame(rmse = rand(1:100))
end

And here would be what is in the main simulation function:

nteams   = 4
nplayers = 2
nsim     = 1
nrep     = 3

teamnames = gen_tnames(nteams)
rosters   = gen_rosters(teamnames, nplayers)
sched     = makeschedule(teamnames)
nweeks    = nteams - 1

allres = Vector{}(undef, nsim)
for s in 1:nsim
    # params = genparams()
    simres = Vector{}(undef, nrep)

    # Sim same season nrep times
    for r in 1:nrep
        lgstats   = reset_lgstats(teamnames)
        seasonres = Vector{}(undef, nweeks)

        # Play Season
        for w in 1:nweeks
            games  = deepcopy(sched[w])
            chkbye = Matrix(games) .== "Bye"
            if sum(chkbye) == 1
                idx_bye = findall(chkbye)[1][1]
                deleteat!(games, idx_bye)
            end
            ngames   = nrow(games)
            gamesres = Vector{NamedTuple}(undef, ngames)

            # Play games for week w
            for g in 1:ngames
                team1 = games[g, 1]
                team2 = games[g, 2]
                gamesres[g] = playgame(team1, team2, teamnames, rosters)
            end
            lgstats      = update_lgstats(lgstats, gamesres)
            seasonres[w] = [gamesres, deepcopy(lgstats)]
        end

        simres[r] = seasonres
    end

    metrics   = calcmetrics(simres)
    allres[s] = [simres, metrics]
end

What I want to know is how to implement a better way to structure allres than this vector of vectors of dataframes and tuples I have going on. Also, anything else weird I am doing.

At first I figured that was too much for here, but maybe someone will take a look, so why not. My impression from reading around on this forum, Stack Exchange, etc. is that this is a common issue not really addressed by the MWEs we normally see. So perhaps it could be generally useful.

Thanks to anyone who takes a look!

There are varying philosophies on this. One reasonable way is to make a single big dataframe with sim and rep columns and some data columns, and just query out the rows you want whenever you want to analyze data from a particular sim or rep.
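As a rough illustration of the single-table idea, here is a minimal sketch using DataFrames.jl. The column names (`sim`, `rep`, `week`, `team`, `win`, `loss`, `draw`) and the placeholder results are assumptions made up for this example; they are not taken from the simulation code above.

```julia
using DataFrames

# Hypothetical long-format table: one row per team, per week, per rep, per sim.
standings = DataFrame(sim = Int[], rep = Int[], week = Int[],
                      team = String[], win = Int[], loss = Int[], draw = Int[])

# Placeholder results (not the real game engine): each team "wins" in week 1 only.
for s in 1:2, r in 1:3, w in 1:3, t in ["A", "B", "C", "D"]
    push!(standings, (sim = s, rep = r, week = w, team = t,
                      win = Int(w == 1), loss = 0, draw = 0))
end

# Query out the rows for one replicate of one simulation.
one_rep = standings[(standings.sim .== 1) .& (standings.rep .== 2), :]

# Total wins per team within each (sim, rep) season.
totals = combine(groupby(standings, [:sim, :rep, :team]), :win => sum => :wins)
```

A second table with the same `sim`, `rep`, and `week` key columns could hold the per-player points, so the two can be joined or filtered on the same keys.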


Yeah, I guess separate ones for players and teams. That seemed too simple to me, so there must be a tradeoff. I suspect there are more efficient ways to do it. But it is good to know that would be considered OK. Thanks!

I am really just trying to learn some best/acceptable practices here before I get used to bad ones.

You might be interested in the “tidy data” section of the “R for Data Science” book.


I’m coming from R, so I am familiar with the Tidyverse. I assumed that approach doesn’t scale in Julia either, but maybe I am wrong. In your experience, if I run the above simulation 1e6 times or so, would I notice a difference in speed and memory with the “tidy” approach?

I would set up the data structures in the nice tidy way first and see how that goes naively. If you experience performance problems, report those on Discourse with your code and people can help you speed it up (after looking at Performance Tips · The Julia Language). Julia tends to be significantly faster than R in general, and there are always ways to speed it up, but imho it can be helpful to start with a clean organization before starting to muck it up with performance.
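To see "how that goes naively", one option is to time a run and check how much memory the stored results occupy. The sketch below uses only Base Julia; `fill_rows` and its row layout are hypothetical stand-ins for whatever the real simulation would record.

```julia
# Hypothetical stand-in for storing n tidy rows of results.
function fill_rows(n)
    rows = Vector{NamedTuple{(:sim, :rep, :pts), Tuple{Int, Int, Int}}}(undef, n)
    for i in 1:n
        rows[i] = (sim = i, rep = 1, pts = i % 7)
    end
    return rows
end

fill_rows(10)                         # warm-up call so compilation is not timed

t0    = time()
rows  = fill_rows(1_000_000)
secs  = time() - t0                   # wall-clock seconds for the run
bytes = Base.summarysize(rows)        # memory reachable from `rows`, in bytes

println("seconds: ", secs, "   MiB: ", bytes / 2^20)
```

Numbers like these give a baseline to quote when asking for help with a specific bottleneck.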


Thanks, I like this strategy. If everyone starts from the same easy-to-understand structure, it will be easier to provide general answers to improve it. It makes a lot of sense.

I am confident that I will push this to the edge, though, so I will need to optimize, at least for memory, in the future.

This is the way I store my simulation results: Examples · KiteUtils.jl

Just records (structs) in an array, with only scalar values or fixed-size vectors in the struct. I would call that 2.5 dimensional.
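A minimal sketch of the records-in-an-array idea applied to this thread's simulation might look as follows. The `PlayerResult` type, its fields, and the placeholder points are assumptions invented for illustration; they are not taken from KiteUtils.jl.

```julia
# Hypothetical record type: one entry per player per game, scalar fields only.
struct PlayerResult
    sim::Int32
    rep::Int32
    week::Int32
    team::Int32     # index into teamnames rather than a String
    player::Int32   # index into the team's roster
    pts::Int32
end

results = PlayerResult[]
sizehint!(results, 2 * 3 * 3 * 4)   # reserve space up front when the size is known

# Placeholder data: points are -1, 0, 1 for weeks 1, 2, 3.
for s in 1:2, r in 1:3, w in 1:3, t in 1:4
    push!(results, PlayerResult(s, r, w, t, 1, w - 2))
end

# Pull out one replicate of one simulation and sum its points.
rep2      = filter(x -> x.sim == 1 && x.rep == 2, results)
total_pts = sum(x.pts for x in rep2)
```

Because every field is a plain integer, the struct is an `isbits` type and the vector stores the records inline, 24 bytes each, rather than as pointers to separate objects.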

• simple and fast