What is the right way to store/record the results of this nested simulation?

Tetrakai · December 15, 2023, 12:27am

I wrote this after seeing the responses to my last question.

From the responses I can tell my problem was not understood.

What this simulation does is play a rudimentary “game” between pairs of teams once per “week” and record the player results and league standings. Since the outcome is stochastic, this is repeated multiple times. In the real version, the ratings/parameters of the players would change at each simulation step and the output would be compared to other stats, but I left that out for now as it isn’t relevant to storing the results.

It only requires:

using DataFrames, Random

Then these utility functions to generate players/teams/schedule, the “game engine”, etc:

# Generate random player names
function gen_pnames(nplayers)
    return [randstring('A':'Z', 3) for _ in 1:nplayers]
end

# Generate random team names
function gen_tnames(nteams)
    return string.(collect('A':'Z')[1:nteams])
end

# Make a schedule where each team plays every other team once
function makeschedule(teamnames)
    nteams = length(teamnames)

    if isodd(nteams)
        teamnames = [teamnames; "Bye"]
        nteams    = nteams + 1
    end

    df  = DataFrame(reshape(teamnames, nteams ÷ 2, 2), :auto)
    tmp = deepcopy(df)
    nw  = nteams - 1
    nr  = size(df, 1)

    sched    = Vector{DataFrame}(undef, nw)
    sched[1] = df

    # Rotate df clockwise around [1, 1]
    for w in 2:nw
        tmp[1, 1]  = df[1, 1]
        tmp[1, 2]  = df[2, 1]
        tmp[nr, 1] = df[nr, 2]

        if nr > 2
            for i in reverse(3:nr)
                tmp[i - 1, 1] = df[i, 1]
            end
        end

        for i in 2:nr
            tmp[i, 2] = df[i - 1, 2]
        end

        sched[w] = df = deepcopy(tmp)
    end
    return sched
end


# Generate rosters for each team
function gen_rosters(teamnames, nplayers)
    nteams  = length(teamnames)
    rosters = Vector{DataFrame}(undef, nteams)
    for i in 1:nteams
        rosters[i] = DataFrame(pname = gen_pnames(nplayers),
                               skill = rand(1:10, nplayers))
    end

    rosters = [teamnames, rosters]
    return rosters
end

# Play a game between two teams (simply subtract each player skill)
function playgame(team1, team2, teamnames, rosters)
    idx1 = findfirst(teamnames .== team1)
    idx2 = findfirst(teamnames .== team2)


    gameresult = rosters[2][idx1].skill - rosters[2][idx2].skill
    gameresult = gameresult + rand(-3:3, length(gameresult))

    if sum(gameresult) == 0
        teamresult = [0, 0]
    else
        teamresult = ifelse(sum(gameresult) > 0, [1, 0], [0, 1])
    end

    teams  = DataFrame(teams = [team1, team2], winner = teamresult)
    stats1 = DataFrame(pname = rosters[2][idx1].pname,
                       pts   = gameresult)
    stats2 = DataFrame(pname = rosters[2][idx2].pname,
                       pts   = -gameresult)

    return (teams = teams, stats1 = stats1, stats2 = stats2)
end

# Accumulates wins/losses/etc in the standings
function update_lgstats(lgstats, gamesres)
    for i in eachindex(gamesres)
        idx1 = findfirst(lgstats.Team .== gamesres[i][1].teams[1])
        idx2 = findfirst(lgstats.Team .== gamesres[i][1].teams[2])

        win = gamesres[i][1].winner

        lgstats.Games[[idx1, idx2]] += [1, 1]
        if sum(win) == 0
            lgstats.Draw[[idx1, idx2]] += [1, 1]
        end
        lgstats.Win[[idx1, idx2]]  += win
        lgstats.Loss[[idx1, idx2]] += reverse(win)
    end
    return lgstats
end

# Reset standings to zeros
function reset_lgstats(teamnames)
    nteams  = length(teamnames)
    lgstats = DataFrame(Team  = teamnames,
                        Games = zeros(Int32, nteams),
                        Win   = zeros(Int32, nteams),
                        Loss  = zeros(Int32, nteams),
                        Draw  = zeros(Int32, nteams))
    return lgstats
end

# Placeholder that returns a random number
function calcmetrics(simres)
    return DataFrame(rmse = rand(1:100))
end

And here would be what is in the main simulation function:

nteams   = 4
nplayers = 2
nsim     = 1
nrep     = 3

teamnames = gen_tnames(nteams)
rosters   = gen_rosters(teamnames, nplayers)
sched     = makeschedule(teamnames)
nweeks    = nteams - 1

allres = Vector{Any}(undef, nsim)
for s in 1:nsim
    # params = genparams()
    simres = Vector{Any}(undef, nrep)

    # Sim same season nrep times
    for r in 1:nrep
        lgstats   = reset_lgstats(teamnames)
        seasonres = Vector{Any}(undef, nweeks)

        # Play Season
        for w in 1:nweeks
            games  = deepcopy(sched[w])
            chkbye = Matrix(games) .== "Bye"
            if sum(chkbye) == 1
                idx_bye = findall(chkbye)[1][1]
                deleteat!(games, idx_bye)
            end
            ngames   = nrow(games)
            gamesres = Vector{NamedTuple}(undef, ngames)

            # Play games for week w
            for g in 1:ngames
                team1 = games[g, 1]
                team2 = games[g, 2]
                gamesres[g] = playgame(team1, team2, teamnames, rosters)
            end
            lgstats      = update_lgstats(lgstats, gamesres)
            seasonres[w] = [gamesres, deepcopy(lgstats)]

        end

        simres[r] = seasonres
    end

    metrics   = calcmetrics(simres)
    allres[s] = [simres, metrics]
end

What I want to know is how to structure allres better than this vector of vectors of dataframes and tuples I have going on. Also, please point out anything else weird I am doing.

At first I figured that was too much for here, but maybe someone will take a look so why not see. My impression from reading around on this forum and stackexchange, etc is that this is a common issue not really addressed by the MWEs we normally see. So perhaps it could be generally useful.

Thanks to anyone who takes a look!

jar1 · December 15, 2023, 12:45am

There are varying philosophies on this. One reasonable way is to make a single big dataframe with sim and rep columns and some data columns, and just query out the rows you want whenever you want to analyze data from a particular sim or rep.
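As a rough illustration of that single-table idea, here is a minimal sketch. The column names (`sim`, `rep`, `week`, `team`, `win`) and the filler values are assumptions made for the example, not the game engine from the question; only `DataFrames` is required.

```julia
using DataFrames

# Hypothetical long-format table: one row per team per week per rep per sim
standings = DataFrame(sim = Int[], rep = Int[], week = Int[],
                      team = String[], win = Int[])

# Stand-in for the simulation loops. The "result" is deterministic filler
# so that the example is reproducible.
for s in 1:2, r in 1:3, w in 1:3, t in ["A", "B"]
    push!(standings, (sim = s, rep = r, week = w, team = t, win = (s + r + w) % 2))
end

# Query out the rows for one particular sim and rep
one_season = subset(standings, :sim => ByRow(==(1)), :rep => ByRow(==(2)))

# Or aggregate: total wins per team within each sim/rep
totals = combine(groupby(standings, [:sim, :rep, :team]), :win => sum => :wins)
```

A second table with the same `sim`/`rep`/`week` key columns could hold the per-player stats, so the two can be joined or grouped on those keys when needed.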

Tetrakai · December 15, 2023, 1:08am

Yea, I guess separate ones for players and teams. That seemed too simple to me, so there must be a tradeoff. I suspect there are more efficient ways to do it. But it is good to know that would be considered ok. Thanks!

I am really just trying to learn some best/acceptable practices here before I get used to bad ones.

jar1 · December 15, 2023, 1:20am

You might be interested in the “tidy data” section of the “R for Data Science” book.

Tetrakai · December 15, 2023, 1:48am

I’m coming from R, so I am familiar with the Tidyverse. I assumed that approach doesn’t scale in Julia either, but maybe I am wrong. In your experience, if I run the above simulation 1e6 times or whatever, would I notice a difference in speed and memory if I try the “tidy” approach?

jar1 · December 15, 2023, 1:55am

I would set up the data structures in the nice tidy way first and see how that goes naively. If you experience performance problems, report those on Discourse with your code and people can help you speed it up (after looking at Performance Tips · The Julia Language). Julia tends to be significantly faster than R in general, and there are always ways to speed it up, but imho it can be helpful to start with a clean organization before starting to muck it up with performance optimizations.
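One way to "see how that goes" is to time a naive tidy fill and check its approximate memory footprint. The sketch below assumes a hypothetical helper `fill_tidy` and filler column values; `@elapsed` and `Base.summarysize` are standard Julia tools, but the numbers they report will depend on the machine and the real data.

```julia
using DataFrames

# Hypothetical helper: fill a tidy table with n filler rows, one push! per row
function fill_tidy(n)
    df = DataFrame(sim = Int32[], rep = Int32[], pts = Int32[])
    for i in 1:n
        push!(df, (sim = i, rep = 1, pts = i % 7))
    end
    return df
end

df = fill_tidy(100_000)            # first call also compiles the function
t  = @elapsed fill_tidy(100_000)   # second call measures run time only
mb = Base.summarysize(df) / 1e6    # approximate in-memory size in MB

println("rows = ", nrow(df), ", seconds = ", t, ", MB ≈ ", mb)
```

Narrower element types such as `Int32` are one of the simpler levers for memory, which is why the columns are declared that way here.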

Tetrakai · December 15, 2023, 2:17am

Thanks, I like this strategy. If everyone starts from the same easy to understand structure then it will be easier to provide general answers to improve it. It makes a lot of sense.

I am confident that I will push this to the edge, though, so I will need to optimize for memory, at least, in the future.

ufechner7 · December 15, 2023, 5:10am

Just as an answer to your reformulated question:

This is the way I store my simulation results: Examples · KiteUtils.jl

Just records (structs) in an array. And only scalar values or fixed size vectors in the struct. I would call that 2.5 dimensional.

Advantages:

simple and fast

Disadvantages:

you must know the maximum number of entries per log file in advance
adding fields requires modifying the code
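The records-in-an-array layout might look roughly like the sketch below. The `GameRecord` type, its fields, and the filler values are all hypothetical, chosen to mirror the simulation in the question; this is not the KiteUtils.jl API.

```julia
# Hypothetical record type: only scalar fields, so every entry has a fixed size
struct GameRecord
    sim::Int32
    rep::Int32
    week::Int32
    team1::Int32    # index into the team-name vector rather than a String
    team2::Int32
    margin::Int32   # summed point difference from team1's point of view
end

function run_log(nsim, nrep, nweeks, ngames)
    # The number of entries is known in advance, so allocate the log once
    records = Vector{GameRecord}(undef, nsim * nrep * nweeks * ngames)
    k = 0
    for s in 1:nsim, r in 1:nrep, w in 1:nweeks, g in 1:ngames
        k += 1
        # Filler values stand in for a real game result
        records[k] = GameRecord(s, r, w, 2g - 1, 2g, (s + r + w + g) % 5 - 2)
    end
    return records
end

records = run_log(2, 3, 3, 2)

# Analysis happens afterwards by filtering or reducing over the flat array
wins_team1 = count(rec -> rec.sim == 1 && rec.margin > 0, records)
```

Because every field is a plain number, the whole log is one contiguous block of memory, which is where the "simple and fast" advantage comes from. Both listed disadvantages show up too: the length is fixed at allocation, and a new field means editing the struct.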

I used more complex structures for storing results in the past, like different types of messages at a different rate in one log file, but analyzing the result became a nightmare.

There is no need to store results in the form you want to see them.

Tetrakai · December 15, 2023, 4:25pm

Thanks, this does look very similar to what I am looking for.