How to best preallocate nested tuples of dataframes to store simulation results?

I’m trying to run some sports game simulations and programmatically store the results as efficiently as possible. Even if it is not strictly necessary for my purpose, I am trying to learn how to do it right.

Right now my order of concern here is: Memory > Readability > Speed. I came up with this, which is roughly how I would start out in R using a nested list:

###########################
#### Desired Structure ####
###########################
# -sim1
#   -metrics = {repeat(Float32, 3), Bool}(1x4)
#   -lgstats
#       -rep1 = {Int16, String15, repeat(Int32, 8)}(4x10)
#       -rep2 = {Int16, String15, repeat(Int32, 8)}(4x10)
#       -[...]
#   -team1
#       -pinfo  = {String15, Int8, String3, String3}(30x4)
#       -params = {Int8}(30x6)
#       -pstats
#           -rep1 = {Int16}(30x7)
#           -rep2 = {Int16}(30x7)
#           -[...]
#   -team2
#       -[...]
#   -team3
#       -[...]
#   -team4
#       -[...]
# -sim2
#   -[...]
# -sim3
#   -[...]

Each simulation (sim) uses a unique set of player parameters/ratings (params), and is replicated (rep) multiple times per sim step (the output is stochastic).

You can see there are various metrics (rmse, etc) and league-level stats (lgstats). Then for each team there is player info (pinfo) like name/etc, along with the player params and stats (pstats).

After much messing around, I managed to create this:

using DataFrames, InlineStrings

# Variable
nsim  = 3
nrep  = 2
nteam = 4

# Constant (see desired structure)
nmetrics = 4
nplayer  = 30
nlgstats = 10
nparams  = 6
npstats  = 7

# Team-level Tuple
teamres = Tuple{DataFrame,              # pinfo
                DataFrame,              # params
                NTuple{nrep, DataFrame} # pstats,  rep[1:nrep]
                }                       # team1

# Simulation-level Tuple
simres = Tuple{DataFrame,               # nmetrics
               NTuple{nrep, DataFrame}, # lgstats, rep[1:nrep]
               NTuple{nteam, teamres}   # teams,   team[1:nteam]
               }                        # sim1

# Final Result
allres = NTuple{nsim, simres}

When run, it seems to work:

julia> # Team-level Tuple
       teamres = Tuple{DataFrame,              # pinfo
                       DataFrame,              # params
                       NTuple{nrep, DataFrame} # pstats,  rep[1:nrep]
                       }                       # team1      
Tuple{DataFrame, DataFrame, Tuple{DataFrame, DataFrame}}

       
julia> # Simulation-level Tuple
       simres = Tuple{DataFrame,               # nmetrics
                      NTuple{nrep, DataFrame}, # lgstats, rep[1:nrep]
                      NTuple{nteam, teamres}   # teams,   team[1:nteam]
                      }                        # sim1       
Tuple{DataFrame, Tuple{DataFrame, DataFrame}, NTuple{4, Tuple{DataFrame, DataFrame, Tuple{DataFrame, DataFrame}}}}

      
julia> # Final Result
      allres = NTuple{nsim, simres}
Tuple{Tuple{DataFrame, Tuple{DataFrame, DataFrame}, NTuple{4, Tuple{DataFrame, DataFrame, Tuple{DataFrame, DataFrame}}}}, Tuple{DataFrame, Tuple{DataFrame, DataFrame}, NTuple{4, Tuple{DataFrame, DataFrame, Tuple{DataFrame, DataFrame}}}}, Tuple{DataFrame, Tuple{DataFrame, DataFrame}, NTuple{4, Tuple{DataFrame, DataFrame, Tuple{DataFrame, DataFrame}}}}}

But I cannot figure out how to:

  1. Name the elements of these tuples (I could not get NamedTuple{} to work here).

  2. Preallocate dataframes of the desired sizes and types.

  3. Programmatically add my data to this DataType I created.

And perhaps this is the totally wrong way to go about it. Please tell me if so, because I don’t know what I’m doing here. But even then, I would be interested in knowing how to make this method work (or why it won’t).

I would try to adapt to the DrWatson.jl workflow; its authors have already sorted out the details of saving simulation results, parameters, etc.:

https://juliadynamics.github.io/DrWatson.jl/dev


Thanks, I am currently trying out Dr. Watson for this project and plan to do exactly that. It seems very well thought out.

I guess I should clarify that these are the simulation results I will save. Each of these simulation steps takes <1s so saving after each step would generate an insane number of files and/or be a bottleneck.

If you want high speed you could have a look at Arrow.jl and StructArrays.jl… I use them together for logging of flight data 20 times per second…

They use memory mapped arrays… Much faster than DataFrames, but less convenient to use…

Well, if speed is not your major concern, you could also use Arrow together with DataFrames.jl.

Arrow stores the data in binary form on disk (much more compact than CSV) and in the same form in memory, so no conversion needed when loading or saving…


Both look very interesting, thanks. I’ll have to play around with them. But this feels like getting ahead of myself.

I guess you are saying try a structarray of nested namedtuples containing dataframes instead of what I did? I am not manipulating this data yet, only generating it and storing it in memory. Will I still benefit from this and Arrow.jl?

The thing is I couldn’t even figure out how to do normal nested namedtuples of dataframes. I think you are assuming I know way more Julia than I do… My problem is much more basic.
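For reference, naming the elements can be sketched with a `NamedTuple` type, which takes the names as a tuple of Symbols and the field types as a `Tuple` type. This is only a minimal sketch: `Matrix{Int16}` stands in for `DataFrame` so that it runs without any packages, and the names `Table`, `TeamRes` and `make_team` are made up for illustration.

```julia
# Stand-in for a table; swap in DataFrame once DataFrames.jl is loaded.
Table = Matrix{Int16}

nrep = 2

# A NamedTuple *type*: names as a tuple of Symbols, field types as a Tuple type.
TeamRes = NamedTuple{(:pinfo, :params, :pstats),
                     Tuple{Table, Table, NTuple{nrep, Table}}}

# Hypothetical helper: build one team's results in a single step.
make_team() = (pinfo  = zeros(Int16, 30, 4),
               params = zeros(Int16, 30, 6),
               pstats = ntuple(_ -> zeros(Int16, 30, 7), nrep))

team1 = make_team()

team1 isa TeamRes        # the value matches the declared type
size(team1.pstats[2])    # fields are reached by name, then by index
```

The same pattern nests: a simulation-level NamedTuple can hold a tuple of these team-level values.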

Why would you want to use nested tuples of DataFrames? A DataFrame is like a database. Normally you have only one of them where you store everything. For nested structures HDF5 might be a better choice.


My data has a hierarchical structure with multiple types. So tuples of DataFrames fit the bill for an easily understandable structure to me (if I could get them named).

To be clear, I have no need to read/write anything from/to disk efficiently. I am just trying to organize my data in an understandable way and keep it in memory while not wasting ram.

No. Arrow would allow you to store one 2D table, which could be one DataFrame. You can use StructArrays instead of DataFrames.

I think you cannot pre-allocate the data structure you currently have in mind. You can pre-allocate arrays or tables.
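Preallocating a single table of a given size and element type could look roughly like this. It is a sketch that assumes DataFrames.jl is installed; the column names `name` and `age` are invented for illustration, and the sizes come from the first post.

```julia
using DataFrames

nplayer = 30
npstats = 7

# One uninitialized Int16 column per stat; `:auto` names them x1, x2, ...
pstats = DataFrame([Vector{Int16}(undef, nplayer) for _ in 1:npstats], :auto)

# Columns of different types can be given by name instead.
pinfo = DataFrame(name = fill("", nplayer),
                  age  = zeros(Int8, nplayer))

# Fill in place later; the value is converted to the column's element type.
pstats[1, :x1] = 12
```

The `undef` columns contain arbitrary values until they are overwritten, so every cell should be written before it is read.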

“Arrow would allow you to store one 2D table, which could be one DataFrame.”

It would have to be multiple tables, and (I think) would require way more effort to get out what I want later. The way I wrote it, this would be very easy. However, I would want to try that if it used way less memory or was much faster.

“I think you cannot pre-allocate the data structure you currently have in mind. You can pre-allocate arrays or tables.”

Possibly, I have no idea. I was able to make some kind of DataType that looks like what I want though. Is there a reason why Julia doesn’t allow this?

Tuples are immutable. You can preallocate the first initialisation, but after that you will need brand new allocations.

For preallocations, I use closures but there might be better alternatives.

That was my understanding. I don’t plan on modifying this data later. But I can still iteratively fill in the tuples right? Just not change the values later.

Off the top of my head, I’m not certain. But I think not.

Why not use a normal struct?

Because I have no idea what I am doing. That is my question. How would I create a struct to hold data organized in the way I laid out above, if that is possible?

Edit:
Alternatively, why wouldn’t I want to create a struct that looks like that?

No, you cannot. The content of a tuple has to be known when it is created. That is the meaning of “immutable”. There might be work-arounds, but I doubt that they work for DataFrames…

Tuples are intended to be used with small, basic data types, not with large structures like DataFrames…

Tuples in Julia are immutable, ordered collections of values, written separated by commas, and the values may have the same or different types. They resemble arrays, except that a tuple's length and the type of each element are part of its type, whereas an array has a single element type. The elements of a tuple cannot be reassigned because tuples are immutable.
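To make the distinction concrete, here is a small sketch in plain Base Julia, using vectors as a stand-in for larger mutable objects. The slots of a tuple cannot be reassigned, but a mutable object stored in a slot can still be modified in place. The variable names are illustrative only.

```julia
reps = (zeros(Int16, 2), zeros(Int16, 2))   # a tuple holding two vectors

# The tuple's slots are fixed: assigning to one throws a MethodError.
slot_is_fixed = try
    reps[1] = ones(Int16, 2)
    false
catch err
    err isa MethodError
end

# The vectors inside are still mutable, so they can be filled in place.
reps[1][1] = 7

# A tuple can also be built in one step once all pieces are known.
filled = ntuple(i -> fill(Int16(i), 2), 3)
```

A DataFrame is a mutable object as well, so a tuple of preallocated DataFrames can have its tables filled in place even though the tuple itself cannot be changed.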

Yea, after playing with it more I saw I would need to have the parent type be a StructArray or something. Then the simres tuple could be filled all at once at the end of each step. I think.

Anyway, from the responses it is clear I am doing something odd here. But I also find it hard to believe I am the first person to use Julia who wants to store data in memory organized something like this.

I mean it is such a standard, logical thing to do in R you’ll find R 101 examples of it all over. Eg: https://statisticsglobe.com/create-nested-list-in-r

Here’s what that first example could look like in Julia:

julia> list_1 = [12:20, ('a':'z')[16:-1:11], ["yyyy"]]
3-element Vector{AbstractVector}:
 12:20
 'p':-1:'k'
 ["yyyy"]

julia> list_2 = [4:8, ('a':'z')[7:-1:1], ["xxx"]]
3-element Vector{AbstractVector}:
 4:8
 'g':-1:'a'
 ["xxx"]

julia> list_3 = [["Another"], ["list (that's actually a vector of vectors)"], ["in Julia"]]
3-element Vector{Vector{String}}:
 ["Another"]
 ["list (that's actually a vector of vectors)"]
 ["in Julia"]

julia> my_nested_list_1 = [list_1, list_2, list_3]
3-element Vector{Vector{AbstractVector}}:
 [12:20, 'p':-1:'k', ["yyyy"]]
 [4:8, 'g':-1:'a', ["xxx"]]
 [["Another"], ["list (that's actually a vector of vectors)"], ["in Julia"]]

julia> my_nested_list_1[2][2]
'g':-1:'a'

I think you should compose structs of simpler types. I’m not following exactly your objective, but something like below:

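# Placeholder types so the example loads; give them whatever fields you need.
struct Metric end
struct Rep end
struct Params end

# A type must be defined before it is used as a field type,
# so the innermost structs come first.
struct PInfo
    a::String
    b::Int64
    c::String
    d::String
end
struct Team
    pinfo::PInfo
    params::Params
    pstats::Vector{Rep}
end
struct Sim
    metrics::Vector{Metric}
    lgstats::Vector{Rep}
    teams::Vector{Team}
end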

Your collections should be Vector if you need to mutate (update values or add to the collection). Otherwise Tuple collections will be faster.


If you have scalar fields that you want to mutate, then use a mutable struct. Just add the keyword mutable in front of your struct.
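A minimal sketch of the difference, with made-up type names (`League`, `Counter`) that are not part of the thread's design:

```julia
# Immutable struct: fields cannot be reassigned, but a Vector field can
# still grow or have its elements changed.
struct League
    name::String
    scores::Vector{Int}
end

# Mutable struct: scalar fields can be reassigned too.
mutable struct Counter
    nsims::Int
end

lg = League("demo", Int[])
push!(lg.scores, 3)    # allowed: mutates the vector, not the field binding

c = Counter(0)
c.nsims += 1           # allowed only because Counter is mutable
```

So an immutable struct with `Vector` fields is often enough when only the collections need to change.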

Thanks, you seem to have understood.

So when I tried playing with structs, I couldn’t figure out how to avoid writing out every name and type. Eg,

namevec = [:a, :b, :c, :d]   # field names as Symbols
typevec = [String; Int64; repeat([String], 2)]

# Is there a function like this?
PInfo = create_struct(namevec, typevec)
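As far as I know there is no built-in `create_struct`, but one way to get something similar without writing out each field is to build a `NamedTuple` type from the two vectors. This is a sketch in plain Base Julia; `PInfoNT` and the example values are hypothetical. Generating a real `struct` from vectors would need metaprogramming, which is not shown here.

```julia
namevec = [:a, :b, :c, :d]
typevec = [String, Int64, String, String]

# Build the NamedTuple type: names as a tuple of Symbols, types splatted
# into a Tuple type.
PInfoNT = NamedTuple{Tuple(namevec), Tuple{typevec...}}

# Construct a value by passing a tuple of field values in the same order.
p = PInfoNT(("Ann", 23, "PG", "LAL"))

p.b    # access by name, just like a struct field
```

The result behaves much like an immutable struct for field access, at the cost of a longer type name in printed output.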

And I’m guessing if I want to preallocate for eg 4 teams then I would use:

struct Sim
    metrics::Vector{Metric}
    lgstats::Vector{Rep}
    teams::Vector{Team}(undef, 4)
end

I will play around with it but this seems to be what I want.
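One note on that last guess: a field declaration can only carry a type, so `Vector{Team}(undef, 4)` cannot appear after `::`. The allocation would normally go in a constructor instead. A rough sketch, using trimmed-down stand-in types (`MiniTeam`, `MiniSim`, and the `id` field are invented here to keep it short):

```julia
struct MiniTeam
    id::Int                      # hypothetical field for illustration
end

struct MiniSim
    teams::Vector{MiniTeam}      # the field declares only the type
end

# Hypothetical outer constructor: allocate space for `nteam` teams up front.
MiniSim(nteam::Integer) = MiniSim(Vector{MiniTeam}(undef, nteam))

sim = MiniSim(4)
for i in 1:4
    sim.teams[i] = MiniTeam(i)   # fill the preallocated slots
end
```

The struct itself is immutable, but the vector it holds can be filled or resized after construction.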