How to best preallocate nested tuples of dataframes to store simulation results?

I doubt it, but maybe there are macros for this. It seems unlikely, since a dynamic type like you propose would not allow the type to be compiled ahead of time for dispatch.

In general, you need to define the types (structs) explicitly first. Then create variables as instances of those types.

julia> myinfo = PInfo("Hello", 2, "John", "Snow")
PInfo("Hello", 2, "John", "Snow")

julia> typeof(myinfo)
PInfo

No. Again, the allocation occurs when an instance of the type is constructed, not when the type is defined.

Define a simple type Test:

struct Test
    v::Vector{Int64}
end

It is possible to create an undef instance as you propose, but it is verbose and the data will be junk (not necessarily zeros):

julia> t = Test(Vector{Int64}(undef, 4))
Test([0, 0, 0, 0])

I would probably just do:

julia> t = Test(zeros(Int64, 4))
Test([0, 0, 0, 0])

Then, either way, you can start filling the values.

julia> t.v[3] = 5
5

julia> t
Test([0, 0, 5, 0])

Well in principle the function could just create a string, right?

PInfo_str = create_struct_string(structname, namevec, typevec)
"Pinfo = struct PInfo
           a::String
           b::Int64
           c::String
           d::String
         end"

Then write it (without quotes) to structs.jl, which then gets included. I'd guess I should also look into the eval function, but this seems like it would be a common annoyance. So once again, if it doesn't already exist, I am probably trying to misuse structs in some way.
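For what it's worth, a rough sketch of the eval route, which skips the string and the include step. define_struct is a made-up helper name, not something from Base or a package. Treat it as an illustration under that assumption rather than a recommended pattern:

```julia
# Hypothetical helper (not an existing Julia function): build a struct
# definition as an expression and evaluate it at the top level of module `mod`.
function define_struct(name::Symbol, fnames, ftypes; mod::Module = @__MODULE__)
    # One `fieldname::Type` expression per field.
    fields = [Expr(:(::), n, t) for (n, t) in zip(fnames, ftypes)]
    # `false` marks the struct as immutable.
    ex = Expr(:struct, false, name, Expr(:block, fields...))
    Core.eval(mod, ex)
    return ex
end

define_struct(:PInfo, [:a, :b, :c, :d], [String, Int64, String, String])

# The new type is usable from the next top-level statement onwards.
myinfo = PInfo("Hello", 2, "John", "Snow")
```

Since the type only exists after the eval has run, code inside the same function call that created it cannot use it directly. That is one reason people usually just write the struct out by hand.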

I mean, I just made this one using some copy-pasting and regex in Sublime Text:

struct lgtble
    Pl  ::Int16
    Team::String15 
    P   ::Int32 
    W   ::Int32 
    D   ::Int32 
    L   ::Int32 
    GF  ::Int32 
    GA  ::Int32 
    GD  ::Int32 
    Pts ::Int32
end

Julia users don't see the need for a more convenient way than fiddling around in the editor or writing ::Int32 over and over? That makes me think I'm trying to misuse structs somehow.

Well, if you use the Parameters.jl package (mauro3/Parameters.jl on GitHub: types with default field values, keyword constructors and (un-)pack macros), you can write:

@with_kw mutable struct Lg_table @deftype Int32
    Pl  ::Int16
    Team::String15 
    P
    W
    D
    L
    GF
    GA
    GD 
    Pts
end

But please use upper case names for types and lower case names for variables… And names with a length of one character are most of the time not a good idea: if you look at your code in a year, you won't remember what they mean…

There are parametric types.

struct lgtble{T}
    Pl  ::Int16
    Team::String15 
    P   ::T
    W   ::T 
    D   ::T 
    L   ::T 
    GF  ::T 
    GA  ::T 
    GD  ::T 
    Pts ::T
end
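
A small sketch of how the type parameter gets fixed when an instance is created. LeagueRow is a simplified, hypothetical stand-in for the struct above (plain String instead of String15, so no package is needed):

```julia
# Hypothetical, simplified row type; T is shared by all the counter fields.
struct LeagueRow{T<:Integer}
    team::String
    played::T
    points::T
end

# T is inferred from the arguments (both must have the same integer type).
row32 = LeagueRow("Eagles", Int32(10), Int32(21))

# T is given explicitly; the arguments are converted to Int64.
row64 = LeagueRow{Int64}("Hawks", 10, 21)

typeof(row32)   # LeagueRow{Int32}
```

Each concrete choice of T is its own concrete type, so a Vector{LeagueRow{Int32}} stores its elements with a fully known layout.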

NamedTuples are also a good choice if you want something you can define quickly. I would still use a more formal struct for your outer groupings like Sim, but some of your inner collections like this lgtble are probably more convenient as a NamedTuple.

julia> nt = (Pl=Int16(3), Team="Eagles", Pts=10)
(Pl = 3, Team = "Eagles", Pts = 10)

julia> typeof(nt)
@NamedTuple{Pl::Int16, Team::String, Pts::Int64}

You can dispatch a function on a named tuple by defining f(x::NamedTuple), but that doesn't give you the same level of flexibility down the line as defining your own custom types up front like f(x::Sim), f(x::Team), etc. I have seen on this forum that often people prototype with NamedTuples and then formalize those into composite types (structs) once they settle on their design. (This is largely because Revise.jl does not work on structs though.)
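
As a sketch of that prototype-then-formalize workflow (the names Team and summary_line are made up for illustration):

```julia
# Prototype: a NamedTuple needs no type definition up front.
team_nt = (name = "Eagles", pts = 10)
summary_line(x::NamedTuple) = "$(x.name): $(x.pts) pts"

# Later: a struct with the same fields, which gets its own method.
struct Team
    name::String
    pts::Int
end
summary_line(x::Team) = "Team $(x.name): $(x.pts) pts"

# Splatting a NamedTuple passes its values positionally, in field order.
team = Team(team_nt...)

summary_line(team_nt)   # uses the NamedTuple method
summary_line(team)      # uses the Team method
```

Because the field names are the same, code that only does x.name or x.pts keeps working after the switch.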


I'm sure there are fancier ways to define structs with macros, eval, and automatic code generation, but mine were never complicated enough to reach for those things.

But not if you want to pre-allocate memory… Then again, if you only want to save a data item once per second, why do you want to pre-allocate it? Normally pre-allocation only matters for real-time systems…

So there are two separate questions:

  1. What is the most logical way to organize my data?
  2. What can I do to make simulations on this data fast?

Start with #1 and organize in whatever way makes the most logical sense. Julia has a lot of flexibility here both with its native types and with the larger package ecosystem that has more specialized types. If everything is a Tuple or DataFrame as you have it now, then it is hard to tell what level of nesting you are in and what the different data frames all are. As a start, I suggest:

  • If your data is a table, put it in a DataFrame.
  • If your data is a set of parameters, put them in a NamedTuple
    (or Dict if you need mutable).
  • If your data is a collection of the same element type, put them in a Vector.
  • If your data is a collection of mismatched types, put them in a custom struct.

Then once you have a runnable example, you can ask #2. It is not too hard to swap types for performance once you have something in place.


One of my custom types for inspiration:

struct LocalFailureResults
    result::String
    maximum_damage::String
    table::DataFrame
    material_data::OrderedDict
end

julia> r = local_failure()
Material Inputs
Select the .xlsx file containing temperature-dependent data for the chosen material.
Selected Material: SA-723-3-2
Enter the average temperature in the material during operation in degrees Fahrenheit: (Default: 400) 

Local Failure Inputs
Select the .xls file containing the local failure principal stress vectors for the entire model.
Select the .xls file containing the local failure equivalent plastic strain for the entire model.

Local Failure Result Fields: result, maximum_damage, table, material_data

julia> r.result
"PASS"

julia> r.maximum_damage
"2.5%"

julia> r.table
36126×8 DataFrame
   Row │ Node Number  Maximum Principal Stress (psi)  Middle Prin ⋯
       │ Int64        Float64                         Float64     ⋯
───────┼───────────────────────────────────────────────────────────
     1 │       78415                      73795.0                 ⋯
     2 │      190016                      74811.0
     3 │      138631                      43446.0
   ⋮   │      ⋮                     ⋮                             ⋱
 36124 │      159857                      16592.0
 36125 │      241883                      16644.0                 ⋯
 36126 │      241889                      14510.0
                                   6 columns and 36080 rows omitted

julia> r.material_data
OrderedCollections.OrderedDict{String, Any} with 20 entries:
  "Material"                              => "SA-723-3-2"
  "Minimum Hardness (BHN)"                => missing
  "Yield Strength at Temperature (ksi)"   => 110.3
  "Tensile Strength at Temperature (ksi)" => 135.0
  "R"                                     => 0.817037
  "ϵₚ"                                    => 2.0e-5

There is no saving, just storing the results so I can inspect them at the end of the simulation. Maybe this will clarify:

Right now I am doing:

allres = Vector{Vector{DataFrame}}(undef, nsim)
for i in 1:nsim
   # set new parameter values
   simres = Vector{NamedTuple}(undef, nrep)
   for j in 1:nrep
       # Run nrep simulations that generate tuples of dataframes called lgtble, etc
       simres[j] = (lgtble, ...)
   end
   # calculate metrics and store in dataframe
   allres[i] = [simres, metrics]
end

Maybe that is just the best way to do it?

But I would like to have a more organized (named and nested) structure within that Vector{Vector{DataFrame}} instead of a vector of dataframes. And I figured that if I could tell it the types of the dataframe columns beforehand, this would be more efficient.
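
One possible shape for such a named, nested structure, as a sketch only. RepResult, SimResult, run_rep and run_all are hypothetical names, and the "table" here is a NamedTuple of vectors so the example runs without packages. With DataFrames.jl loaded, the same fields could hold DataFrames instead:

```julia
# One replicate: holds one table (here a NamedTuple of column vectors).
struct RepResult{T}
    lgtble::T
end

# One parameter setting: its parameters, all replicates, and summary metrics.
struct SimResult{P,T,M}
    params::P
    reps::Vector{RepResult{T}}
    metrics::M
end

# Stand-in for a real simulation replicate (made-up numbers).
run_rep(p, j) = RepResult((Team = ["A", "B"], Pts = [3 * j, p]))

function run_all(param_values, nrep)
    map(param_values) do p
        reps = [run_rep(p, j) for j in 1:nrep]
        total = sum(sum(r.lgtble.Pts) for r in reps)
        metrics = (mean_pts = total / (2 * nrep),)
        SimResult((p = p,), reps, metrics)
    end
end

allres = run_all([1, 2, 3], 4)
allres[2].reps[3].lgtble.Pts   # named access at every level of nesting
```

No undef pre-allocation is needed here: map and the comprehension build the vectors with concrete element types as the results come in.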

You should understand that you are NOT pre-allocating the DataFrames. You can only pre-allocate variables with a fixed size. What is stored in this type are only pointers to the DataFrames. The DataFrames themselves get their memory from the heap when you create them with concrete data. In other words, you cannot pre-allocate DataFrames.
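
A minimal illustration of that point, using plain vectors in place of DataFrames (the variable names are arbitrary):

```julia
n = 3

# This allocates n slots for references only; no inner vector exists yet.
slots = Vector{Vector{Float64}}(undef, n)
assigned_before = isassigned(slots, 1)   # false

for i in 1:n
    # The memory for each inner vector is allocated here, on assignment.
    slots[i] = zeros(1000)
end
assigned_after = isassigned(slots, 1)    # true
```

The same holds for a Vector{DataFrame}(undef, n): the outer vector is created up front, but each DataFrame gets its memory only when it is built.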

So after realizing I failed to even ask a good question, I started trying to make a MWE for this.

But it ended up as more of an intermediate working example (IWE) of ~150 lines of utility/helper functions and about 50 lines in the main loop.

Basically I first used Julia about 5 weeks ago, and I have seen similar questions on here that elicited similar responses ("I'm not sure what you are trying to do"), so it may be helpful to others to have a more complex example of the Julian way to store the results of these kinds of nested simulations.

I got hesitant to post it since it ended up so long, but if anyone is interested/willing to critique it, let me know. I would want to know everything I am doing wrong.

If you want to stick to your own construction, rather than something like DrWatson or MLflow, you might be better off starting by storing the data frames in an embedded database like SQLite or DuckDB, and using a primary key to join the tables by the id of the simulations. It would push the storage out to disk, and you indicated that reducing memory was a concern.

Thanks, but once again your response is due to a bad presentation of the question, apparently. What I am concerned with would all be contained within this:

MLflow.start_run()

So unfortunately that doesn't look like what I am looking for.