Should I use DataFrames.jl or NamedArrays for a long and wide array in scientific computing?

For the first task, if each column is some summary statistic that you want to be able to refer to by name, a DataFrame is the natural choice.
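
As a rough illustration of what referring to a statistic by name looks like, here is a minimal sketch. The column names and values are made up for the example and are not from the original question:

```julia
using DataFrames, Statistics

# Hypothetical table: one row per simulation, one named column per summary statistic.
df = DataFrame(mean_height = [1.0, 2.0, 3.0],
               max_height  = [2.0, 4.0, 6.0])

# Columns can be pulled out by name, either as a property or by symbol.
col = df.mean_height
same_col = df[!, :mean_height]

# Any ordinary function can then be applied to the named column.
avg_max = mean(df.max_height)
```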

For the second one, an Array (either a basic one or a NamedArray) will work. If you haven’t found them yet, most of the basic functions in the Statistics standard library accept a keyword argument called dims that lets you apply them along whichever dimensions of an array you want. The reduced dimensions are kept with size 1:

using Statistics
A = randn(2,3,4)
mean(A, dims=1) # 1x3x4 array
mean(A, dims=(1, 3)) # 1x3x1 array
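
If the leftover size-1 dimensions get in the way, dropdims from Base removes them. A small sketch using the same array shape as above (the variable names are only for illustration):

```julia
using Statistics

A = randn(2, 3, 4)

# Reduce over dimensions 1 and 3; both are kept as size-1 dimensions.
m = mean(A, dims=(1, 3))        # size (1, 3, 1)

# Drop the singleton dimensions to get a plain length-3 vector.
v = dropdims(m, dims=(1, 3))    # size (3,)

# The same keyword works for other reductions, such as std.
s = std(A, dims=2)              # size (2, 1, 4)
```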

If you want to apply your own function, you can use mapslices:

mapslices(x -> 2.5mean(x)+1, A, dims=(1,3))
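
To make it clearer what that call computes, here is a sketch that compares it with an explicit loop over the dimension that is not reduced. The helper name f is just for the example:

```julia
using Statistics

A = randn(2, 3, 4)
f(x) = 2.5mean(x) + 1

# mapslices applies f to each slice spanning dimensions 1 and 3,
# i.e. once for every index along dimension 2.
B = mapslices(f, A, dims=(1, 3))    # size (1, 3, 1)

# The same computation written out by hand.
manual = [f(A[:, j, :]) for j in 1:size(A, 2)]
```

Here vec(B) and manual hold the same three values, so mapslices is mainly a convenience over writing the loop yourself.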

There are actually ways to do either task with an Array or DataFrame. For instance, you could have a “long format” DataFrame with one column for the simulation number, another column for the timestep within that simulation, and then 100 columns with the summary statistics. Then you could calculate aggregate stats using the split-apply-combine functionality in DataFrames and DataFramesMeta, e.g.:

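using DataFrames, DataFramesMeta, Statistics
df = DataFrame(
    sim = [1, 1, 2, 2, 3, 3],
    timestep = [1, 2, 1, 2, 1, 2],
    stat1 = randn(6),
    stat2 = randn(6))
# Group by simulation and average each statistic within the group.
# Recent DataFramesMeta versions expect the new column names as symbols.
@by(df, :sim, :summary1 = mean(:stat1), :summary2 = mean(:stat2))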

This will be a bit slower than mapslices, but may be clearer to read, depending on your preference and what kinds of summaries and transformations you’re doing.
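
With many statistic columns, listing each one by hand gets tedious. One possible approach, sketched here with plain DataFrames and assuming every column other than sim and timestep is a statistic, is to broadcast the same function over all of them:

```julia
using DataFrames, Statistics

df = DataFrame(
    sim = [1, 1, 2, 2, 3, 3],
    timestep = [1, 2, 1, 2, 1, 2],
    stat1 = randn(6),
    stat2 = randn(6))

# Every column except the identifiers is treated as a summary statistic.
stat_cols = names(df, Not([:sim, :timestep]))

# Apply mean to each statistic column within each simulation.
# Output columns get automatic names such as stat1_mean.
out = combine(groupby(df, :sim), stat_cols .=> mean)
```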
