Should I use either Dataframes.jl or Named Array for a long and wide array for sci computing

ElOceanografo · July 25, 2019, 8:10pm

For the first task, if each column is some summary statistic that you want to be able to refer to by name, a DataFrame is the natural choice.

For the second one, an Array (either a basic one or a NamedArray) will work. If you haven’t found them yet, most of the basic functions in Statistics accept an optional argument called dims that lets you apply them along whatever dimensions of an array you want:

using Statistics
A = randn(2,3,4)
mean(A, dims=1) # 1x3x4 array
mean(A, dims=(1, 3)) # 1x3x1 array

If you want to apply your own function, you can use mapslices:

mapslices(x -> 2.5mean(x)+1, A, dims=(1,3))

There are actually ways to do either task with an Array or DataFrame. For instance, you could have a “long format” DataFrame with one column for the simulation number, another column for the timestep within that simulation, and then 100 columns with the summary statistics. Then you could calculate aggregate stats using the split-apply-combine functionality in DataFrames and DataFramesMeta, e.g.:

using DataFrames, DataFramesMeta
df = DataFrame(
    sim = [1, 1, 2, 2, 3, 3], 
    timestep = [1, 2, 1, 2, 1, 2],
    stat1 = randn(6),
    stat2 = randn(6))
@by(df, :sim, summary1 = mean(:stat1), summary2=mean(:stat2))

This will be a bit slower than mapslices, but may be clearer to read, depending on your preference and what kinds of summaries and transformations you’re doing.

Topic		Replies	Views
TimeSeries Arrays in Julialang General Usage statistics	2	799	May 30, 2020
What's the best way to work with millions of rows of data? Performance	7	2058	February 24, 2020
Performance: Fast way to access numbers in Dataframes or alternatives Performance dataframes , data_structures	12	1144	November 15, 2022
Performance of NamedArrays vs Dictionaries of Tuples General Usage	2	1839	August 28, 2020
DataFrames.jl - Choosing between the core functions and available libraries (Query.jl, DataFramesMeta.jl, etc) Data	10	2052	September 15, 2018

Should I use either Dataframes.jl or Named Array for a long and wide array for sci computing

Related topics