Should I use either Dataframes.jl or Named Array for a long and wide array for sci computing

00krishna · July 25, 2019, 5:10pm

I am writing a simulation that captures a lot of statistics from a mathematical model. Hence I have an array that is long (about 100,000 rows) but also very wide (about 100 columns). I am actually moving the simulation from Python to Julia because of slow performance in Python.

I am still new to Julia, so I don’t know the ins and outs of the different array libraries. One thing is that I would like the ability to index or slice the array by column name, because with so many columns it is easy to make errors with column indexes.

The other thing that I have to do is run multiple simulations which I stack into a volume, so I usually have arrays on the order of 100,000 x 100 x 1000, where the final dimension is number of simulations. The final step is computing summaries over the volume, such as means and standard deviations.

I saw that both NamedArrays and DataFrames support named indexing. I was just trying to figure out what the performance differences and potential issues might be from choosing one package versus the other. DataFrames.jl seems to be popular and has a very familiar named indexing interface. But I was not sure if that library was meant for lots of “writes” to the array. NamedArrays also seems good, but I was not sure how actively developed this project is, since I have not seen many posts on the Discourse forum about it lately.

I also could not find any info on performance comparisons between NamedArrays and DataFrames, especially in the context of wide arrays. I did find a post about performance of DataFrames.jl, but it seems like the DataFramesMeta.jl package might have improved performance on data frames compared to before.

Since the reason I am migrating from Python to Julia was due to slow performance, I was hoping someone could set me on the right track in terms of efficiency in Julia for my data structure. Thanks.

ElOceanografo · July 25, 2019, 8:10pm

For the first task, if each column is some summary statistic that you want to be able to refer to by name, a DataFrame is the natural choice.

For the second one, an Array (either a basic one or a NamedArray) will work. If you haven’t found them yet, most of the basic functions in Statistics accept an optional argument called dims that lets you apply them along whatever dimensions of an array you want:

using Statistics
A = randn(2,3,4)
mean(A, dims=1) # 1x3x4 array
mean(A, dims=(1, 3)) # 1x3x1 array

If you want to apply your own function, you can use mapslices:

mapslices(x -> 2.5mean(x)+1, A, dims=(1,3))

There are actually ways to do either task with an Array or DataFrame. For instance, you could have a “long format” DataFrame with one column for the simulation number, another column for the timestep within that simulation, and then 100 columns with the summary statistics. Then you could calculate aggregate stats using the split-apply-combine functionality in DataFrames and DataFramesMeta, e.g.:

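using DataFrames, DataFramesMeta
using Statistics  # provides mean
df = DataFrame(
    sim = [1, 1, 2, 2, 3, 3],
    timestep = [1, 2, 1, 2, 1, 2],
    stat1 = randn(6),
    stat2 = randn(6))
# Group by :sim and average each statistic within a simulation.
# Recent DataFramesMeta releases expect Symbols on the left-hand side;
# older releases (as of this thread) used bare names: summary1 = mean(:stat1).
@by(df, :sim, :summary1 = mean(:stat1), :summary2 = mean(:stat2))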

This will be a bit slower than mapslices, but may be clearer to read, depending on your preference and what kinds of summaries and transformations you’re doing.

bkamins · July 25, 2019, 8:47pm

On top of the very nice comment by @ElOceanografo, I think it is worth discussing which part of the process you are describing is slow. The beauty of Julia and its ecosystem is that there are efficient solutions for different use cases (it is also part of the pain, as there is “no one best solution for all cases”, like in databases, where columnar and row stores are optimized for different use cases).

The key questions I would like to ask before giving a recommendation are:

what is the slow part in your case: collection or later analysis of the data?
the size of the data you are describing is ~80 GB, so do you have enough RAM to store it, or do you need to do the computations out-of-core?
are the columns homogeneous in type or heterogeneous?
finally, is the aggregation long (i.e. averages of variables within one column) or wide (aggregation across variables)? This will influence the answer, as each needs a different memory layout.
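
For reference, the ~80 GB figure follows from the dimensions given in the first post. A quick sketch of the arithmetic, assuming every entry is stored as a Float64 (8 bytes); the variable names are only illustrative:

```julia
# Dimensions from the original post: rows x columns x simulations.
nrows, ncols, nsims = 100_000, 100, 1_000

# Assumption: every element is a Float64, i.e. 8 bytes.
bytes_total = nrows * ncols * nsims * sizeof(Float64)

gb  = bytes_total / 10^9    # decimal gigabytes
gib = bytes_total / 2^30    # binary gibibytes

println("total: ", gb, " GB (", round(gib, digits=1), " GiB)")
```

Storing the volume as Float32 instead would halve this, if the reduced precision is acceptable for the statistics being collected.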

00krishna · July 25, 2019, 9:27pm

@ElOceanografo Man, this is so great. What a huge amount of information, thanks so much. Yeah, this helps a lot. So I can essentially try either approach, which is nice. I was worried that the Dataframe package might be more of a decorator around an array which was useful for things like plotting, but not necessarily for the actual number crunching. But seems like it should work. I can try and do some simple benchmarking to see how the two compare.

@bkamins Very good question. So I was using Python for the initial version of the model and that ran pretty fast, but I was using NumPy structured arrays and they were really cumbersome to code and maintain. So I rewrote the data crunching using xarray and that proved to be really slow. I think the bottleneck was likely in the way that I was adding each simulation to the xarray volume. Not sure why, but that seemed to be particularly slow in my basic profiling analysis.

Otherwise, the data is homogeneous. The data is all floating point data. I mean there is some count information which would be integer, but since I am averaging that data it works out to be floating point anyway. There are a few columns where I add a timestamp for the date the model was run, as well as a few text columns for notes and the model version–but I can add those in post processing.

Luckily the volume of data has not overwhelmed my RAM as yet. I have had to keep the number of model runs short of my target because it was just taking too long to run. If I need to, I m

I figure just by migrating to Julia I should get at least some speedup, since Julia compiles to native code via LLVM whereas Python runs bytecode. But I just wanted to make sure I was using the right libraries for storing and processing the data.

I really appreciate both of you chiming in. It really helps to hear the voice of experience on stuff like this, where it is hard to know the limitations of a library before you get deep into it.

bkamins · July 25, 2019, 9:46pm

If your data is homogeneous you can also have a look at the AxisArrays.jl package (JuliaArrays/AxisArrays.jl on GitHub: performant arrays where each dimension can have a named axis with values). And then make sure that you aggregate across columns (Julia uses column-major storage order, as opposed to Python, which uses row-major storage order).
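
To illustrate the storage-order point, here is a minimal sketch using only Base Julia (not a benchmark, and the matrix is just a toy example):

```julia
# A small matrix to show Julia's column-major layout.
A = [1 2 3;
     4 5 6]

# vec(A) exposes the underlying memory order: column 1, then column 2, ...
memory_order = vec(A)

# Reducing along dims=1 (down each column) reads contiguous memory.
col_sums = sum(A, dims=1)

# Reducing along dims=2 (across each row) strides through memory.
row_sums = sum(A, dims=2)

println(memory_order)
println(col_sums)
println(row_sums)
```

The same logic extends to the 3-D volume: with the simulation index as the last dimension, each single-simulation slab `V[:, :, s]` occupies one contiguous block of memory.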

00krishna · July 25, 2019, 10:12pm

@bkamins Oh interesting, I had not seen AxisArrays before. I will definitely take a look at that library.

ElOceanografo · July 25, 2019, 11:33pm

Well, technically DataFrames are just a decorator around a collection of 1-D arrays… you can check out their definition for yourself! But there’s been a ton of work put into methods for slicing, dicing, iterating, and aggregating them in fast and convenient ways.

You can try both approaches and see what works better for you, but as long as you’re preallocating the Array/DataFrame, writing simulation results into either will be fast. Creating and filling a DataFrame has a little bit more overhead than a plain Array, but any difference is probably negligible next to running the actual model.
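
As a rough sketch of the preallocate-and-fill approach with a plain Array: the column names, the sizes, and the `run_simulation` function below are all hypothetical stand-ins, and a NamedTuple is used here as one simple way to get name-based column lookup without any extra package.

```julia
using Statistics

# Hypothetical column names; a NamedTuple maps each name to its column index.
cols = (price = 1, volume = 2, n_events = 3)

# Small illustrative sizes; the real problem is far larger.
nsteps, nstats, nsims = 50, length(cols), 20

# Stand-in for the real model: returns an nsteps x nstats matrix (assumption).
run_simulation(nsteps, nstats) = rand(nsteps, nstats)

# Preallocate the whole volume once, then write each run into its own slab.
results = Array{Float64}(undef, nsteps, nstats, nsims)
for s in 1:nsims
    results[:, :, s] = run_simulation(nsteps, nstats)
end

# Summaries across simulations (the third dimension).
mean_over_sims = dropdims(mean(results, dims=3), dims=3)  # nsteps x nstats
std_over_sims  = dropdims(std(results, dims=3), dims=3)   # nsteps x nstats

# Look a statistic up by name instead of by a bare integer index.
volume_mean = mean_over_sims[:, cols.volume]
```

Writing into a preallocated volume avoids repeatedly growing or concatenating the container as simulations finish; NamedArrays or AxisArrays would give the same kind of name-based lookup with a richer interface.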

bkamins · July 25, 2019, 11:41pm

Right. The benefit of a DataFrame is that you can push! consecutive rows into it. It has a small overhead, but it is convenient IMO and most probably, as you have noted, the cost of the core computations will be orders of magnitude higher.

Just to show to @00krishna what I mean:

julia> using DataFrames

julia> df = DataFrame()
0×0 DataFrame


julia> for i in 1:10
       push!(df, (runid=i, a=rand(), b=rand()))
       end

julia> df
10×3 DataFrame
│ Row │ runid │ a        │ b        │
│     │ Int64 │ Float64  │ Float64  │
├─────┼───────┼──────────┼──────────┤
│ 1   │ 1     │ 0.829414 │ 0.82911  │
│ 2   │ 2     │ 0.554896 │ 0.276062 │
│ 3   │ 3     │ 0.40091  │ 0.478588 │
│ 4   │ 4     │ 0.651059 │ 0.90763  │
│ 5   │ 5     │ 0.677377 │ 0.833082 │
│ 6   │ 6     │ 0.673965 │ 0.338277 │
│ 7   │ 7     │ 0.863652 │ 0.392971 │
│ 8   │ 8     │ 0.63527  │ 0.38427  │
│ 9   │ 9     │ 0.955796 │ 0.427927 │
│ 10  │ 10    │ 0.224568 │ 0.839056 │
