Can DataFrames.jl handle multi-dimensional arrays?

I was wondering if DataFrames.jl can handle multi-dimensional arrays, meaning an array with 3 or more dimensions.

I can run

using DataFrames
DataFrame(rand(3,4))

and that works fine.

But if I try:

DataFrame(rand(3,4,5))

Then I get a rather cryptic error message:
ERROR: ArgumentError: 'Array{Float64,3}' iterates 'Float64' values, which don't satisfy the Tables.jl Row-iterator interface

Does anyone know if there is a plan to add multi-dimensional support to the package? I checked the issue list on GitHub but did not see any issues that explicitly mentioned this, although I did not dig too deeply. So I thought I would ask. Thanks.


Currently DataFrames.jl supports only data that can be represented as a two-dimensional object (i.e. a list of vectors). There are currently no plans to change this.

However, note that you can easily store any data type in a cell of a DataFrame, so e.g. you can write the following:

julia> DataFrame([rand(2) for _ in 1:4, _ in 1:3])
4×3 DataFrame
│ Row │ x1                    │ x2                    │ x3                     │
│     │ Array{Float64,1}      │ Array{Float64,1}      │ Array{Float64,1}       │
├─────┼───────────────────────┼───────────────────────┼────────────────────────┤
│ 1   │ [0.382557, 0.0063816] │ [0.367795, 0.301294]  │ [0.083309, 0.465583]   │
│ 2   │ [0.424023, 0.244837]  │ [0.233543, 0.834364]  │ [0.00453236, 0.548186] │
│ 3   │ [0.392719, 0.628895]  │ [0.979969, 0.534259]  │ [0.588646, 0.825887]   │
│ 4   │ [0.980133, 0.495353]  │ [0.291205, 0.0895148] │ [0.535076, 0.982956]   │

where effectively you have a third dimension nested as a cell value of a two-dimensional DataFrame.
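
A minimal sketch of how such nested cells can be read back. It assumes a recent DataFrames.jl release (1.0 or later), where building a DataFrame from a matrix needs a second argument such as :auto to generate the column names x1, x2, ...; the variable names are only illustrative.

```julia
using DataFrames

# A 4×3 matrix whose entries are themselves length-2 vectors.
cells = [rand(2) for _ in 1:4, _ in 1:3]

# :auto generates the column names x1, x2, x3 (assumes DataFrames.jl 1.0+).
df = DataFrame(cells, :auto)

# Row 2 of column x3 holds a whole vector ...
v = df[2, :x3]

# ... and one more index reaches into the nested "third dimension".
third = df[2, :x3][1]
```

Indexing is therefore two-step: first the DataFrame cell, then the array stored inside it.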


or

X = rand(3,4,5)
df = DataFrame(X=collect(eachslice(X, dims=3)))
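
As a rough illustration of what the slice-per-row layout gives you (a sketch, not the only way to do it), each row can be read back as a matrix and the slices can be concatenated again to recover the original array. The names slice2 and Y are made up for the example.

```julia
using DataFrames

X = rand(3, 4, 5)

# eachslice yields views into X; collect gathers them into one column of 5 matrices.
df = DataFrame(X = collect(eachslice(X, dims = 3)))

# Each of the 5 rows holds one 3×4 slice of X.
slice2 = df.X[2]

# Concatenating the slices along dimension 3 rebuilds the original array.
Y = cat(df.X...; dims = 3)
```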

You may also be looking for things like AxisArrays, which have some similarities but allow any number of dimensions.

See also NamedDims and a small zoo of other recent packages discussed here.


I work a lot with panel data, for which a "3D" representation often seems natural at first. In this context I always find it interesting that Python's pandas is named after panel data, and indeed used to have a Panel object in addition to the standard DataFrame, but this has been deprecated for a while now in favour of multi-indices on a regular DataFrame: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Panel.html

@nilshg this multi-indexing idea is interesting, but seems rather confusing. I get the basic idea: instead of having a 3-D volume, you flatten the 3-D data into a 2-D array and use a multi-index to recover or iterate over the 3-D structure. Say I run a model for 100 runs, and each run has a duration of 10 years. Then I would normally have a volume of [year, parameters, run], so each run would have 10 rows by n columns, and the depth dimension would reference the run number. But I could convert this to a multi-index where I have a 2-D array with a "year" and a "run" column, so (run, year) would be the multi-index.
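
The flattening described above could look roughly like this in DataFrames.jl. It is a sketch under assumptions: the sizes, the parameter column names p1 to p3, and the variable names are made up for illustration, and it relies on Julia's column-major layout so that year varies fastest.

```julia
using DataFrames

n_years, n_params, n_runs = 10, 3, 100
X = rand(n_years, n_params, n_runs)   # hypothetical [year, parameter, run] volume

# Key columns: year cycles fastest, run changes every n_years rows.
long = DataFrame(run  = repeat(1:n_runs, inner = n_years),
                 year = repeat(1:n_years, outer = n_runs))

# X[:, p, :] is a year×run matrix; vec flattens it column by column,
# which matches the row order of the key columns above.
for p in 1:n_params
    long[!, "p$p"] = vec(X[:, p, :])
end

# Look up one cell of the original volume through its (run, year) key.
row = long[(long.run .== 7) .& (long.year .== 3), :]
```

Here the (run, year) pair plays the role of the multi-index, but both are ordinary columns that can be filtered, sorted, or grouped like any other.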

But is there a good explanation of how to use multi-indices in a Julia dataframe? The Python pandas documentation was never very clear about this, as far as I remember. I also kept running into implicit conversions that turned things into multi-indices, and then I had to get them back out of the multi-index. So if there is a good explanation of this, please pass along the link. I can check out the Julia dataframe docs in the meantime.

But is there a good explanation of how to use multi-indices in a Julia dataframe?

Julia DataFrames don't have indexes. You can call by and groupby on any of the columns and get the same effect, which actually makes things much clearer than in pandas, though potentially slower if you do the same grouping again and again.
(Multi-indexes used to be super complex because indexes are like columns, but not quite.)
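
A small sketch of the grouping approach, assuming a recent DataFrames.jl release, where by has been deprecated in favour of combine(groupby(...)). The table and column names are invented for the example. Keeping the grouped object around is one way to avoid repeating the same grouping.

```julia
using DataFrames, Statistics

# Toy long-format table: 3 runs of 4 years each, values 1.0 through 12.0.
df = DataFrame(run   = repeat(1:3, inner = 4),
               year  = repeat(1:4, outer = 3),
               value = Float64.(1:12))

# Per-run summary, the modern replacement for by(df, :run, ...).
by_run = combine(groupby(df, :run), :value => mean => :mean_value)

# Group on both key columns once and reuse the result for lookups.
g = groupby(df, [:run, :year])
cell = g[(run = 2, year = 3)]   # SubDataFrame holding the matching row
```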