DataFrame to multidimensional array

Hey,

Suppose I have a DataFrame that corresponds to a multidimensional function. Something like this:
df = DataFrame([(a = x, b = y, c = z, d = x+y+z) for x in 1:6 for y in 1:4 for z in 1:2])

What is the best way to transform it into the following multidimensional array?
[x+y+z for x in 1:6, y in 1:4 , z in 1:2]

Currently I’m doing it with groupby, but it gets complicated with higher dimensions (I’m a Matlab user, so I’m used to working with multidimensional arrays).

Thank you

Are x, y, z integers?

I assume you know the sizes of x, y and z? Inferring those from the vectors first would be a more annoying step.

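using DataFrames

df = DataFrame([(a = x, b = y, c = z, d = x+y+z) for x in 1:6 for y in 1:4 for z in 1:2])

# df.d is stored with z varying fastest, so reshape to the (z, y, x) sizes and then reverse the dimension order
arr = permutedims(reshape(copy(df.d), (2, 4, 6)), (3, 2, 1))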

This gives:

6×4×2 Array{Int64, 3}:
[:, :, 1] =
 3  4   5   6
 4  5   6   7
 5  6   7   8
 6  7   8   9
 7  8   9  10
 8  9  10  11

[:, :, 2] =
 4   5   6   7
 5   6   7   8
 6   7   8   9
 7   8   9  10
 8   9  10  11
 9  10  11  12

julia> arr == [x+y+z for x in 1:6, y in 1:4 , z in 1:2]
true

Note that the nested form for x in 1:6 for y in 1:4 for z in 1:2 makes z vary fastest, whereas the comma form for x in 1:6, y in 1:4, z in 1:2 makes x vary fastest. The two orders of dimensions are exactly opposite, which is why the permutedims is needed.
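
To see the ordering difference in isolation, here is a small sketch in plain Julia (no packages) on a two-dimensional toy grid. The names nested, grid and restored are only illustrative and do not come from the posts above.

```julia
# Nested `for` clauses give a flat vector in which the LAST variable (y) varies fastest
nested = [(x, y) for x in 1:2 for y in 1:3]

# Comma-separated ranges give a 2×3 matrix; Julia arrays are column-major,
# so the FIRST variable (x) varies fastest in memory
grid = [(x, y) for x in 1:2, y in 1:3]

# Reshape with the sizes reversed, then swap the two dimensions back
restored = permutedims(reshape(nested, (3, 2)), (2, 1))

println(restored == grid)   # the two layouts should agree after the permutation
```

The same idea extends to three or more dimensions: reshape with the sizes in reverse order, then permute with the reversed dimension indices.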


Here is an alternative using TensorCast.jl:

using TensorCast
@cast v[i,j,k] := copy(df.d)[k⊗j⊗i] (i ∈ 1:6, j ∈ 1:4, k ∈ 1:2)

Depending on how you obtain the data in the first place, you might be able to refactor that process to return a multidimensional array instead of a DataFrame. Arrays are indeed easy and convenient to use in Julia, and they are more general.

But for this particular operation, there’s a nice table → multidimensional array conversion function in AxisKeys.jl:

julia> using AxisKeys

julia> wrapdims(df, :d, :a, :b, :c)
3-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   a ∈ 6-element Vector{Int64}
→   b ∈ 4-element Vector{Int64}
◪   c ∈ 2-element Vector{Int64}
And data, 6×4×2 Array{Int64, 3}:
[:, :, 1] ~ (:, :, 1):
      (1)  (2)  (3)  (4)
 (1)    3    4    5    6
 (2)    4    5    6    7
 (3)    5    6    7    8
 (4)    6    7    8    9
 (5)    7    8    9   10
 (6)    8    9   10   11

[:, :, 2] ~ (:, :, 2):
      (1)  (2)  (3)  (4)
 (1)    4    5    6    7
 (2)    5    6    7    8
 (3)    6    7    8    9
 (4)    7    8    9   10
 (5)    8    9   10   11
 (6)    9   10   11   12

It would even work with non-consecutive or non-numeric x, y, z values.
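
For readers who want to see what such a key-based conversion involves without an extra package, here is a rough sketch using only DataFrames.jl. It is not how AxisKeys implements wrapdims. The helper name fill_array and the small table are made up for illustration, and the sketch assumes each combination of key values appears at most once.

```julia
using DataFrames

# Made-up table with non-consecutive (a) and non-numeric (b) keys
df2 = DataFrame([(a = x, b = y, d = 10x + length(y)) for x in (1, 5, 9) for y in ("lo", "high")])

# Hypothetical helper: place each value at the position given by its key values
function fill_array(df, value::Symbol, keycols::Vector{Symbol})
    # Sorted distinct values of every key column define the axes
    levels = [sort(unique(df[!, k])) for k in keycols]
    # Map each key value to its position along its axis
    lookup = [Dict(v => i for (i, v) in enumerate(l)) for l in levels]
    T = Union{Missing, eltype(df[!, value])}
    out = Array{T}(missing, length.(levels)...)   # combinations not present stay missing
    for row in eachrow(df)
        idx = [lookup[n][row[k]] for (n, k) in enumerate(keycols)]
        out[idx...] = row[value]
    end
    return out, levels
end

out, levels = fill_array(df2, :d, [:a, :b])
```

Here levels holds the key values along each axis, so out[i, j] corresponds to a == levels[1][i] and b == levels[2][j]. Unlike the reshape approach, this does not depend on the row order of the table.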


The performance of wrapdims() on this specific example seems to be way subpar.

Can this be combined with DataFrames groupby or @by to wrap a subset of the dimensions, producing a DataFrame where one of the columns is a KeyedArray?

The DataFrames.jl developers are working on nest and unnest, which would make this easier, but they aren’t released yet.
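
In the meantime, one possible workaround is sketched below. It uses the reshape approach from earlier in the thread rather than wrapdims, so each cell holds a plain Matrix, not a KeyedArray. It assumes the sizes of b and c (4 and 2) are known and that, within each group, rows are ordered with c varying fastest. The column name d_bc is made up for illustration.

```julia
using DataFrames

df = DataFrame([(a = x, b = y, c = z, d = x + y + z) for x in 1:6 for y in 1:4 for z in 1:2])

# One row per value of a; the remaining dimensions (b, c) are folded into a 4×2 matrix.
# Wrapping the result in Ref tells combine to store the matrix as a single cell.
nested = combine(groupby(df, :a),
                 :d => (d -> Ref(permutedims(reshape(d, (2, 4))))) => :d_bc)

println(size(nested))        # one row per group, columns a and d_bc
println(nested.d_bc[1])      # the b × c slice for the first value of a
```

Replacing the reshape call with a key-based conversion would, in principle, give the same layout without relying on row order, but that combination is not tested here.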