Convert Array to DataFrame

Dear Friends,

I would like to convert a multi-dimensional array to a DataFrame. For example:

a = [0 1;2 3]

Expected result:


4×3 DataFrame
 Row │ index1  index2  value
     │ Int64   Int64   Int64
─────┼───────────────────────
   1 │      1       1      0
   2 │      1       2      1
   3 │      2       1      2
   4 │      2       2      3

Thus each dimension of the initial array forms a column in the DataFrame and the last column is always the value column. For 2-dimensional arrays I can get this format in the following way:

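using DataFrames
temp = DataFrame(a, :auto)
temp.rowind = collect(1:size(a, 1))  # one row index per row, not a vector holding a single range
temp = stack(temp, 1:2)              # stack the two value columns, keeping rowind as the id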

Maybe this?

julia> a = [0 1; 2 3];

julia> df = DataFrame(Tables.table(a))
2×2 DataFrame
 Row │ Column1  Column2
     │ Int64    Int64
─────┼──────────────────
   1 │       0        1
   2 │       2        3

Won’t work with more than 2 dimensions though.

This produces the expected result:

using DataFrames
a = reshape(1:24, 2, 3, 4)
it = Iterators.product(axes(a)...)
df = rename!(DataFrame(it), "index".*string.(1:ndims(a)))
df.value = vec(a)
sort!(df)
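
The assignment df.value = vec(a) lines up with the index columns because Iterators.product over axes(a) visits the indices in the same column-major order that vec(a) uses. A minimal sketch that checks this with Base Julia only (the names idxs and vals are illustrative, not from the post above):

```julia
a = reshape(1:24, 2, 3, 4)

# All index tuples (i, j, k), flattened in iteration order.
idxs = vec(collect(Iterators.product(axes(a)...)))

# The array values, flattened in column-major order.
vals = vec(a)

# Every index tuple points at the value stored in the same position.
@assert all(a[idx...] == v for (idx, v) in zip(idxs, vals))

# The first index varies fastest.
@assert idxs[1:3] == [(1, 1, 1), (2, 1, 1), (1, 2, 1)]
```

The final sort!(df) then reorders the rows so that index1 varies slowest, as in the expected output in the question.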
Another option, using CartesianIndices:

using DataFrames
A=reshape(1:24, 2,3,4)
idx=[Symbol("index$i") for i in 1:ndims(A)]
nts=[(;zip(idx,Tuple(t))...) for t in CartesianIndices(A)]
df=DataFrame(nts)
df.values=A[:]

df

Or, more succinctly:

DataFrame([(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A),A)])

Thanks. What is the meaning of

(;zip(...)...)

?

With a 2-dimensional array, this produces the error:

ERROR: ArgumentError: DataFrame constructor from a Matrix requires passing :auto as a second argument to automatically generate column names: DataFrame(matrix, :auto)

Try it this way (add [:] after CartesianIndices(A)):

DataFrame([(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A)[:],A)])

This is one of the syntaxes for constructing a NamedTuple, that is, a structure that associates n symbols with n values.

In this case it builds a vector of NamedTuples that the DataFrame constructor can use to give you the DataFrame you want.
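
To make the syntax concrete, here is a small sketch in Base Julia. The names colnames and vals and the sample values are made up for illustration; only the (; zip(...)...) pattern comes from the posts above.

```julia
# Column names and the values to pair them with (illustrative data).
colnames = [:index1, :index2, :values]
vals = (1, 2, 10)

# zip yields (name, value) pairs; splatting them after the `;`
# turns each pair into one field of the NamedTuple.
nt = (; zip(colnames, vals)...)

# The same NamedTuple written out by hand.
@assert nt == (index1 = 1, index2 = 2, values = 10)
@assert nt.index2 == 2
```

A vector of such NamedTuples, all with the same field names, is what the comprehension in the earlier post hands to the DataFrame constructor, one NamedTuple per row.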

Others, more knowledgeable than me, could illustrate the potential of this syntax and comment on its use in cases like this.

It still remains to be clarified why, in the 3-dimensional case, the array of Cartesian indices works without the need to vectorize, while in the 2-dimensional case it does not.

julia> using DataFrames

julia> A=reshape(1:24, 2,3,4)
2×3×4 reshape(::UnitRange{Int64}, 2, 3, 4) with eltype Int64:
[:, :, 1] =
 1  3  5
 2  4  6

[:, :, 2] =
 7   9  11
 8  10  12

[:, :, 3] =
 13  15  17
 14  16  18

[:, :, 4] =
 19  21  23
 20  22  24

julia> idx=[Symbol("index$i") for i in 1:ndims(A)]
3-element Vector{Symbol}:
 :index1
 :index2
 :index3

julia> DataFrame([(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A),A)])
24×4 DataFrame
 Row │ index1  index2  index3  values 
     │ Int64   Int64   Int64   Int64  
─────┼────────────────────────────────
   1 │      1       1       1       1
   2 │      2       1       1       2
   3 │      1       2       1       3
   4 │      2       2       1       4
   5 │      1       3       1       5
   6 │      2       3       1       6
   7 │      1       1       2       7
   8 │      2       1       2       8
   9 │      1       2       2       9
  10 │      2       2       2      10
  11 │      1       3       2      11
  12 │      2       3       2      12
  13 │      1       1       3      13
  14 │      2       1       3      14
  15 │      1       2       3      15
  16 │      2       2       3      16
  17 │      1       3       3      17
  18 │      2       3       3      18
  19 │      1       1       4      19
  20 │      2       1       4      20
  21 │      1       2       4      21
  22 │      2       2       4      22
  23 │      1       3       4      23
  24 │      2       3       4      24

julia> A=reshape(1:12, 3,4)
3×4 reshape(::UnitRange{Int64}, 3, 4) with eltype Int64:
 1  4  7  10
 2  5  8  11
 3  6  9  12

julia> idx=[Symbol("index$i") for i in 1:ndims(A)]
2-element Vector{Symbol}:
 :index1
 :index2

julia> DataFrame([(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A),A)])
ERROR: ArgumentError: `DataFrame` constructor from a `Matrix` requires passing :auto as a second argument to automatically generate column names: `DataFrame(matrix, :auto)`
Stacktrace:
 [1] DataFrame(matrix::Matrix{NamedTuple{(:index1, :index2, :values), Tuple{Int64, Int64, Int64}}})
   @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:381
 [2] top-level scope
   @ c:\Users\sprmn\.julia\v1.8\dataframes24.jl:185

julia> DataFrame([(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A)[:],A)])
12×3 DataFrame
 Row │ index1  index2  values 
     │ Int64   Int64   Int64
─────┼────────────────────────
   1 │      1       1       1
   2 │      2       1       2
   3 │      3       1       3
   4 │      1       2       4
   5 │      2       2       5
   6 │      3       2       6
   7 │      1       3       7
   8 │      2       3       8
   9 │      3       3       9
  10 │      1       4      10
  11 │      2       4      11
  12 │      3       4      12

Another version of this method fits on one line:

(df = DataFrame(collect.(zip(Tuple.(keys(A))...)),:auto)).val = vec(A);

Here is an example:

julia> A=reshape(1:12, 3,4)
3×4 reshape(::UnitRange{Int64}, 3, 4) with eltype Int64:
 1  4  7  10
 2  5  8  11
 3  6  9  12

julia> (df = DataFrame(collect.(zip(Tuple.(keys(A))...)),:auto)).val = vec(A);

julia> df
12×3 DataFrame
 Row │ x1     x2     val   
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      1      2
   3 │     3      1      3
   4 │     1      2      4
   5 │     2      2      5
   6 │     3      2      6
   7 │     1      3      7
   8 │     2      3      8
   9 │     3      3      9
  10 │     1      4     10
  11 │     2      4     11
  12 │     3      4     12

The column names are a bit off, but perhaps close enough.
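
For readers unpacking the one-liner, the sketch below rebuilds just the index columns with Base Julia (the name cols is illustrative). keys(A) gives the Cartesian indices of A, Tuple. turns each into a plain tuple, and zip over all of them regroups the tuples by dimension.

```julia
A = reshape(1:12, 3, 4)

# One vector per dimension: all first indices, then all second indices.
cols = collect.(zip(Tuple.(keys(A))...))

@assert length(cols) == ndims(A)
@assert cols[1] == repeat(1:3, outer = 4)  # 1, 2, 3, 1, 2, 3, ...
@assert cols[2] == repeat(1:4, inner = 3)  # 1, 1, 1, 2, 2, 2, ...
```

If the x1, x2 names matter, one option would be to pass a vector of names instead of :auto, or to call rename! afterwards as in the earlier post; this assumes the constructor accepts a vector of column vectors together with a matching vector of names.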


I’m not sure, but maybe this provides the answer:

https://www.juliabloggers.com/working-with-matrices-in-dataframes-jl-1-0/

In other words, constructing a DataFrame from a Matrix as the single argument is not allowed.

The inconsistency, in my view, concerns a different aspect.
The following expression

[(;zip([idx;:values],(i.I...,a))...) for (i,a) in zip(CartesianIndices(A),A)]

produces an Array{T, 3} in the case A = reshape(1:24, 3,2,4)

and a Matrix{T} in the case A = reshape(1:12, 3,4).
Since Matrix{T} is ultimately just Array{T, 2}, I don’t understand why the DataFrame constructor gives the expected result for ndims = 3 and not for ndims = 2.

Here T is a NamedTuple type.

For the sake of clarity, I would expect it not to work in either case without explicit vectorization.
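
A possible explanation, offered as a guess based on the stack trace above rather than a confirmed answer: the comprehension keeps the shape of A, so a 2-dimensional A yields a Matrix, and the stack trace shows DataFrames dispatching to a Matrix-specific constructor that insists on :auto. A 3-dimensional array presumably has no such dedicated method and is treated as a generic collection of rows. The shape-preserving part can be checked with Base Julia alone (A2, A3, rows2 and rows3 are illustrative names):

```julia
A2 = reshape(1:12, 3, 4)
A3 = reshape(1:24, 2, 3, 4)

# Plain tuples stand in for the NamedTuples of the thread;
# only the shape of the result matters here.
rows2 = [(i.I..., a) for (i, a) in zip(CartesianIndices(A2), A2)]
rows3 = [(i.I..., a) for (i, a) in zip(CartesianIndices(A3), A3)]

@assert rows2 isa Matrix            # 2-dimensional input gives a Matrix of rows
@assert ndims(rows3) == 3           # 3-dimensional input gives an Array{T, 3}
@assert vec(rows2) isa Vector       # flattening gives a plain Vector
@assert vec(rows2)[2] == (2, 1, 2)  # column-major order is preserved
```

Flattening with vec or [:] before calling DataFrame sidesteps the question for any number of dimensions.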