From DataFrame to multidimensional Array

Say I have a DataFrame with columns named :i, :j, :k that are positive integers, and meant jointly form a unique identifier. df also contains a column named :z that can be of any type that is meant to hold the data for indices :i, :j, :k:

df = DataFrame(rand(1:10, 3, 4))
rename!(df, [:i, :j, :k, :z])

How can I convert that to an array A such that A[i, j, k] holds the data in :z for indices :i = i, :j = j, :k = k?

In principle, I can populate the array A entry by entry with a for-loop, but thought there should be a more elegant way of doing that.

I’m not aware of such a method, but I’d guess that such a method would end up looking like a for loop, so I think the question of elegance is mostly: do you care about the for loop existing at all in the internals or are you looking for an abstraction?

1 Like

I don’t care about the for loop existing behind scenes. Just thought there could be a canned way to do that.

I’m not aware of one, but it might exist if you dig into the DataFrames code. Seems worth making a PR if you write your own solution you’re proud of.

Here is an MWE for a Matrix. It’s not that performant because of the use of eachrow, but it should be easy to see how to make the last step into a fast function.

This MWE also assumes we know the size of the matrix before hand, but this could also be gotten quite easily from calling maximum on each of the index columns.

julia> df = DataFrame(i = Int[], j = Int[])
0Γ—2 DataFrame


julia> t = Base.Iterators.product(1:4, 1:6);

julia> m = Array{Float64}(undef, 4, 6);

julia> for ti in t
       push!(df, ti)
       end

julia> df.val = rand(24);

julia> for row in eachrow(df)
       m[row.i, row.j] = row.val
       end

Just guessing, but might it be possible to (ab?)use a SparseMatrix for this? If this was two dimensions, you could go with the built in sparse matrix, for higher dimensions I found this package which unfortunately seems unmaintained: https://github.com/jw3126/SimpleSparseArrays.jl

This also seems to be relevant:
https://github.com/davidavdav/NamedArrays.jl/pull/96

1 Like

Thank you all. I just did an ugly loop with DataFramesMeta’s @where. It was surprisingly fast … or not, it’s Julia :slight_smile:

1 Like

I needed something along these lines, here is the function I wrote that works for any number of dimensions:

function df2array(df, vname, cnames)
    size = [maximum(df[:,n]) for n in cnames]
    R = zeros(Float64,size...)

    for i in 1:nrow(df)
        R[CartesianIndex( vec(convert(Array, df[i,cnames]))...  ) ] = df[i,vname]
    end
    R  
end

it could definitely use some typing.

2 Likes

Since this was revived, here’s one way:

julia> using DataFrames, AxisKeys

julia> df = DataFrame(rand(1:10, 3, 4), [:i, :j, :k, :z])
3Γ—4 DataFrame
 Row β”‚ i      j      k      z     
     β”‚ Int64  Int64  Int64  Int64 
─────┼────────────────────────────
   1 β”‚    10      3      9      4
   2 β”‚     5      7      4      2
   3 β”‚    10      2      4      3

julia> wrapdims(df, :z, :i, :j, :k; default=missing, sort=true)
3-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   i ∈ 2-element Vector{Int64}
β†’   j ∈ 3-element Vector{Int64}
β–‘   k ∈ 2-element Vector{Int64}
And data, 2Γ—3Γ—2 Array{Union{Missing, Int64}, 3}:
[:, :, 1] ~ (:, :, 4):
       (2)         (3)         (7)
  (5)     missing     missing    2
 (10)    3            missing     missing

[:, :, 2] ~ (:, :, 9):
       (2)         (3)         (7)
  (5)     missing     missing     missing
 (10)     missing    4            missing

julia> df2 = DataFrame(rand(Int8, 3, 4), [:i, :j, :k, :z])
3Γ—4 DataFrame
 Row β”‚ i     j     k     z    
     β”‚ Int8  Int8  Int8  Int8 
─────┼────────────────────────
   1 β”‚  119    93   -48   -70
   2 β”‚  -78  -115   -73   -61
   3 β”‚   42  -122    97   -92

julia> wrapdims(df2, :z, :i, :j, :k; default=0)
3-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   i ∈ 3-element Vector{Int8}
β†’   j ∈ 3-element Vector{Int8}
β–‘   k ∈ 3-element Vector{Int8}
And data, 3Γ—3Γ—3 Array{Union{Int64, Int8}, 3}:
[:, :, 1] ~ (:, :, -48):
        (93)  (-115)  (-122)
 (119)   -70       0       0
 (-78)     0       0       0
  (42)     0       0       0

[:, :, 2] ~ (:, :, -73):
        (93)  (-115)  (-122)
 (119)     0       0       0
 (-78)     0     -61       0
  (42)     0       0       0

[:, :, 3] ~ (:, :, 97):
        (93)  (-115)  (-122)
 (119)     0       0       0
 (-78)     0       0       0
  (42)     0       0     -92

This goes via Tables.jl. Could use more polish, ideally would be able to infer :i, :j, :k, and accept "i", "j", "k". but should work. There may be similar functions in the other similar packages.

5 Likes