Expanding Named Tuples

Starting from this situation (or similar),

julia> hcat(DataFrame(a=[1,2]), [(b=1,c=2), (b=3,c=4)])
2×2 DataFrame
 Row │ a      x1             
     │ Int64  NamedTup…      
─────┼───────────────────────
   1 │     1  (b = 1, c = 2)
   2 │     2  (b = 3, c = 4)

is there a direct way to obtain the result shown below?

More generally, in what ways can a flat DataFrame be obtained from such a starting point?

2×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      3      4

Not clear what the real situation is, but this gives the desired result:

julia> hcat( DataFrame(a=[1,2]), DataFrame([(b=1,c=2),(b=3,c=4)]) )
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      3      4

But of course, I have just changed the situation to get this.
What is your real starting point?


No real situation.
I was reading this PR and I asked myself the question.

I would expect a solution of the following type (i.e., one that takes the dataframe and the column to expand as arguments), but without hcat.

dfnt=hcat(DataFrame(a=[1,2]), [(b=1,c=2), (b=3,c=4)])
hcat(dfnt.a,DataFrame(dfnt.x1))

The problem (as explained in the PR you linked) is that we already define hcat(df::AbstractDataFrame, v::AbstractVector). We would like to add something along the lines of

function hcat(df::AbstractDataFrame, t::Any)
    if Tables.istable(t)
        hcat(df, DataFrame(t; copycols = false))
    end
end

But of course, we can’t use dispatch to check if something is a Table in the Tables.jl-sense. And there are many things that are both <: Vector and satisfy Tables.istable, like a vector of named tuples.
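The overlap described above can be checked directly. This is only a small sketch, assuming Tables.jl is installed; the variable name `v` is just for illustration.

```julia
using Tables

# A vector of named tuples, as in the original question.
v = [(b = 1, c = 2), (b = 3, c = 4)]

# It is an ordinary AbstractVector, so the existing hcat method for vectors applies...
is_vector = v isa AbstractVector

# ...and at the same time it is a table in the Tables.jl sense.
is_table = Tables.istable(v)

println((is_vector, is_table))
```

Because both checks succeed for the same object, a method that dispatches on the argument type alone cannot tell which behavior the caller wants.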

But we are post 1.0, so we can’t break hcat(df::AbstractDataFrame, v::AbstractVector). So no, you won’t be able to do

hcat(DataFrame(a=[1,2]), [(b=1,c=2), (b=3,c=4)])

and have it automatically flatten. That would break a post-1.0 guarantee of stability.


Hi @pdeffebach,

I’m not sure if I understand your answer correctly (my knowledge of Julia is very limited), but I would like to be sure that I have asked my question correctly and so I try to ask it again.

If, after somehow transforming a dataframe (perhaps one obtained by reading a JSON file), I get some columns that are vectors of named tuples, how can I expand/flatten them to obtain distinct columns named after the keys of the named tuples?

PS
I don’t intend to use / modify the hcat function

Calling DataFrame(v) should work.

If you have this scenario

julia> vnt = [(a = 1, b = (c = 2, d = 3)), (a = 4, b = (c = 5, d = 6))]
2-element Vector{NamedTuple{(:a, :b), Tuple{Int64, NamedTuple{(:c, :d), Tuple{Int64, Int64}}}}}:
 (a = 1, b = (c = 2, d = 3))
 (a = 4, b = (c = 5, d = 6))

then I’m not 100% sure what the solution is, but I’m sure other people can help out. Here is one solution using recursion:

julia> vnt = [(a = 1, b = (c = 2, d = 3)), (a = 4, b = (c = 5, d = 6))];

julia> function unnest!(d, nt)
           for (n, v) in pairs(nt)
               if v isa NamedTuple
                   unnest!(d, v)
               else
                   push!(d, n => v)
               end
           end
       end;

julia> function unnest(nt)
           d = Dict{Symbol, Any}()
           unnest!(d, nt)
           return d
       end;

julia> Tables.istable(unnest.(vnt))
true

julia> DataFrame(unnest.(vnt))
2×3 DataFrame
 Row │ a      d      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      3      2
   2 │     4      6      5

But this solution has some problems. In particular, it won’t have consistent column ordering (this can be fixed using an ordered dict from OrderedCollections.jl).
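One way to sidestep the ordering problem without an extra package is to build a flat NamedTuple instead of a Dict. The sketch below is an illustration only: `single` and `flatten_nt` are hypothetical helper names, not part of DataFrames.jl, and the approach assumes that field names are unique across nesting levels.

```julia
using DataFrames

# Hypothetical helper: wrap one key/value pair in a one-field NamedTuple.
single(k::Symbol, v) = NamedTuple{(k,)}((v,))

# Hypothetical helper: recursively splice nested NamedTuples into one flat
# NamedTuple. Fields keep their original order. If a name occurs twice,
# `merge` keeps the later value, so duplicates are silently overwritten.
function flatten_nt(nt::NamedTuple)
    parts = (v isa NamedTuple ? flatten_nt(v) : single(k, v) for (k, v) in pairs(nt))
    return foldl(merge, parts; init = NamedTuple())
end

vnt = [(a = 1, b = (c = 2, d = 3)), (a = 4, b = (c = 5, d = 6))]

# Each row becomes a flat NamedTuple, so the columns come out as a, c, d.
flat = DataFrame(flatten_nt.(vnt))
```

Since every row is a NamedTuple with the same field order, the column order is deterministic, unlike the Dict-based version.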

But I feel like we have good solutions to this problem that I am not finding at the moment.

EDIT: Also look at JSONTables.jl, for a particular JSON-oriented use-case

julia> t=hcat(DataFrame(a=[1,2]), [(b=1,c=2), (b=3,c=4)])
2×2 DataFrame
 Row │ a      x1
     │ Int64  NamedTup…
─────┼───────────────────────
   1 │     1  (b = 1, c = 2)
   2 │     2  (b = 3, c = 4)

From a column like x1 above, you can proceed with:

julia> DataFrame(t[!,:x1])
2×2 DataFrame
 Row │ b      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
   2 │     3      4

Just saying, perhaps that’s what you are looking for.

No. This is not exactly what I am looking for, because I would like to have the whole dataframe as an output.

Something like this (which came to my mind trying to figure out the functions of @pdeffebach )

df=DataFrame(vnt)
transform(df, :b=>ByRow(x->(x.c,x.d))=>[:c,:d])

Well, the whole DataFrame is then this:

julia> hcat(DataFrame(a=t[!,:a]),DataFrame(t[!,:x1]))
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      2
   2 │     2      3      4

But, never mind, it seems I am missing the point. I also didn’t read the PR you linked.

Use AsTable as the output:

df=DataFrame(vnt)
transform(df, :b=>ByRow(identity) => AsTable)

On the hcat question, a Val-based helper can stand in for dispatch:

istableval(x) = Val(Tables.istable(x)) # const prop should make this infer correctly
hcat(df::AbstractDataFrame, t) = _hcat(istableval(t), df, t)
_hcat(::Val{true}, df, t) = hcat(df, DataFrame(t; copycols = false))
_hcat(::Val{false}, df, t) = ...

I placed the Val as the first argument of _hcat because hcat often accepts Varargs.
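To make the pattern concrete, here is a runnable sketch under some assumptions. It uses a hypothetical function name `attach` instead of extending `hcat`, so that no existing method is overwritten, and it fills the non-table branch with a plain `hcat(df, t)`, which is only one possible choice for the part left open above.

```julia
using DataFrames, Tables

# Lift the runtime check into the type domain so it can be dispatched on.
istableval(x) = Val(Tables.istable(x))

# Hypothetical entry point (not a DataFrames.jl function).
attach(df::AbstractDataFrame, t) = _attach(istableval(t), df, t)

# Table-like input: convert to a DataFrame first, so its fields become columns.
_attach(::Val{true}, df, t) = hcat(df, DataFrame(t))

# Anything else: assumed fallback to the existing behavior (a single new column).
_attach(::Val{false}, df, t) = hcat(df, t)

df = DataFrame(a = [1, 2])
expanded = attach(df, [(b = 1, c = 2), (b = 3, c = 4)])  # columns a, b, c
single_col = attach(df, [10, 20])                         # columns a, x1
```

Note that this sketch deliberately treats a vector of named tuples as a table, which is exactly the choice `hcat` itself cannot make without changing its existing behavior.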

Fair enough! Perhaps this technique can be used internally in DataFrames more.

It still doesn’t get around the problem that some things are both <: AbstractVector and tables.

It seems that ByRow is not needed:

select(transform(df, :b=>identity=>AsTable), Not(:b))

Just out of curiosity, why doesn’t the following expression work like the previous one?

select(transform(df, :b=>AsTable), Not(:b))

I think it probably should work, and I’ve filed an issue here. But no guarantees because the mini-language is complicated enough as-is.
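For reference, the explicit `identity` form from this thread can be applied to the data frame in the original question in a single `select` call. This is a sketch assuming a DataFrames.jl version in which `identity => AsTable` behaves as described above; the names `df` and `flat` are just for illustration.

```julia
using DataFrames

# Same shape as the data frame in the original question:
# one plain column and one column holding named tuples.
df = DataFrame(a = [1, 2], x1 = [(b = 1, c = 2), (b = 3, c = 4)])

# Keep every column except :x1, then expand :x1 into one column per field.
flat = select(df, Not(:x1), :x1 => identity => AsTable)
```

This takes the whole data frame and the column to expand as inputs and avoids `hcat` entirely.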
