Transform! to destructure NamedTuple into columns

Hi all,

I would like to destructure a DataFrame column of NamedTuples into separate columns, using the keys as column names. For a similar problem with an array, I used something like transform!(df, :col => identity => new_names). Is there a way I can take this:

2×2 DataFrame
 Row │ x         y              
     │ Float64   NamedTup…      
─────┼──────────────────────────
   1 │ 0.222043  (a = 1, b = 2)
   2 │ 0.72646   (a = 3, b = 5)

and obtain this:

2×4 DataFrame
 Row │ x         y               a      b     
     │ Float64   NamedTup…       Int64  Int64 
─────┼────────────────────────────────────────
   1 │ 0.222043  (a = 1, b = 2)      1      2
   2 │ 0.72646   (a = 3, b = 5)      3      5

Thanks!

MWE

using DataFrames

df = DataFrame(x = rand(2), y =[(a=1,b=2),(a=3,b=5)])

df_new = DataFrame(x = df.x, y = df.y, a = [1,3], b = [2,5])

You can use AsTable:

julia> transform(df, :y => AsTable)
2×4 DataFrame
 Row │ x         y               a      b     
     │ Float64   NamedTup…       Int64  Int64 
─────┼────────────────────────────────────────
   1 │ 0.459213  (a = 1, b = 2)      1      2
   2 │ 0.241038  (a = 3, b = 5)      3      5
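Since the question asks about transform! specifically, here is a minimal sketch of the in-place variant, assuming a recent DataFrames.jl release where AsTable is accepted as a target. The follow-up select! step is only an illustration for the case where the original NamedTuple column is no longer wanted; it is not part of the question.

```julia
using DataFrames

df = DataFrame(x = rand(2), y = [(a = 1, b = 2), (a = 3, b = 5)])

# In-place variant: adds columns :a and :b to df itself, keeping :x and :y.
transform!(df, :y => AsTable)

# Optional: drop the original NamedTuple column once it has been expanded.
select!(df, Not(:y))
```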

Very nice. Thanks!

This seems to be equivalent to:

hcat(df, DataFrame(df.y))

Yes, but it will do more allocations (which is a minor issue but still might be relevant occasionally).


Obviously I am doing something wrong, but for the small OP example I actually see fewer allocations?


Ah yes, I also see fewer allocations with your solution… @bkamins?

By the way, since [a b] is syntactic sugar for hcat(a, b), you can also write:

[df DataFrame(df.y)]

Ah - you are right:

julia> df = repeat(DataFrame(x=1, y=(a=1,b=2)), 10^8);

julia> @time transform(df, :y => AsTable);
  5.982615 seconds (200.00 M allocations: 9.686 GiB, 7.61% gc time)

julia> @time [df DataFrame(df.y)];
  1.345369 seconds (61 allocations: 5.215 GiB, 6.20% gc time)

This means that I need to optimize the internals of transform :smile:.

The following guarantees that there is no aliasing between source and target, while also avoiding unnecessary allocations:

julia> @time hcat(copy(df), DataFrame(df.y), copycols=false);
  1.231519 seconds (67 allocations: 3.725 GiB, 35.97% gc time)
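To make the aliasing point concrete, here is a small sketch on the two-row example from the original post. It assumes that hcat with copycols=false reuses the input column vectors directly; the variable names aliased and independent are only illustrative.

```julia
using DataFrames

df = DataFrame(x = rand(2), y = [(a = 1, b = 2), (a = 3, b = 5)])

# copycols=false reuses the column vectors of the inputs as-is,
# so :x in the result is the very same vector as df.x.
aliased = hcat(df, DataFrame(df.y), copycols = false)
shares_x = aliased.x === df.x

# Copying df first means the reused vectors belong to the copy, not to df,
# and the :a and :b columns are freshly built by DataFrame(df.y) anyway.
independent = hcat(copy(df), DataFrame(df.y), copycols = false)
separate_x = independent.x !== df.x
```

With the aliased version, mutating a column of the result also mutates df, which is why the copy(df) form is the safer choice when the source is still in use.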