Deconstruct NamedTuple Column with Prefix

Similar to this one: Transform! to destructure NamedTuple into columns

I’d like to do the same, but be able to programmatically prefix the “child” columns with the name of the “parent” column.

That is, in the linked MWE, I would like to get the column names x_a and x_b automatically.

Something like

nms = df.x[1] |> keys .|> string
rename!(df, (nms .=> "x_" .* nms)...)

works but feels cumbersome. Can it be done directly in the transform statement?

Thanks!

Nothing super easy in DataFrames.jl for this, unfortunately. You could do

julia> function add_prefix(nt, pre)
           nms = Symbol.(pre, "_", propertynames(nt))
           vals = values(nt)
           NamedTuple{nms}(vals)
       end
add_prefix (generic function with 1 method)

julia> df = DataFrame(x = rand(2), y =[(a=1,b=2),(a=3,b=5)])
2×2 DataFrame
 Row │ x         y
     │ Float64   NamedTup…
─────┼──────────────────────────
   1 │ 0.485175  (a = 1, b = 2)
   2 │ 0.822109  (a = 3, b = 5)

julia> transform(df, :y => ByRow(t -> add_prefix(t, "y")) => AsTable)
2×4 DataFrame
 Row │ x         y               y_a    y_b
     │ Float64   NamedTup…       Int64  Int64
─────┼────────────────────────────────────────
   1 │ 0.485175  (a = 1, b = 2)      1      2
   2 │ 0.822109  (a = 3, b = 5)      3      5

But that doesn’t give you the name “y” automatically. If you really want access to column names inside the fun of src => fun => dest, you could do AsTable(src) => ..., but that’s probably more trouble than it’s worth.
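For reference, here is a rough sketch of what the AsTable(src) route could look like. With AsTable(:y) as the source, the function receives a one-field NamedTuple per row, such as (y = (a = 1, b = 2),), so the parent name can be read from its key. The helper name unnest_with_prefix is made up for this illustration and is not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper (not a DataFrames.jl function).
# `row` is the one-field NamedTuple that AsTable(:y) passes for each row,
# e.g. (y = (a = 1, b = 2),).
function unnest_with_prefix(row)
    parent = only(keys(row))      # the parent column name, here :y
    child = only(values(row))     # the nested NamedTuple, e.g. (a = 1, b = 2)
    nms = Symbol.(parent, "_", keys(child))
    NamedTuple{nms}(values(child))
end

df = DataFrame(x = [0.1, 0.2], y = [(a = 1, b = 2), (a = 3, b = 5)])

# The prefix "y" is no longer written out by hand.
out = transform(df, AsTable(:y) => ByRow(unnest_with_prefix) => AsTable)
```

This only handles a single source column, since only throws if AsTable selects more than one.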


Another option would be @unnest_wider from TidierData. It prefixes the column names automatically, but it drops the original column, which might also bring some performance benefit:

julia> using TidierData; df = DataFrame(x = rand(2), y =[(a=1,b=2),(a=3,b=5)]);

julia> @time transform(df, :y => ByRow(t -> add_prefix(t, "y")) => AsTable)
  0.031335 seconds (67.12 k allocations: 3.632 MiB, 98.90% compilation time)
2×4 DataFrame
 Row │ x         y               y_a    y_b   
     │ Float64   NamedTup…       Int64  Int64 
─────┼────────────────────────────────────────
   1 │ 0.478783  (a = 1, b = 2)      1      2
   2 │ 0.741308  (a = 3, b = 5)      3      5

julia> @time @unnest_wider(df, y)
  0.000086 seconds (96 allocations: 4.539 KiB)
2×3 DataFrame
 Row │ x         y_a    y_b   
     │ Float64   Int64  Int64 
─────┼────────────────────────
   1 │ 0.478783      1      2
   2 │ 0.741308      3      5

It’s worth noting that if the transform call with add_prefix is wrapped in a function, there is no performance difference after the first call (the gap is all due to compilation of the anonymous function):

julia> using DataFrames

julia> df = DataFrame(x = rand(2), y =[(a=1,b=2),(a=3,b=5)]);

julia> function add_prefix(nt, pre)
           nms = Symbol.(pre, "_", propertynames(nt))
           vals = values(nt)
           NamedTuple{nms}(vals)
       end
add_prefix (generic function with 1 method)

julia> foo(df) = transform(df, :y => ByRow(t -> add_prefix(t, "y")) => AsTable)
foo (generic function with 1 method)

julia> @time foo(df); # Warmup
  0.032263 seconds (140.54 k allocations: 7.444 MiB, 99.42% compilation time)

julia> @time foo(df); # After warmup
  0.000157 seconds (134 allocations: 5.594 KiB)

So it’s not really “performance” per se, as much as less lag when interacting at the REPL or running a script in global scope (which may be common).
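If the lag at global scope matters, one thing that may help is to avoid writing a fresh anonymous function on every call, since each t -> ... literal evaluated at top level is a new function type that has to be compiled again. The sketch below swaps the closure for Base.Fix2, which should behave like t -> add_prefix(t, "y") but has the same type every time. The variable name prefix_y is just for illustration, and I have not benchmarked this.

```julia
using DataFrames

function add_prefix(nt, pre)
    nms = Symbol.(pre, "_", propertynames(nt))
    NamedTuple{nms}(values(nt))
end

df = DataFrame(x = rand(2), y = [(a = 1, b = 2), (a = 3, b = 5)])

# Base.Fix2(f, b) is a callable that computes f(a, b) when called with a.
# Its type depends only on add_prefix and String, not on where it is written.
prefix_y = Base.Fix2(add_prefix, "y")

res = transform(df, :y => ByRow(prefix_y) => AsTable)
```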


If I’m not mistaken, this appears to be a more efficient approach:

hcat(df, DataFrame(df.y, [Symbol(:y_, k) for k in keys(first(df.y))]))
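The same idea can be wrapped in a small helper so that the prefix comes from the column name rather than being typed in. This is only a sketch: unnest_prefixed is a made-up name, and it assumes every element of the column is a NamedTuple with the same fields.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl).
# Expands the NamedTuple column `col` into columns named "<col>_<field>"
# and appends them to a copy of `df`.
function unnest_prefixed(df::AbstractDataFrame, col::Symbol)
    child = DataFrame(df[!, col])                     # a vector of NamedTuples is a row table
    rename!(n -> string(col, "_", n), child)          # a => y_a, b => y_b, ...
    hcat(df, child)
end

df = DataFrame(x = [0.1, 0.2], y = [(a = 1, b = 2), (a = 3, b = 5)])
wide = unnest_prefixed(df, :y)
```

Because the names are computed once for the whole column rather than once per row, this avoids building a new NamedTuple for every row.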

Yeah, my implementation is not very good. If OP really cares about performance, it would be best not to recalculate the names of the tuples for every row.

julia> using DataFramesMeta  # provides @transform! and re-exports DataFrames

julia> function add_prefix(nt, pre)
           nt_long = Tables.columntable(nt)
           nms = Symbol.(pre, "_", propertynames(nt_long))
           NamedTuple{nms}(values(nt_long))
       end;

julia> df = DataFrame(x = rand(2), y =[(a=1,b=2),(a=3,b=5)]);

julia> @transform! df $AsTable = add_prefix(:y, "y")
2×4 DataFrame
 Row │ x          y               y_a    y_b   
     │ Float64    NamedTup…       Int64  Int64 
─────┼─────────────────────────────────────────
   1 │ 0.0859943  (a = 1, b = 2)      1      2
   2 │ 0.743904   (a = 3, b = 5)      3      5

This is definitely less usable than the TidierData version, of course. I can think of an improvement in DataFramesMeta.
