Creating a new column that itself contains DataFrames (e.g. from "complex" function output)

I am not sure how best to deal with function output that is not a simple scalar when working with, e.g., combine. Consider the following example, which calculates deciles for each numeric variable and each group:

## Create Dataset to analyze
using DataFrames, DataFramesMeta  # DataFramesMeta provides @transform and re-exports @chain

df = @chain begin
    DataFrame(rand(100,3), :auto)
    @transform :gr = repeat('A':'D'; inner=25)
end

## Calculate deciles for each numeric variable and each group
## and save them in a Dataframe (group x variable)

using Statistics

deciles = collect(0.0:0.1:1.0)
decile_fun(x) = quantile(x, deciles)

## This works
dstat = combine(groupby(df, :gr), names(df, Number) .=> (x-> Dict(zip("Q_".*string.(deciles), decile_fun(x))) ))
dstat = select(dstat, :gr, names(dstat, Dict) .=> ByRow(DataFrame))

## Also this one: it gives one long format dataframe
dstat2 = combine(groupby(df, :gr), names(df, Number) .=> decile_fun, :gr => (x->deciles) => :qunt)

## This fails
dstat_fail1 = combine(groupby(df, :gr), names(df, Number) .=> (x-> DataFrame(Dict(zip("Q_".*string.(deciles),  decile_fun(x)))))) 
dstat_fail2 = combine(groupby(df, :gr), names(df, Number) .=> (x-> DataFrame(qunt=deciles, value=decile_fun(x)))) 

Does one really have to go via Dict and then convert to DataFrame?

EDIT: Found the reason: I have to wrap DataFrame(...) in [ ]. That makes sense now, but it is not very intuitive at first.
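As a minimal sketch of the fix described in the EDIT, the second failing call can be made to work by wrapping the DataFrame in a one-element vector. The small deterministic data set and the "_deciles" target names are assumptions made here for illustration; they are not part of the original post.

```julia
using DataFrames, Statistics

# Small deterministic stand-in for the random data set in the question
df = DataFrame(x1 = 1.0:8.0, x2 = 8.0:-1.0:1.0, gr = repeat(['A', 'B']; inner = 4))
deciles = 0.0:0.1:1.0
num_cols = names(df, Number)   # ["x1", "x2"]

# Returning a one-element vector makes combine store the DataFrame as a
# single cell instead of trying to expand it into rows and columns
wrap_deciles(x) = [DataFrame(qunt = deciles, value = quantile(x, deciles))]

# Explicit target names (illustrative) keep the three outputs from clashing
nested = combine(groupby(df, :gr), num_cols .=> wrap_deciles .=> num_cols .* "_deciles")
```

Each cell of `nested.x1_deciles` and `nested.x2_deciles` is then an 11×2 DataFrame, one per group.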

Still questions:

25.2.2 List-columns
at 25 Many models | R for Data Science

Sorry for the many questions… I am trying to understand whether to switch to Julia for data analysis.

If you have complex outputs and want to output into a single cell in the resulting DataFrame, just wrap the output in a Ref (or [] as you discovered).

I think it’s never a great idea to have too complex a transformation in the “middle” part of the source => fun => dest minilanguage. Given you are defining an auxiliary function anyway, I’d do:

julia> decile_df(x; decs = 0:0.1:1) = DataFrame("Q_" .* string.(decs) .=> quantile(x, decs))
decile_df (generic function with 1 method)

julia> combine(groupby(df, :gr), names(df, Number) .=> Ref ∘ decile_df)
4×4 DataFrame
 Row │ gr    x1_Ref_decile_df  x2_Ref_decile_df  x3_Ref_decile_df
     │ Char  DataFrame         DataFrame         DataFrame
─────┼───────────────────────────────────────────────────────────
   1 │ a     1×11 DataFrame    1×11 DataFrame    1×11 DataFrame
   2 │ b     1×11 DataFrame    1×11 DataFrame    1×11 DataFrame
   3 │ c     1×11 DataFrame    1×11 DataFrame    1×11 DataFrame
   4 │ d     1×11 DataFrame    1×11 DataFrame    1×11 DataFrame

I’m not as au fait with R anymore, but I believe nest and unnest are stack and unstack.

And to answer your last bullet point - yes all of this is possible.

There are many questions, so let me try showing what I think you want (if I missed something please comment).

Variant 1: get for each variable and for each group a data frame with the result:

julia> combine(groupby(df, :gr), names(df, Number) .=> (x -> Ref(DataFrame(q=0.0:0.1:1.0, v=quantile(x, 0.0:0.1:1.0)))) => x -> x * "_DataFrame")
4×4 DataFrame
 Row │ gr    x1_DataFrame    x2_DataFrame    x3_DataFrame
     │ Char  DataFrame       DataFrame       DataFrame
─────┼─────────────────────────────────────────────────────
   1 │ A     11×2 DataFrame  11×2 DataFrame  11×2 DataFrame
   2 │ B     11×2 DataFrame  11×2 DataFrame  11×2 DataFrame
   3 │ C     11×2 DataFrame  11×2 DataFrame  11×2 DataFrame
   4 │ D     11×2 DataFrame  11×2 DataFrame  11×2 DataFrame

(instead of Ref you could wrap with [...] also, but Ref is a standard way in Base Julia broadcasting of turning any value into a scalar, so it is easier to remember)
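To illustrate the remark about Ref in plain Base Julia broadcasting, here is a minimal sketch that does not involve DataFrames at all. The variable names are made up for illustration.

```julia
allowed = [1, 2, 3]
candidates = [1, 2, 5]

# Ref(allowed) is treated as a scalar by broadcasting, so every candidate
# is tested against the whole `allowed` vector rather than element by element
hits = in.(candidates, Ref(allowed))

# Wrapping in a one-element vector gives the same result here
hits_vec = in.(candidates, [allowed])
```

Both calls give `[true, false, false]`. The same "treat this value as one unit" idea is what keeps combine from expanding a returned DataFrame.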

Variant 2: expand the data frames into columns while keeping the number of rows equal to the number of groups:

julia> combine(groupby(df, :gr), names(df, Number) .=> (x -> Ref(DataFrame(q=0.0:0.1:1.0, v=quantile(x, 0.0:0.1:1.0)))) => x -> x .* ["_q", "_v"])
4×7 DataFrame
 Row │ gr    x1_q                               x1_v                               x2_q                               x2_v                               x3_q            ⋯
     │ Char  Array…                             Array…                             Array…                             Array…                             Array…          ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ A     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0100206, 0.105031, 0.146141, …  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.026531, 0.106821, 0.25079, 0.…  [0.0, 0.1, 0.2, ⋯
   2 │ B     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0201708, 0.14565, 0.178781, 0…  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0379883, 0.0499197, 0.163857,…  [0.0, 0.1, 0.2,
   3 │ C     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0320945, 0.12889, 0.188354, 0…  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0560529, 0.166095, 0.221287, …  [0.0, 0.1, 0.2,
   4 │ D     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0154812, 0.046749, 0.167502, …  [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0…  [0.0325466, 0.15281, 0.280586, 0…  [0.0, 0.1, 0.2,

Variant 3: as variant 2, but expand to as many rows as quantiles (for each variable keep a separate quantile column as in general it could be different)

julia> combine(groupby(df, :gr), names(df, Number) .=> (x -> DataFrame(q=0.0:0.1:1.0, v=quantile(x, 0.0:0.1:1.0))) => x -> x .* ["_q", "_v"])
44×7 DataFrame
 Row │ gr    x1_q     x1_v       x2_q     x2_v       x3_q     x3_v
     │ Char  Float64  Float64    Float64  Float64    Float64  Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ A         0.0  0.0100206      0.0  0.026531       0.0  0.0304922
   2 │ A         0.1  0.105031       0.1  0.106821       0.1  0.116344
   3 │ A         0.2  0.146141       0.2  0.25079        0.2  0.160595
   4 │ A         0.3  0.239598       0.3  0.275699       0.3  0.223479
   5 │ A         0.4  0.418623       0.4  0.391514       0.4  0.283464
   6 │ A         0.5  0.479909       0.5  0.463614       0.5  0.350202
   7 │ A         0.6  0.661491       0.6  0.478091       0.6  0.421991
   8 │ A         0.7  0.709587       0.7  0.626841       0.7  0.464356
   9 │ A         0.8  0.778766       0.8  0.721748       0.8  0.581408
  10 │ A         0.9  0.922159       0.9  0.941598       0.9  0.762324
  11 │ A         1.0  0.986275       1.0  0.995137       1.0  0.923933
  12 │ B         0.0  0.0201708      0.0  0.0379883      0.0  0.0213256
  13 │ B         0.1  0.14565        0.1  0.0499197      0.1  0.163012
  ⋮  │  ⋮       ⋮         ⋮         ⋮         ⋮         ⋮         ⋮
  33 │ C         1.0  0.976539       1.0  0.902687       1.0  0.838742
  34 │ D         0.0  0.0154812      0.0  0.0325466      0.0  0.0327547
  35 │ D         0.1  0.046749       0.1  0.15281        0.1  0.187093
  36 │ D         0.2  0.167502       0.2  0.280586       0.2  0.31834
  37 │ D         0.3  0.236399       0.3  0.328495       0.3  0.389418
  38 │ D         0.4  0.31478        0.4  0.452312       0.4  0.428514
  39 │ D         0.5  0.325001       0.5  0.527466       0.5  0.526287
  40 │ D         0.6  0.448259       0.6  0.591354       0.6  0.546437
  41 │ D         0.7  0.573627       0.7  0.672257       0.7  0.613559
  42 │ D         0.8  0.802225       0.8  0.720313       0.8  0.730664
  43 │ D         0.9  0.893947       0.9  0.890409       0.9  0.908996
  44 │ D         1.0  0.931859       1.0  0.949966       1.0  0.977479
                                                         19 rows omitted

Variant 4: as variant 3, but single quantile column

julia> combine(groupby(df, :gr), names(df, Number) .=> (x -> quantile(x, 0.0:0.1:1.0)) => x -> x .* "_v", Returns((q=0.0:0.1:1.0,)))
44×5 DataFrame
 Row │ gr    x1_v       x2_v       x3_v       q
     │ Char  Float64    Float64    Float64    Float64
─────┼───────────────────────────────────────────────
   1 │ A     0.0100206  0.026531   0.0304922      0.0
   2 │ A     0.105031   0.106821   0.116344       0.1
   3 │ A     0.146141   0.25079    0.160595       0.2
   4 │ A     0.239598   0.275699   0.223479       0.3
   5 │ A     0.418623   0.391514   0.283464       0.4
   6 │ A     0.479909   0.463614   0.350202       0.5
   7 │ A     0.661491   0.478091   0.421991       0.6
   8 │ A     0.709587   0.626841   0.464356       0.7
   9 │ A     0.778766   0.721748   0.581408       0.8
  10 │ A     0.922159   0.941598   0.762324       0.9
  11 │ A     0.986275   0.995137   0.923933       1.0
  12 │ B     0.0201708  0.0379883  0.0213256      0.0
  13 │ B     0.14565    0.0499197  0.163012       0.1
  ⋮  │  ⋮        ⋮          ⋮          ⋮         ⋮
  33 │ C     0.976539   0.902687   0.838742       1.0
  34 │ D     0.0154812  0.0325466  0.0327547      0.0
  35 │ D     0.046749   0.15281    0.187093       0.1
  36 │ D     0.167502   0.280586   0.31834        0.2
  37 │ D     0.236399   0.328495   0.389418       0.3
  38 │ D     0.31478    0.452312   0.428514       0.4
  39 │ D     0.325001   0.527466   0.526287       0.5
  40 │ D     0.448259   0.591354   0.546437       0.6
  41 │ D     0.573627   0.672257   0.613559       0.7
  42 │ D     0.802225   0.720313   0.730664       0.8
  43 │ D     0.893947   0.890409   0.908996       0.9
  44 │ D     0.931859   0.949966   0.977479       1.0
                                       19 rows omitted

(Note here the way to return a constant column that does not depend on anything: you create a named tuple with the column name you want and wrap it in Returns.)
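For readers who have not met Returns before, the following Base-only sketch shows what it does on its own. It assumes Julia 1.7 or later, where Returns was added; the names `const_col` and `const_col_manual` are hypothetical.

```julia
# Returns(value) builds a callable that ignores its arguments and always
# gives back `value` (assumes Julia 1.7 or later)
const_col = Returns((q = 0.0:0.1:1.0,))

a = const_col("any", :arguments)   # arguments are ignored
b = const_col()                    # also callable with no arguments

# A roughly equivalent hand-written version for a single argument
const_col_manual = _ -> (q = 0.0:0.1:1.0,)
c = const_col_manual(nothing)
```

In the combine call above, the callable is invoked once per group, and because it returns a named tuple holding a vector, it contributes a column named q.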
