Identical random number generation in a DataFrame based on row categories

Hi,

I am trying to generate random numbers in a DataFrame that are identical for all rows in the same category.

For example:

│ Row │ id     │ type   │ mean    │ std     │
│     │ String │ String │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │

I can generate a random number for each row like so:

d.rng1 = rand.(Normal.(d.mean, d.std))

│ Row │ id     │ type   │ mean    │ std     │ rng1     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │

But how can I generate something like column rng2 below, where all rows with the same value in the column type share one random number? Is there a (fast) way to do it without creating a subset?

│ Row │ id     │ type   │ mean    │ std     │ rng1     │ rng2     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │ 0.265455 │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │ 0.310257 │

Thanks

Seed the RNG using the hash of the common key.

Reproducible:

julia> transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.878381
   2 │ type5   0.92784
   3 │ type6   0.625461
   4 │ type7   0.141122
   5 │ type8   0.847776
   6 │ type8   0.847776
   7 │ type8   0.847776
   8 │ type4   0.841166
   9 │ type4   0.841166
  10 │ type2   0.905406

Random:

julia> salt=round(Int, 10000rand()); transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)+salt))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.105401
   2 │ type5   0.165727
   3 │ type6   0.261818
   4 │ type7   0.0470375
   5 │ type8   0.56809
   6 │ type8   0.56809
   7 │ type8   0.56809
   8 │ type4   0.399268
   9 │ type4   0.399268
  10 │ type2   0.533638
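To connect the hash-seeding idea back to the mean and std columns from the question, here is a minimal sketch. The helper name seeded_draw and its salt keyword are made up for illustration, and the table is just the four-row example from the original post. It assumes Distributions.jl is loaded so that rand(rng, Normal(m, s)) is available.

```julia
using DataFrames, Distributions, Random

d = DataFrame(id   = ["A", "B", "C", "D"],
              type = ["typeA", "typeA", "typeB", "typeB"],
              mean = [0.5, 0.5, 0.3, 0.3],
              std  = [0.2, 0.2, 0.1, 0.1])

# Hypothetical helper: seed a fresh generator from the hash of the key,
# so rows with the same key (and the same mean/std) get the same draw.
seeded_draw(t, m, s; salt = UInt(0)) =
    rand(MersenneTwister(hash(t, salt)), Normal(m, s))

# Reproducible variant: the same type always yields the same number.
transform!(d, [:type, :mean, :std] => ByRow(seeded_draw) => :rng2)

# Random variant: a fresh salt changes the draws, but rows of one type still agree.
salt = rand(UInt)
transform!(d, [:type, :mean, :std] =>
              ByRow((t, m, s) -> seeded_draw(t, m, s; salt = salt)) => :rng3)
```

Two caveats: building a MersenneTwister for every row is relatively expensive, and as far as I know hash values are not guaranteed to be stable across Julia versions, so "reproducible" here should be read as reproducible within one setup.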

If I understand you correctly, you just want to draw one random number per group (rather than per row) and have it show up in all rows of that group? If so, this should do it:

julia> transform!(groupby(df, :type), :id => (x -> rand()) => :rng)
4×3 DataFrame
 Row │ id      type   rng      
     │ String  Int64  Float64  
─────┼─────────────────────────
   1 │ A           1  0.678716
   2 │ B           1  0.678716
   3 │ C           2  0.71421
   4 │ D           2  0.71421

Yes, or:

transform!(groupby(df, :type), [] => rand => :rng)

which is a bit shorter to type.

And if I wanted to draw from the distribution defined by the mean and std columns, as in

Normal.(d.mean, d.std)

how would you rewrite

transform!(groupby(df, :type), [] => rand => :rng)

?

Thanks a lot for the help

Oh wow, there are always new things in the minilanguage to discover! I had tried

:id => rand => :rng

initially, but that of course won’t work because it calls rand(x) on each subgroup vector x (so it essentially samples a random id within the group).
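A small sketch of what happens in that case, using the four-row table from the question (the output column name :picked is made up for illustration). With a source column given, the function receives that group's id values, and rand on a vector returns one of its elements:

```julia
using DataFrames

df = DataFrame(id   = ["A", "B", "C", "D"],
               type = ["typeA", "typeA", "typeB", "typeB"])

# rand is called with each group's id vector, so it returns one id
# from that group, which is then repeated over the group's rows.
out = transform(groupby(df, :type), :id => rand => :picked)
```

So the new column holds ids rather than Float64 draws, which is why a selector that passes no columns (or an anonymous function that ignores its argument) is needed to get a plain rand() per group.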

Is the use of [] as column selector documented somewhere?

You can do

transform!(groupby(df, :type), [:mean, :std] => ((mean, std) -> rand(Normal(first(mean), first(std)))) => :rng)
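For completeness, a self-contained sketch of that call on the table from the original question. It assumes mean and std are constant within each type, which is why taking first of each is enough. The helper name draw_group and the column name :rng2 are only illustrative.

```julia
using DataFrames, Distributions

df = DataFrame(id   = ["A", "B", "C", "D"],
               type = ["typeA", "typeA", "typeB", "typeB"],
               mean = [0.5, 0.5, 0.3, 0.3],
               std  = [0.2, 0.2, 0.1, 0.1])

# One draw per group from the Normal defined by the group's first mean/std.
# Assumption: mean and std do not vary within a group.
draw_group(m, s) = rand(Normal(first(m), first(s)))

# The scalar returned for each group is repeated over all rows of that group.
transform!(groupby(df, :type), [:mean, :std] => draw_group => :rng2)
```

Since the function returns a single number per group, no subsetting is needed and only one draw is made per type.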

It is just a vector selector, like e.g. [:a, :b] or any other column selector. It just happens to be an empty vector.

If it feels unintuitive to you, use:

transform!(groupby(df, :type), Cols() => rand => :rng)

and now I hope it is clear that Cols() selects no columns (I use [] as it is shorter to type 😄).

Makes sense - I guess I never came across

julia> select(df, [])
0×0 DataFrame

(as I suppose it is indeed a rarely used selector…)

or just:

julia> df[:, []]
0×0 DataFrame

Note that the same works with AbstractArrays:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x[[]]
Int64[]

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:, []]
2×0 Matrix{Int64}

so, as usual, you get what Julia Base ships.
