Identical Random number generation in a DataFrame based on row categories

CompulsoryCoffee · December 17, 2021, 8:16am

Hi,

I am trying to generate random numbers in a DataFrame that are identical for all rows in the same category.

For example:

│ Row │ id     │ type   │ mean    │ std     │
│     │ String │ String │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │

so I can generate a random number for each row like so:

d.rng1 = rand.(Normal.(d.mean, d.std))

│ Row │ id     │ type   │ mean    │ std     │ rng1     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │

But how can I generate something like the rng2 column below, where rows that share the same value in the type column get the same number? Is there a (fast) way to do it without creating a subset?

│ Row │ id     │ type   │ mean    │ std     │ rng1     │ rng2     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │ 0.265455 │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │ 0.310257 │

Thanks

lawless-m · December 17, 2021, 8:41am

Seed the RNG using the hash of the common key.

Reproducible:

julia> transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.878381
   2 │ type5   0.92784
   3 │ type6   0.625461
   4 │ type7   0.141122
   5 │ type8   0.847776
   6 │ type8   0.847776
   7 │ type8   0.847776
   8 │ type4   0.841166
   9 │ type4   0.841166
  10 │ type2   0.905406

Random (a different salt on each run):

julia> salt=round(Int, 10000rand()); transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)+salt))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.105401
   2 │ type5   0.165727
   3 │ type6   0.261818
   4 │ type7   0.0470375
   5 │ type8   0.56809
   6 │ type8   0.56809
   7 │ type8   0.56809
   8 │ type4   0.399268
   9 │ type4   0.399268
  10 │ type2   0.533638
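
The same idea can be sketched without DataFrames, using only the Random standard library and scaling a standard-normal draw by each row's mean and std. The helper name draw and the salt keyword are hypothetical, introduced only for illustration. As far as I know, hash values are not guaranteed to stay the same across Julia versions, so the "reproducible" variant should be treated as reproducible within one Julia version only.

```julia
using Random

types = ["typeA", "typeA", "typeB", "typeB"]
means = [0.5, 0.5, 0.3, 0.3]
stds  = [0.2, 0.2, 0.1, 0.1]

# Hypothetical helper: re-seed a generator from the category (plus an optional
# non-negative salt), then scale one standard-normal draw by the row's mean/std.
# Rows that share a type therefore share the underlying draw.
draw(t, m, s; salt::Integer = 0) =
    m + s * randn(MersenneTwister(hash(t, UInt(salt))))

# Reproducible: the same inputs give the same numbers on every call.
rng2 = draw.(types, means, stds)

# Random: pick a fresh salt once, then reuse it for every row.
salt = rand(UInt)
rng2_random = draw.(types, means, stds; salt = salt)
```

Note that a new MersenneTwister is constructed for every row, so on large tables the grouped approach is likely to be cheaper.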

nilshg · December 17, 2021, 8:59am

If I understand you correctly you just want to draw one random number per group (rather than per row) and have that show up in all rows of that group? If so this should do it:

julia> transform!(groupby(df, :type), :id => (x -> rand()) => :rng)
4×3 DataFrame
 Row │ id      type   rng      
     │ String  Int64  Float64  
─────┼─────────────────────────
   1 │ A           1  0.678716
   2 │ B           1  0.678716
   3 │ C           2  0.71421
   4 │ D           2  0.71421

bkamins · December 17, 2021, 10:51am

Yes, or:

transform!(groupby(df, :type), [] => rand => :rng)

which is a bit shorter to type.

CompulsoryCoffee · December 17, 2021, 11:21am

And if I wanted to use the mean and std columns as

Normal.(d.mean, d.std)

How would you write down

transform!(groupby(df, :type), [] => rand => :rng)

?

Thanks a lot for the help

nilshg · December 17, 2021, 11:29am

Oh wow, there’s always new things in the minilanguage to discover! I had tried

:id => rand => :rng

initially, but that of course won’t work because it essentially calls rand(x) on each subgroup vector x (so it essentially samples a random id within the group).

Is the use of [] as column selector documented somewhere?
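
A small sketch in plain Julia of why passing the :id column to rand returns ids rather than fresh numbers: rand applied to a collection samples from that collection. The variable names are made up for illustration.

```julia
ids = ["A", "B"]

picked = rand(ids)                   # one element of ids, chosen at random
fresh = rand()                       # a Float64 in [0, 1), unrelated to ids
resampled = rand(ids, length(ids))   # length(ids) elements drawn with replacement
```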

nilshg · December 17, 2021, 11:33am

You can do

transform!(groupby(df, :type), [:mean, :std] => ((mean, std) -> rand(Normal(first(mean), first(std)))) => :rng)
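
For reference, a self-contained sketch of the call above on the table from the first post. It assumes the DataFrames and Distributions packages are installed, and that mean and std are constant within each type, so taking the first value of each group is enough. The argument names m and s are used instead of mean and std only to avoid shadowing the functions of the same name.

```julia
using DataFrames, Distributions

df = DataFrame(
    id   = ["A", "B", "C", "D"],
    type = ["typeA", "typeA", "typeB", "typeB"],
    mean = [0.5, 0.5, 0.3, 0.3],
    std  = [0.2, 0.2, 0.1, 0.1],
)

# The function receives the mean and std columns of one group as vectors and
# returns a single number, which transform! repeats for every row of the group.
transform!(
    groupby(df, :type),
    [:mean, :std] => ((m, s) -> rand(Normal(first(m), first(s)))) => :rng2,
)
```

Because transform! is applied to the grouped view, the new rng2 column is added to df itself.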

bkamins · December 17, 2021, 11:56am

It is just a vector selector, like [:a, :b] or any other column selector. It just happens to be an empty vector.

If it feels unintuitive to you, use:

transform!(groupby(df, :type), Cols() => rand => :rng)

and now I hope it is clear that Cols() selects no columns (I use [] as it is shorter to type).

nilshg · December 17, 2021, 11:57am

Makes sense. I guess I never came across

julia> select(df, [])
0×0 DataFrame

(as I suppose it is indeed a rarely used selector…)

bkamins · December 17, 2021, 12:17pm

or just:

julia> df[:, []]
0×0 DataFrame

Note that the same works with AbstractArrays:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x[[]]
Int64[]

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> x[:, []]
2×0 Matrix{Int64}

so, as usual, you get what Julia Base ships.
