I’m trying to randomly split some data stratified by a reference vector. My idea was to group by each category in the reference column and then create a new random column using a Bernoulli random sample.
using Tidier, DataFrames, Distributions, Random
Random.seed!(123)
df = DataFrame(reference = rand(Bernoulli(0.6), 10))
@chain df begin
    @group_by(reference)
    @mutate(split = rand(Bernoulli(0.8), n()))
    @ungroup
end
This produces a whole vector in every row instead of a single value:
10×2 DataFrame
 Row │ reference  split
     │ Bool       Array…
─────┼─────────────────────────────────────────
   1 │      true  Bool[1, 0, 1, 1, 0, 1, 0, 0]
   2 │      true  Bool[0, 1, 1, 1, 1, 1, 0, 1]
   3 │      true  Bool[1, 1, 1, 1, 1, 0, 0, 0]
   4 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   5 │      true  Bool[1, 1, 1, 0, 0, 1, 1, 1]
   6 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   7 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 0]
   8 │      true  Bool[0, 1, 1, 0, 1, 0, 1, 1]
   9 │     false  Bool[1, 1]
  10 │     false  Bool[0, 1]
When I remove n(), it produces a single random value that is repeated for every row in the group:
10×2 DataFrame
 Row │ reference  split
     │ Bool       Bool
─────┼──────────────────
   1 │      true   true
   2 │      true   true
   3 │      true   true
   4 │      true   true
   5 │      true   true
   6 │      true   true
   7 │      true   true
   8 │      true   true
   9 │     false   true
  10 │     false   true
What would be a way to get the desired result, that is, one independent Bernoulli draw per row within each group?