# Tidier creating random columns

I’m trying to randomly split some data stratified by a reference vector. My idea was to group by each category in the reference column and then create a new random column using a Bernoulli random sample.

using Tidier, DataFrames, Distributions, Random

Random.seed!(123)

df = DataFrame(reference = rand(Bernoulli(0.6),10))

@chain df begin
    @group_by(reference)
    @mutate(split = rand(Bernoulli(0.8), n()))
    @ungroup
end

Which produces:

10×2 DataFrame
 Row │ reference  split
     │ Bool       Array…
─────┼─────────────────────────────────────────
   1 │      true  Bool[1, 0, 1, 1, 0, 1, 0, 0]
   2 │      true  Bool[0, 1, 1, 1, 1, 1, 0, 1]
   3 │      true  Bool[1, 1, 1, 1, 1, 0, 0, 0]
   4 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   5 │      true  Bool[1, 1, 1, 0, 0, 1, 1, 1]
   6 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   7 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 0]
   8 │      true  Bool[0, 1, 1, 0, 1, 0, 1, 1]
   9 │     false  Bool[1, 1]
  10 │     false  Bool[0, 1]

When I remove n() it produces the same random value for each instance of the group.

10×2 DataFrame
 Row │ reference  split
     │ Bool       Bool
─────┼──────────────────
   1 │      true   true
   2 │      true   true
   3 │      true   true
   4 │      true   true
   5 │      true   true
   6 │      true   true
   7 │      true   true
   8 │      true   true
   9 │     false   true
  10 │     false   true

What would be a way to get the desired result?

Hi @RobertGregg, happy to help with this.

Can you share an example of what you would expect the result to look like? Would it be just another column with true and false values similar to the first one that also draws from a Bernoulli distribution?

Ah sorry. I expected the output to be another true/false column drawn from the Bernoulli distribution. I’m realizing now this wasn’t exactly what I wanted. I came up with this solution:

@chain df begin
    @group_by(reference)
    @mutate(split = if_else(row_number() ≤ 0.8 * n(), "training", "testing"))
    @mutate(split = ~shuffle(split))
    @ungroup
end

which I think is a reasonable way to create a test/train split stratified by another variable.
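
For anyone who wants to check the logic outside Tidier, here is a minimal sketch of the same stratified split written against DataFrames.jl directly. The helper name `label_split` and its `frac` keyword are made up for this illustration; only `groupby`, `transform`, and `shuffle` come from the packages. The example data is generated with plain `rand` so that Distributions is not needed.

```julia
using DataFrames, Random

Random.seed!(123)

df = DataFrame(reference = rand(10) .< 0.6)

# Hypothetical helper: label the first `frac` share of a group "training"
# and the rest "testing", then shuffle the labels within that group.
function label_split(x; frac = 0.8)
    n = length(x)
    labels = ifelse.((1:n) .<= frac * n, "training", "testing")
    return shuffle(labels)
end

# Apply the helper to each group of `reference`; it returns one label per row.
result = transform(groupby(df, :reference), :reference => label_split => :split)
```

Because the labels are assigned by position and only then shuffled, each group gets a fixed number of "training" rows (the integer part of `frac` times the group size) rather than a random number.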

I still don’t know how I would add a random variable column using @mutate. In R, something like mutate(out = runif(n())) would produce the desired result.

P.S. Thank you for the effort you’ve put into the Tidier ecosystem. It makes switching between R and Julia a breeze!

Thanks for sharing that.

Looking at this a bit more carefully, this is actually a bug in TidierData.jl. I think I know where it’s coming from, and we should be able to do this with much less clunky syntax than the workaround you came up with.

Going to file this as an issue on GitHub and will get this fixed.

Thanks for catching this.

@RobertGregg, this is now fixed. It was a simple issue, but I’m glad you caught this – it addresses several issues with n(). I’ll get the new version up on the registry today.

Things that are fixed:

• n() now returns a single value rather than a vector of values (the old behavior was a bug)
• rand() is no longer auto-vectorized so you don’t need to write ~rand()
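
To see how those two behaviors could combine into the array-valued column from the original post, here is a small illustration in plain Julia. It is a guess at the mechanism based on the output shown earlier in the thread, not TidierData.jl's actual code; `n_per_row` is a hypothetical stand-in for what n() seemed to return before the fix.

```julia
using Random

Random.seed!(123)

# Assumed pre-fix behavior: the group size repeated once per row.
n_per_row = fill(3, 3)

# Broadcasting rand over that vector builds one length-3 vector per row,
# which matches the shape of the Array-valued column in the first post.
nested = rand.(Bool, n_per_row)

# With a scalar group size and no broadcasting, one call gives one value per row.
flat = rand(Bool, 3)
```

Here `nested` is a vector of three Bool vectors, while `flat` is a single Bool vector of length 3.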

This should now work.

(I didn’t set the seed, so the results may be slightly different from what you had)

julia> df = DataFrame(reference = rand(Bernoulli(0.6),10))
10×1 DataFrame
 Row │ reference
     │ Bool
─────┼───────────
   1 │     false
   2 │     false
   3 │      true
   4 │      true
   5 │     false
   6 │      true
   7 │      true
   8 │     false
   9 │      true
  10 │      true

julia> @chain df begin
           @group_by(reference)
           @mutate(split = if_else(rand(n()) <= 0.8, "training", "testing"))
           @ungroup
       end
10×2 DataFrame
 Row │ reference  split
     │ Bool       String
─────┼─────────────────────
   1 │     false  training
   2 │     false  training
   3 │     false  testing
   4 │     false  training
   5 │      true  training
   6 │      true  testing
   7 │      true  training
   8 │      true  training
   9 │      true  training
  10 │      true  training
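
One thing to keep in mind: drawing each row independently gives roughly, not exactly, 80% training rows per group, and small groups can land far from that. A rough sketch using DataFrames.jl alone (not Tidier's implementation; the helper name `draw_labels` is made up here) that draws the labels per row and then counts them per group:

```julia
using DataFrames, Random

Random.seed!(123)

df = DataFrame(reference = rand(10) .< 0.6)

# Hypothetical helper: one independent uniform draw per row in the group;
# rows whose draw is <= 0.8 are labelled "training".
draw_labels(x) = ifelse.(rand(length(x)) .<= 0.8, "training", "testing")

result = transform(groupby(df, :reference), :reference => draw_labels => :split)

# Count labels per group to see how close each group is to an 80/20 split.
counts = combine(groupby(result, [:reference, :split]), nrow => :count)
```

If an exact per-group proportion matters, assigning labels by position and shuffling them within each group, as in the earlier workaround, avoids that variability.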