Tidier creating random columns

RobertGregg · March 5, 2024, 10:54pm

I’m trying to randomly split some data stratified by a reference vector. My idea was to group by each category in the reference column and then create a new random column using a Bernoulli random sample.

using Tidier, DataFrames, Distributions, Random

Random.seed!(123)

df = DataFrame(reference = rand(Bernoulli(0.6),10))

@chain df begin
    @group_by(reference)
    @mutate(split = rand(Bernoulli(0.8),n()))
    @ungroup
end

Which produces:

10×2 DataFrame
 Row │ reference  split
     │ Bool       Array…
─────┼─────────────────────────────────────────
   1 │      true  Bool[1, 0, 1, 1, 0, 1, 0, 0]
   2 │      true  Bool[0, 1, 1, 1, 1, 1, 0, 1]
   3 │      true  Bool[1, 1, 1, 1, 1, 0, 0, 0]
   4 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   5 │      true  Bool[1, 1, 1, 0, 0, 1, 1, 1]
   6 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 1]
   7 │      true  Bool[1, 1, 1, 1, 1, 1, 1, 0]
   8 │      true  Bool[0, 1, 1, 0, 1, 0, 1, 1]
   9 │     false  Bool[1, 1]
  10 │     false  Bool[0, 1]

When I remove n() it produces the same random value for each instance of the group.

10×2 DataFrame
 Row │ reference  split 
     │ Bool       Bool  
─────┼──────────────────
   1 │      true   true
   2 │      true   true
   3 │      true   true
   4 │      true   true
   5 │      true   true
   6 │      true   true
   7 │      true   true
   8 │      true   true
   9 │     false   true
  10 │     false   true

What would be a way to get the desired result?

kdpsingh · March 6, 2024, 5:28am

Hi @RobertGregg, happy to help with this.

Can you share an example of what you would expect the result to look like? Would it be just another column with true and false values similar to the first one that also draws from a Bernoulli distribution?

RobertGregg · March 6, 2024, 6:21am

Ah sorry. I expected the output to be another true/false column drawn from the Bernoulli distribution. I’m realizing now this wasn’t exactly what I wanted. I came up with this solution:

@chain df begin
    @group_by(reference)
    @mutate(split = if_else(row_number() ≤ 0.8*n(),"training","testing"))
    @mutate(split = ~shuffle(split))
    @ungroup
end

which I think is a reasonable way to create a test/train split stratified by another variable.
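
For comparison, the same label-then-shuffle idea can be sketched in plain DataFrames.jl without the Tidier macros. This is a minimal sketch, not the thread's solution: the helper name `split_labels` and its `frac` keyword are made up for illustration, and `rand(Bool, 10)` stands in for the Bernoulli-distributed reference column.

```julia
using DataFrames, Random

Random.seed!(123)
df = DataFrame(reference = rand(Bool, 10))

# Hypothetical helper: label floor(frac * n) rows "training" and the rest
# "testing", then shuffle the labels so the assignment within a group is random.
function split_labels(n; frac = 0.8)
    ntrain = floor(Int, frac * n)
    labels = vcat(fill("training", ntrain), fill("testing", n - ntrain))
    return shuffle(labels)
end

# The function is applied once per group and returns one label per row.
out = transform(groupby(df, :reference),
                :reference => (x -> split_labels(length(x))) => :split)
```

Because the labels are built per group before shuffling, each category keeps the same training fraction (up to rounding down), which is the point of stratifying.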

I still don’t know how I would add a random variable column using @mutate. In R, something like mutate(out = runif(n())) would produce the desired result.
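
Outside of the Tidier macros, a rough counterpart to R's `mutate(out = runif(n()))` can be written with plain DataFrames.jl. The sketch below is illustrative only: the column names `out` and `out_grouped` are made up, and `rand(Bool, 10)` stands in for the reference column.

```julia
using DataFrames, Random

Random.seed!(123)
df = DataFrame(reference = rand(Bool, 10))

# Ungrouped: one uniform draw per row.
df.out = rand(nrow(df))

# Grouped: the function receives each group's column and
# returns one uniform draw per row of that group.
gdf = groupby(df, :reference)
df2 = transform(gdf, :reference => (x -> rand(length(x))) => :out_grouped)
```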

P.S. Thank you for the effort you’ve put into the Tidier ecosystem. It makes switching between R and Julia a breeze!

kdpsingh · March 6, 2024, 4:28pm

Thanks for sharing that.

Looking at this a bit more carefully, this is actually a bug in TidierData.jl. I think I know where it’s coming from, but we should be able to do this with much less clunky syntax than the workaround you came up with.

Going to file this as an issue on GitHub and will get this fixed.

Thanks for catching this.

kdpsingh · March 8, 2024, 5:32pm

@RobertGregg, this is now fixed. It was a simple issue, but I’m glad you caught this – it addresses several issues with n(). I’ll get the new version up on the registry today.

Things that are fixed:

- n() now returns a single value instead of a vector of values (that was the bug)
- rand() is no longer auto-vectorized, so you don’t need to write ~rand()
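
As a plain-Julia analogy for why these two behaviors together could yield an array in every row (an illustration only, not TidierData's actual internals): broadcasting `rand` over a vector of sizes calls it once per element, while a single scalar size yields one flat vector. The variable names are made up for this sketch.

```julia
sizes = fill(3, 3)      # stand-in for an n() that returns [3, 3, 3] instead of 3

nested = rand.(sizes)   # broadcast: rand(3) is called once per element,
                        # giving a vector of three length-3 vectors

flat = rand(3)          # scalar size: a single length-3 vector, one value per row
```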

This should now work.

(I didn’t set the seed, so the results may be slightly different from what you had)

julia> df = DataFrame(reference = rand(Bernoulli(0.6),10))
10×1 DataFrame
 Row │ reference 
     │ Bool      
─────┼───────────
   1 │     false
   2 │     false
   3 │      true
   4 │      true
   5 │     false
   6 │      true
   7 │      true
   8 │     false
   9 │      true
  10 │      true

julia> @chain df begin
           @group_by(reference)
           @mutate(split = if_else(rand(n()) <= 0.8 ,"training","testing"))
           @ungroup
       end
10×2 DataFrame
 Row │ reference  split    
     │ Bool       String   
─────┼─────────────────────
   1 │     false  training
   2 │     false  training
   3 │     false  testing
   4 │     false  training
   5 │      true  training
   6 │      true  testing
   7 │      true  training
   8 │      true  training
   9 │      true  training
  10 │      true  training
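
The per-row logic in that `@mutate` call can also be sketched in plain Julia with explicit broadcasting. This is a sketch under the assumption that each row gets its own uniform draw; the names `u` and `labels` are illustrative. Note that, unlike the shuffle-based approach earlier in the thread, each row is assigned independently, so the training fraction is only about 0.8 on average, not exactly 80% per group.

```julia
using Random

Random.seed!(123)
n = 10

u = rand(n)                                          # one uniform draw per row
labels = ifelse.(u .<= 0.8, "training", "testing")   # element-wise comparison and choice
```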
