Hi,
I am trying to generate random numbers in a DataFrame that would be identical for the same categories.
For example:
│ Row │ id     │ type   │ mean    │ std     │
│     │ String │ String │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │
I can generate a random number for each row like so:
d.rng1 = rand.(Normal.(d.mean, d.std))
│ Row │ id     │ type   │ mean    │ std     │ rng1     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │
But how can I generate something like column rng2 below, where all rows with the same value in the column type share the same random number? Is there a (fast) way to do it without creating a subset?
│ Row │ id     │ type   │ mean    │ std     │ rng1     │ rng2     │
│     │ String │ String │ Float64 │ Float64 │ Float64  │ Float64  │
├─────┼────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1   │ A      │ typeA  │ 0.5     │ 0.2     │ 0.265455 │ 0.265455 │
│ 2   │ B      │ typeA  │ 0.5     │ 0.2     │ 0.59307  │ 0.265455 │
│ 3   │ C      │ typeB  │ 0.3     │ 0.1     │ 0.310257 │ 0.310257 │
│ 4   │ D      │ typeB  │ 0.3     │ 0.1     │ 0.305229 │ 0.310257 │
Thanks

Seed the RNG using the hash of the common key.
Reproducible:
julia> transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.878381
   2 │ type5   0.92784
   3 │ type6   0.625461
   4 │ type7   0.141122
   5 │ type8   0.847776
   6 │ type8   0.847776
   7 │ type8   0.847776
   8 │ type4   0.841166
   9 │ type4   0.841166
  10 │ type2   0.905406
Random (a fresh salt gives different draws on each run):
julia> salt=round(Int, 10000rand()); transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)+salt))) => :rng)
50×2 DataFrame
 Row │ type    rng
     │ String  Float64
─────┼───────────────────
   1 │ type0   0.105401
   2 │ type5   0.165727
   3 │ type6   0.261818
   4 │ type7   0.0470375
   5 │ type8   0.56809
   6 │ type8   0.56809
   7 │ type8   0.56809
   8 │ type4   0.399268
   9 │ type4   0.399268
  10 │ type2   0.533638
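
The same hash-seeding idea can be combined with the mean and std columns from the question. The following is only a sketch: it assumes Distributions.jl is loaded, reuses the column names from the question, and assumes that mean and std are constant within each type. Note also that the hash of a String may not be stable across Julia versions, so "reproducible" here is best read as reproducible within one setup.

```julia
using DataFrames, Distributions, Random

# Example data mirroring the table in the question
d = DataFrame(id   = ["A", "B", "C", "D"],
              type = ["typeA", "typeA", "typeB", "typeB"],
              mean = [0.5, 0.5, 0.3, 0.3],
              std  = [0.2, 0.2, 0.1, 0.1])

# Seed a fresh RNG from the hash of the key, then draw from Normal(mean, std).
# Rows with the same type (and the same mean/std) get the same draw.
draw(t, m, s) = rand(MersenneTwister(hash(t)), Normal(m, s))

transform!(d, [:type, :mean, :std] => ByRow(draw) => :rng2)
```

Constructing a MersenneTwister for every row is relatively expensive, so for large tables the groupby-based approach is likely the faster choice.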

nilshg, December 17, 2021, 8:59am (#4)
If I understand you correctly, you just want to draw one random number per group (rather than per row) and have it show up in all rows of that group? If so, this should do it:
julia> transform!(groupby(df, :type), :id => (x -> rand()) => :rng)
4×3 DataFrame
 Row │ id      type   rng      
     │ String  Int64  Float64  
─────┼─────────────────────────
   1 │ A           1  0.678716
   2 │ B           1  0.678716
   3 │ C           2  0.71421
   4 │ D           2  0.71421

bkamins, December 17, 2021, 10:51am (#5)
Yes, or:
transform!(groupby(df, :type), [] => rand => :rng)
which is a bit shorter to type.

And if I wanted to draw from the mean and std columns, as in
Normal.(d.mean, d.std)
how would you rewrite
transform!(groupby(df, :type), [] => rand => :rng)
?
Thanks a lot for the help.

nilshg, December 17, 2021, 11:29am (#7)
Oh wow, there are always new things in the minilanguage to discover! I had initially tried
:id => rand => :rng
but that of course won’t work, because it essentially calls rand(x) on each subgroup vector x (so it samples a random id within the group).
Is the use of [] as a column selector documented somewhere?
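
To illustrate the pitfall with :id as the source column, here is a small sketch (the data frame and the output column name picked are made up for illustration). With a source column, rand receives each group's id vector and returns one of its elements, rather than a fresh random number:

```julia
using DataFrames

df = DataFrame(id   = ["A", "B", "C", "D"],
               type = ["typeA", "typeA", "typeB", "typeB"])

# rand is called with the vector of ids of each group, so it picks one id
# from that group; the scalar result is then repeated for the whole group.
out = transform(groupby(df, :type), :id => rand => :picked)
```

Each value in picked is an id taken from the row's own group, which is why an empty column selector (or an anonymous function that ignores its argument) is needed to get a plain rand() call.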

nilshg, December 17, 2021, 11:33am (#8)
You can do
transform!(groupby(df, :type), [:mean, :std] => ((mean, std) -> rand(Normal(first(mean), first(std)))) => :rng)
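
For completeness, a self-contained sketch of this approach on the data from the question. It assumes Distributions.jl is installed and that mean and std are constant within each type, which is why taking first of each group vector is enough; the seed is arbitrary and only there to make the sketch repeatable.

```julia
using DataFrames, Distributions, Random

Random.seed!(1)  # arbitrary seed, only for repeatability of this sketch

df = DataFrame(id   = ["A", "B", "C", "D"],
               type = ["typeA", "typeA", "typeB", "typeB"],
               mean = [0.5, 0.5, 0.3, 0.3],
               std  = [0.2, 0.2, 0.1, 0.1])

# The function receives the mean and std vectors of one group and returns a
# single draw; transform! repeats that scalar for every row of the group.
transform!(groupby(df, :type),
           [:mean, :std] => ((m, s) -> rand(Normal(first(m), first(s)))) => :rng2)
```

Rows A and B then share one draw from Normal(0.5, 0.2), and rows C and D share one draw from Normal(0.3, 0.1).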

bkamins, December 17, 2021, 11:56am (#9)
It is just a vector selector, like e.g. [:a, :b] or any other column selector; it just happens to be an empty vector.
If it feels unintuitive to you, use:
transform!(groupby(df, :type), Cols() => rand => :rng)
and now I hope it is clear that Cols() selects no columns (I use [] as it is shorter to type).

nilshg, December 17, 2021, 11:57am (#10)
Makes sense - I guess I never came across
julia> select(df, [])
0×0 DataFrame
(as I suppose it is indeed a rarely used selector…)

bkamins, December 17, 2021, 12:17pm (#11)
or just:
julia> df[:, []]
0×0 DataFrame
Note that the same works with AbstractArrays:
julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3
julia> x[[]]
Int64[]
julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4
julia> x[:, []]
2×0 Matrix{Int64}
so, as usual, you get what Julia Base ships.