Hi,
I am trying to generate random numbers in a DataFrame that would be identical for the same categories.
for example:
│ Row │ id │ type │ mean │ std │
│ │ String │ String │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┤
│ 1 │ A │ typeA │ 0.5 │ 0.2 │
│ 2 │ B │ typeA │ 0.5 │ 0.2 │
│ 3 │ C │ typeB │ 0.3 │ 0.1 │
│ 4 │ D │ typeB │ 0.3 │ 0.1 │
so I can generate a random number for each row like so:
d.rng1 = rand.(Normal.(d.mean, d.std))
│ Row │ id │ type │ mean │ std │ rng1 │
│ │ String │ String │ Float64 │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┼──────────┤
│ 1 │ A │ typeA │ 0.5 │ 0.2 │ 0.265455 │
│ 2 │ B │ typeA │ 0.5 │ 0.2 │ 0.59307 │
│ 3 │ C │ typeB │ 0.3 │ 0.1 │ 0.310257 │
│ 4 │ D │ typeB │ 0.3 │ 0.1 │ 0.305229 │
But how can I generate something like column rng2 below, where all rows that share the same value in the type column get the same random number? Is there a (fast) way to do it without creating a subset?
│ Row │ id │ type │ mean │ std │ rng1 │ rng2 │
│ │ String │ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼────────┼────────┼─────────┼─────────┼──────────┼──────────┤
│ 1 │ A │ typeA │ 0.5 │ 0.2 │ 0.265455 │ 0.265455 │
│ 2 │ B │ typeA │ 0.5 │ 0.2 │ 0.59307 │ 0.265455 │
│ 3 │ C │ typeB │ 0.3 │ 0.1 │ 0.310257 │ 0.310257 │
│ 4 │ D │ typeB │ 0.3 │ 0.1 │ 0.305229 │ 0.310257 │
Thanks
Seed the RNG using the hash of the common key:
reproducible
julia> transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)))) => :rng)
50×2 DataFrame
Row │ type rng
│ String Float64
─────┼───────────────────
1 │ type0 0.878381
2 │ type5 0.92784
3 │ type6 0.625461
4 │ type7 0.141122
5 │ type8 0.847776
6 │ type8 0.847776
7 │ type8 0.847776
8 │ type4 0.841166
9 │ type4 0.841166
10 │ type2 0.905406
random
julia> salt=round(Int, 10000rand()); transform!(d, :type => ByRow(t->rand(MersenneTwister(hash(t)+salt))) => :rng)
50×2 DataFrame
Row │ type rng
│ String Float64
─────┼───────────────────
1 │ type0 0.105401
2 │ type5 0.165727
3 │ type6 0.261818
4 │ type7 0.0470375
5 │ type8 0.56809
6 │ type8 0.56809
7 │ type8 0.56809
8 │ type4 0.399268
9 │ type4 0.399268
10 │ type2 0.533638
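The same seeding idea can be combined with the Normal(mean, std) draw from the original question. The following is only a sketch: the helper name `draw` and the output column `rng2` are chosen here for illustration, and it assumes that rows sharing a `type` also share `mean` and `std`, as in the example table. As far as I know, `hash` values are not guaranteed to stay the same across Julia versions, so the numbers are reproducible only within a given setup.

```julia
using DataFrames, Distributions, Random

# Table from the original question
d = DataFrame(id   = ["A", "B", "C", "D"],
              type = ["typeA", "typeA", "typeB", "typeB"],
              mean = [0.5, 0.5, 0.3, 0.3],
              std  = [0.2, 0.2, 0.1, 0.1])

# Hypothetical helper: seed a fresh generator from the hash of the group key,
# then draw once from that row's Normal distribution.
# Rows with the same key (and the same mean/std) therefore get the same value.
draw(t, m, s) = rand(MersenneTwister(hash(t)), Normal(m, s))

transform!(d, [:type, :mean, :std] => ByRow(draw) => :rng2)
```

No grouping is needed here because the key itself determines the seed, at the cost of constructing one generator per row.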
nilshg
December 17, 2021, 8:59am
4
If I understand you correctly, you just want to draw one random number per group (rather than per row) and have it show up in all rows of that group? If so, this should do it:
julia> transform!(groupby(df, :type), :id => (x -> rand()) => :rng)
4×3 DataFrame
Row │ id type rng
│ String Int64 Float64
─────┼─────────────────────────
1 │ A 1 0.678716
2 │ B 1 0.678716
3 │ C 2 0.71421
4 │ D 2 0.71421
bkamins
December 17, 2021, 10:51am
5
Yes, or:
transform!(groupby(df, :type), [] => rand => :rng)
which is a bit shorter to type.
And if I wanted to draw from the distributions defined by the mean and std columns, i.e.
Normal.(d.mean, d.std)
how would you rewrite
transform!(groupby(df, :type), [] => rand => :rng)
?
Thanks a lot for the help
nilshg
December 17, 2021, 11:29am
7
Oh wow, there are always new things in the minilanguage to discover! I had tried :id => rand => :rng initially, but that of course won't work because it essentially calls rand(x) on each subgroup vector x (so it samples a random id within the group).
Is the use of [] as a column selector documented somewhere?
nilshg
December 17, 2021, 11:33am
8
You can do
transform!(groupby(df, :type), [:mean, :std] => ((mean, std) -> rand(Normal(first(mean), first(std)))) => :rng)
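For reference, here is a self-contained sketch of that call run against the table from the original question. The output column name `rng2` and the argument names `m` and `s` are chosen here for illustration; the sketch assumes, as in the example data, that `mean` and `std` are constant within each `type` group, which is why taking `first` of each is enough.

```julia
using DataFrames, Distributions

# Table from the original question
df = DataFrame(id   = ["A", "B", "C", "D"],
               type = ["typeA", "typeA", "typeB", "typeB"],
               mean = [0.5, 0.5, 0.3, 0.3],
               std  = [0.2, 0.2, 0.1, 0.1])

# The function receives the group's mean and std columns as vectors.
# It returns a single number, which is then repeated for every row of the group.
transform!(groupby(df, :type),
           [:mean, :std] => ((m, s) -> rand(Normal(first(m), first(s)))) => :rng2)
```

Naming the arguments `m` and `s` rather than `mean` and `std` avoids shadowing the functions of the same name inside the anonymous function.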
bkamins
December 17, 2021, 11:56am
9
It is just a vector selector, like e.g. [:a, :b] or any other column selector. It just happens to be an empty vector.
If it feels unintuitive to you, use:
transform!(groupby(df, :type), Cols() => rand => :rng)
and now I hope it is clear that Cols() selects no columns (I use [] as it is shorter to type).
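A small sketch comparing the two spellings on a toy table (the data and the variable names `a` and `b` are made up for illustration). Since no columns are selected, `rand` is called with no arguments once per group, and the resulting scalar is repeated for each row of that group.

```julia
using DataFrames

df = DataFrame(id = ["A", "B", "C", "D"], type = [1, 1, 2, 2])

# Empty vector as the column selector
a = transform(groupby(df, :type), [] => rand => :rng)

# Cols() with no arguments also selects no columns
b = transform(groupby(df, :type), Cols() => rand => :rng)

# Each result has one value per group, shared by all rows of that group
a.rng[1] == a.rng[2] && b.rng[1] == b.rng[2]
```

The values in `a` and `b` differ from each other because each call draws fresh random numbers.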
nilshg
December 17, 2021, 11:57am
10
Makes sense - I guess I never came across
julia> select(df, [])
0×0 DataFrame
(as I suppose it is indeed a rarely used selector…)
bkamins
December 17, 2021, 12:17pm
11
or just:
julia> df[:, []]
0×0 DataFrame
Note that the same works with AbstractArrays:
julia> x = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> x[[]]
Int64[]
julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
1 2
3 4
julia> x[:, []]
2×0 Matrix{Int64}
so, as usual, you get what Julia Base ships.