Why is `sort(x, by = _ -> rand())` not a good shuffler?

I learned the hard way that sort(x, by = _ -> rand()) is not a shuffler, i.e. x is not random enough after this "shuffle".

Using Random.shuffle seems to be the correct way.

But why is sort(x, by = _ -> rand()) not that great at shuffling?

Is it because rand() runs so fast that successive elements get the same random number? That can't be the case, since every call to rand() generates a different number unless the random seed is reset.

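using Random, Statistics  # shuffle comes from Random, mean from Statistics

# 1000 "shuffles" of 1:16 via sort with a random key, stacked as rows
res = mapreduce(vcat, 1:1000) do _
    reshape(sort(1:16, by = _ -> rand()), 1, :)
end
mean(res[:, 1]) # 2.5

# 1000 shuffles of 1:16 via Random.shuffle, stacked as rows
res2 = mapreduce(vcat, 1:1000) do _
    reshape(shuffle(1:16), 1, :)
end
mean(res2[:, 1]) # 8.342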

To see the effect, consider the above, where I shuffled the numbers 1:16 using the two methods and calculated the mean of the first element. A uniform shuffle should give a mean close to 8.5, so clearly the result of the first method is too low, meaning the small numbers are not moved toward the end often enough.

I guess that the problem is precisely that the result of rand changes on each call.

If you create a random column once and sort by this column, it should be fine.
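A minimal sketch of that idea, assuming one random key per element is drawn up front and then held fixed while sorting (the names x, randkeys and shuffled are just illustrative):

```julia
using Random

x = collect(1:16)

# Draw one random key per element, exactly once.
randkeys = rand(length(x))

# sortperm returns the permutation that sorts the fixed keys;
# indexing x with it reorders x by those keys.
shuffled = x[sortperm(randkeys)]
```

Because the keys no longer change between comparisons, the sort sees a consistent ordering. Random.shuffle is still the simpler choice in practice.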

But since rand gives you something new on each comparison in the sort algorithm, each element does a "random walk": sometimes it moves toward the front and sometimes it moves back, and on average it ends up more or less where it started.
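One way to check this guess is to count how often the by function is actually called. This is only a sketch: calls and noisy_key are hypothetical names, and it assumes Base's sort evaluates by on both operands of every comparison, which is my understanding of current Julia versions but may depend on the version.

```julia
# Counter for how many times the key function is evaluated.
calls = Ref(0)

# Same random key as in the question, but it also bumps the counter.
noisy_key = _ -> (calls[] += 1; rand())

sort(collect(1:16), by = noisy_key)

# If the key were computed once per element this would print 16;
# a larger number means the key is recomputed during comparisons.
println(calls[])
```

If the printed count is well above 16, the same element is getting a fresh random key every time it is compared, so the comparisons are not consistent with any single ordering.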


That makes sense, since rand() is run on every comparison and the key is not fixed. For some reason I had thought that the random number only got generated once per element.