Speed of sampling with replacement

Currently sampling with replacement in Julia seems to be significantly slower than in R (tested on Julia 0.6.0, 64bit). See e.g.:

> x <- runif(10^7)
> system.time(sample(x, replace=T))
   user  system elapsed 
   0.38    0.02    0.39 

vs.

julia> x = rand(10^7);

julia> @benchmark rand($x, 10^7)
BenchmarkTools.Trial:
  memory estimate:  76.29 MiB
  allocs estimate:  3
  --------------
  minimum time:     648.989 ms (0.03% GC)
  median time:      663.557 ms (1.52% GC)
  mean time:        675.775 ms (3.53% GC)
  maximum time:     724.599 ms (9.70% GC)
  --------------
  samples:          8
  evals/sample:     1

Sampling with replacement is a common operation so I believe it would be nice to speed it up.

I do not have a full solution, but a partial one could be to make function rand!(rng::AbstractRNG, A::AbstractArray, r::AbstractArray) check if length(r) is small (so that it fits into 32 bits) and then create RangeGenerator with a narrower domain. My preliminary tests show that this gives ~15% reduction of execution time.

Another approach could be:

julia> @benchmark $x[rand(1:10^7,10^7)]
BenchmarkTools.Trial:
  memory estimate:  152.59 MiB
  allocs estimate:  5
  --------------
  minimum time:     404.157 ms (2.95% GC)
  median time:      409.564 ms (3.99% GC)
  mean time:        420.688 ms (6.34% GC)
  maximum time:     479.673 ms (18.14% GC)
  --------------
  samples:          12
  evals/sample:     1

and

julia> @benchmark $x[rand(Int32(1):Int32(10^7),10^7)]
BenchmarkTools.Trial:
  memory estimate:  114.44 MiB
  allocs estimate:  5
  --------------
  minimum time:     329.700 ms (3.44% GC)
  median time:      338.070 ms (3.81% GC)
  mean time:        345.072 ms (6.22% GC)
  maximum time:     404.187 ms (20.81% GC)
  --------------
  samples:          15
  evals/sample:     1

Unfortunately they are more memory hungry (the last one is faster than R), but the question is why rand(x, 10^7) is so much slower?