Float32.(randn(dims)) or randn(Float32, dims)

I found a behaviour of randn that I didn’t really understand. Since there is a version where you can select the element type directly, I assumed it would be more efficient than generating Float64 values and then converting, but it seems this depends on the size of the generated data.

Here f1 uses the conversion and f2 the direct call. f1 is clearly slower when generating only 10 numbers, but a little faster when generating 2000. Does anyone have a clue why?

julia> using BenchmarkTools

julia> f1(dims) = Float32.(randn(dims))
f1 (generic function with 2 methods)

julia> f2(dims) = randn(Float32, dims)
f2 (generic function with 2 methods)

julia> @benchmark f1(10)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
 Range (min … max):  107.929 ns …  1.170 μs  ┊ GC (min … max): 0.00% … 84.78%
 Time  (median):     118.046 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   124.795 ns ± 55.602 ns  ┊ GC (mean ± σ):  3.12% ±  6.36%

  ▃▆▇█▇▅▃▂ ▁                                                   ▂
  ███████████▇▇▄▆▄▅▅▄▄▄▃▁▁▃▄▁▁▃▁▁▁▁▁▁▁▃▁▃▁▄▆▆▆▇▆▆▅▅▆▅▅▆▅▃▆▄▄▃▆ █
  108 ns        Histogram: log(frequency) by time       281 ns <

 Memory estimate: 288 bytes, allocs estimate: 2.

julia> @benchmark f2(10)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  71.681 ns … 770.913 ns  ┊ GC (min … max): 0.00% … 88.68%
 Time  (median):     78.347 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   83.177 ns ±  34.070 ns  ┊ GC (mean ± σ):  1.95% ±  4.54%

   ▁▃▇█▆▅▄▃▁▁                                                  ▂
  ▇██████████▇▇▆▆▆▅▆▅▆▇▆▆▆▅▄▄▄▃▁▁▃▄▁▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▄▆▆▃▃▆▆▅▆▇▇ █
  71.7 ns       Histogram: log(frequency) by time       161 ns <

 Memory estimate: 128 bytes, allocs estimate: 1.

julia> @benchmark f1(2000)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  6.650 μs … 233.773 μs  ┊ GC (min … max): 0.00% … 80.82%
 Time  (median):     7.554 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.138 μs ±   8.375 μs  ┊ GC (mean ± σ):  4.01% ±  3.77%

         ▁▄▆▇██▆▄▃▁                                            
  ▁▁▁▂▂▃▇██████████▇▆▆▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.65 μs         Histogram: frequency by time        10.6 μs <

 Memory estimate: 23.69 KiB, allocs estimate: 2.

julia> @benchmark f2(2000)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.191 μs … 276.534 μs  ┊ GC (min … max): 0.00% … 86.09%
 Time  (median):     7.627 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.019 μs ±   6.315 μs  ┊ GC (mean ± σ):  1.70% ±  2.12%

   ▂▅▆███▇▇▆▅▄▄▄▃▄▃▃▃▃▂▂▂▁▁ ▁                                 ▃
  ▆████████████████████████████████▇▆▇▇▆▅▇▆▇▇█▇▇▇▇▆▇▇▇▇▆▇▇▆█▇ █
  7.19 μs      Histogram: log(frequency) by time      10.8 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

Just a guess, but maybe randn is a bit less efficient for Float32 than for Float64:

julia> @benchmark randn(Float32, 20000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  57.300 μs …  4.728 ms  ┊ GC (min … max): 0.00% … 98.41%
 Time  (median):     60.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   70.436 μs ± 83.550 μs  ┊ GC (mean ± σ):  3.27% ±  3.72%

    █▅           
  ▂███▇▃▂▂▂▁▂▂▁▁▁▂▂▂▂▂▂▂▂▁▂▂▄▅▅▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  57.3 μs         Histogram: frequency by time         112 μs <

 Memory estimate: 78.20 KiB, allocs estimate: 2.

julia> @benchmark randn(Float64, 20000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  49.600 μs …  3.921 ms  ┊ GC (min … max): 0.00% … 97.23%
 Time  (median):     52.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   65.426 μs ± 90.596 μs  ┊ GC (mean ± σ):  5.09% ±  4.92%

  ▃██▅▁                         ▃▄▄▄▃▂▁                       ▂
  █████▇▅▃▄▁▅▆▅▆▄▅▄▄▃▅▃▄▁▃▃▅▁▅▁▅████████▇▆▆▆▅▆▅▄▅▅▅▅▆▆▆▆▆▅▅▆▅ █
  49.6 μs      Histogram: log(frequency) by time       136 μs <

 Memory estimate: 156.33 KiB, allocs estimate: 2.

The difference is also comparable to the one in your experiment.


I agree that this must be the case; I was just curious why there is such a difference, especially to the degree that it is faster to generate Float64 random numbers and convert them to Float32 than to generate them as Float32 directly.
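The allocation difference between the two approaches can be seen directly (a minimal sketch; the variable names are mine, sizes match the benchmark output above):

```julia
# f1 allocates a Float64 intermediate plus the converted Float32 copy;
# f2 allocates only the Float32 array.
a = randn(2000)             # Float64 intermediate (16000 bytes of data)
b = Float32.(a)             # converted copy (8000 bytes)
c = randn(Float32, 2000)    # direct generation (8000 bytes, one allocation)
```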

Not sure why I didn’t check the source directly to begin with, but it is quite clear there: there is an optimization for MersenneTwister (the default RNG in 1.6) specifically in combination with Float64:

        # filling arrays
        function $randfun!(rng::AbstractRNG, A::AbstractArray{T}) where T
            for i in eachindex(A)
                @inbounds A[i] = $randfun(rng, T)
            end
            A
        end

        # optimization for MersenneTwister, which randomizes natively Array{Float64}
        function $randfun!(rng::MersenneTwister, A::Array{Float64})
            if length(A) < 13
                for i in eachindex(A)
                    @inbounds A[i] = $randfun(rng, Float64)
                end
            else
                rand!(rng, A, CloseOpen12())
                for i in eachindex(A)
                    @inbounds A[i] = $_randfun(rng, reinterpret(UInt64, A[i]))
                end
            end
            A
        end

Running the same comparison with, for example, the RNG type from StableRNGs, the randn(rng, Float32, dims) case comes out faster.


Curious to see how this changes in 1.7 with the new Xoshiro RNG. Does that change also affect floating-point generation?

I think the comparison of times in the posted example may be misleading, because the random number generators are called with different seeds every time. I have observed time differences between f1 and f2 of the same order of magnitude as between successive calls to f1, for instance.

It would be better if each @benchmark were preceded by Random.seed! with the same number.
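For reference, reseeding with the same number reproduces the exact same stream, so seeded benchmark runs draw identical random numbers (a minimal sketch):

```julia
using Random

Random.seed!(1234)
x1 = randn(Float32, 5)
Random.seed!(1234)      # same seed → same stream from the default RNG
x2 = randn(Float32, 5)
# x1 == x2
```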


Seems current master has something similar for Xoshiro and Float64.

From what I saw in the code, randn only very rarely branches out to some extra computation, so that is certainly a possibility, but one that should average out over many samples. Since @benchmark runs 10000 samples, I think looking at the minimum (or maybe the median) still gives a good estimate of how fast it can be. I did run some of them a few times and the numbers seemed consistent, but of course it never hurts to seed the test as you mention.
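The minimum-over-many-samples idea can be sketched without BenchmarkTools (a hypothetical helper of my own; BenchmarkTools does this far more carefully, with tuned evaluation counts and GC control):

```julia
# crude minimum-time estimate: the fastest of many runs approximates
# the cost with no GC pauses or rare-branch noise
function min_time(f; trials = 1_000)
    f()                    # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:trials
        t = @elapsed f()
        best = min(best, t)
    end
    best
end

t32 = min_time(() -> randn(Float32, 2000))
t64 = min_time(() -> randn(Float64, 2000))
```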