Found a behaviour of randn that I didn’t really understand. Since there is a method where you can select the element type, I assumed that it would be more efficient than generating Float64 values and then converting, but it seems that this depends on the size of the generated array.
Here f1 uses the conversion and f2 the direct call. f1 is clearly slower when generating only 10 numbers, but a little faster when generating 2000. Does anyone have a clue why this is?
julia> using BenchmarkTools
julia> f1(dims) = Float32.(randn(dims))
f1 (generic function with 2 methods)
julia> f2(dims) = randn(Float32, dims)
f2 (generic function with 2 methods)
julia> @benchmark f1(10)
BenchmarkTools.Trial: 10000 samples with 919 evaluations.
Range (min … max): 107.929 ns … 1.170 μs ┊ GC (min … max): 0.00% … 84.78%
Time (median): 118.046 ns ┊ GC (median): 0.00%
Time (mean ± σ): 124.795 ns ± 55.602 ns ┊ GC (mean ± σ): 3.12% ± 6.36%
▃▆▇█▇▅▃▂ ▁ ▂
███████████▇▇▄▆▄▅▅▄▄▄▃▁▁▃▄▁▁▃▁▁▁▁▁▁▁▃▁▃▁▄▆▆▆▇▆▆▅▅▆▅▅▆▅▃▆▄▄▃▆ █
108 ns Histogram: log(frequency) by time 281 ns <
Memory estimate: 288 bytes, allocs estimate: 2.
julia> @benchmark f2(10)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
Range (min … max): 71.681 ns … 770.913 ns ┊ GC (min … max): 0.00% … 88.68%
Time (median): 78.347 ns ┊ GC (median): 0.00%
Time (mean ± σ): 83.177 ns ± 34.070 ns ┊ GC (mean ± σ): 1.95% ± 4.54%
▁▃▇█▆▅▄▃▁▁ ▂
▇██████████▇▇▆▆▆▅▆▅▆▇▆▆▆▅▄▄▄▃▁▁▃▄▁▇▇▇▇▇▇▇▇▆▆▆▆▆▆▆▄▆▆▃▃▆▆▅▆▇▇ █
71.7 ns Histogram: log(frequency) by time 161 ns <
Memory estimate: 128 bytes, allocs estimate: 1.
julia> @benchmark f1(2000)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 6.650 μs … 233.773 μs ┊ GC (min … max): 0.00% … 80.82%
Time (median): 7.554 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.138 μs ± 8.375 μs ┊ GC (mean ± σ): 4.01% ± 3.77%
▁▄▆▇██▆▄▃▁
▁▁▁▂▂▃▇██████████▇▆▆▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
6.65 μs Histogram: frequency by time 10.6 μs <
Memory estimate: 23.69 KiB, allocs estimate: 2.
julia> @benchmark f2(2000)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
Range (min … max): 7.191 μs … 276.534 μs ┊ GC (min … max): 0.00% … 86.09%
Time (median): 7.627 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.019 μs ± 6.315 μs ┊ GC (mean ± σ): 1.70% ± 2.12%
▂▅▆███▇▇▆▅▄▄▄▃▄▃▃▃▃▂▂▂▁▁ ▁ ▃
▆████████████████████████████████▇▆▇▇▆▅▇▆▇▇█▇▇▇▇▆▇▇▇▇▆▇▇▆█▇ █
7.19 μs Histogram: log(frequency) by time 10.8 μs <
Memory estimate: 7.94 KiB, allocs estimate: 1.
I agree that this must be the case; I was just curious why there is such a difference, especially to the degree that it is faster to generate Float64 random numbers and convert them to Float32 than to generate them as Float32 directly.
Not sure why I didn’t check the source directly to begin with, but it is quite clear there. There is an optimization specifically for MersenneTwister (the default RNG in Julia 1.6) together with Float64:
# filling arrays
function $randfun!(rng::AbstractRNG, A::AbstractArray{T}) where T
    for i in eachindex(A)
        @inbounds A[i] = $randfun(rng, T)
    end
    A
end

# optimization for MersenneTwister, which randomizes natively Array{Float64}
function $randfun!(rng::MersenneTwister, A::Array{Float64})
    if length(A) < 13
        for i in eachindex(A)
            @inbounds A[i] = $randfun(rng, Float64)
        end
    else
        rand!(rng, A, CloseOpen12())
        for i in eachindex(A)
            @inbounds A[i] = $_randfun(rng, reinterpret(UInt64, A[i]))
        end
    end
    A
end
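The length-13 cutoff in the optimized method above can be poked at directly by benchmarking `randn!` just below and just above it (a rough sketch; the exact timings and whether the cutoff is visible at all depend on your machine and Julia version):

```julia
using BenchmarkTools, Random

rng = MersenneTwister(1234)

# length 12: below the cutoff, the scalar fallback loop runs
@btime randn!($rng, $(Vector{Float64}(undef, 12)));

# length 13: at the cutoff, the vectorized rand! + reinterpret path runs
@btime randn!($rng, $(Vector{Float64}(undef, 13)));
```

Note that this optimization is only hit for `Array{Float64}`, which is exactly why `Float32.(randn(dims))` can beat `randn(Float32, dims)` for large arrays: the Float32 version falls back to the generic scalar loop.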
Running the same thing with, for example, the RNG from StableRNGs, the randn(rng, Float32, dims) case will be the faster one.
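For completeness, this is what that comparison with a StableRNGs generator would look like (a sketch, assuming the StableRNGs package is installed; no Float64-specific fast path exists for this RNG, so the direct Float32 call should win at both sizes):

```julia
using StableRNGs, BenchmarkTools

rng = StableRNG(1)

g1(rng, dims) = Float32.(randn(rng, dims))   # generate Float64, then convert
g2(rng, dims) = randn(rng, Float32, dims)    # generate Float32 directly

@btime g1($rng, 2000);
@btime g2($rng, 2000);
```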
I think that the comparison of times in the posted example may be misleading, because the random number generators are called with different seeds every time. I have observed time differences between f1 and f2 of the same order of magnitude as between successive calls to f1, for instance.
It would be better if each @benchmark were preceded by Random.seed! with the same number.
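Concretely, the seeding could look like this (a sketch using the setup keyword of @benchmark, which re-runs the expression before each sample, so both functions consume the same stream from the global RNG):

```julia
using Random, BenchmarkTools

f1(dims) = Float32.(randn(dims))
f2(dims) = randn(Float32, dims)

# Re-seed the global RNG before every sample so f1 and f2
# are timed on identical random streams.
@benchmark f1(2000) setup=(Random.seed!(1234))
@benchmark f2(2000) setup=(Random.seed!(1234))
```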
From what I saw in the code, randn only very rarely branches out to extra computations, so this is certainly a possibility, but one that should average out over many samples. Since @benchmark runs 10000 samples, I think looking at the minimum (or maybe the median) still gives a good estimate of how fast it can be. I did run some of them a few times and the numbers seemed consistent then, but of course it can never hurt to seed the test as you mention.