This may be a regression. On CUDA.jl v3.1.0,
julia> main() # first run
29.039000034332275
1.7300000190734863
1.61899995803833
On v3.3.0, all three timings in the first run are slow, and they only drop on the second run:
julia> main() # first run
80.84599995613098
61.09299993515015
63.625
julia> main() # second run
1.5520000457763672
1.7239999771118164
1.375999927520752