Surprising benchmark results for a basic mixed precision example

I started experimenting with mixed precision and benchmarked a basic example as follows:

using BenchmarkTools
using Quadmath
using LinearAlgebra
a16 = Vector{Float16}(undef, 100_000_000);
a32 = Vector{Float32}(undef, 100_000_000);
a64 = Vector{Float64}(undef, 100_000_000);
a128 = Vector{Float128}(undef, 100_000_000);
@benchmark norm(a16)
@benchmark norm(a32)
@benchmark norm(a64)
@benchmark norm(a128)

I get the following median times:
862.683 ms for a16
25.218 ms for a32
47.776 ms for a64
1.143 s for a128

I tried it several times, including with zeros instead of undef, and I keep getting similar results. Aren’t these results surprising? Why does norm(a16) take so much longer than norm(a32) and norm(a64)? And isn’t norm(a128) also rather slow compared to norm(a32) and norm(a64)?
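
For completeness, this is the interpolated form of the benchmark calls; BenchmarkTools recommends $-interpolating global variables, although with 100-million-element arrays the per-call overhead should be negligible either way:

@benchmark norm($a16)
@benchmark norm($a32)
@benchmark norm($a64)
@benchmark norm($a128)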

There’s no hardware support for Float16 arithmetic on your CPU, so it’s a slow software emulation: Julia essentially converts to and from Float32 around each operation, which is why norm(a16) ends up so much slower than norm(a32).
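
If you mainly need the result (rather than a benchmark of Float16 arithmetic itself), one workaround is to accumulate in Float32, so the Float16 elements are only widened, never operated on in half precision. A minimal sketch; norm32 is just a hypothetical helper name, not anything from LinearAlgebra:

using LinearAlgebra

# Hypothetical helper: sum of squares with a Float32 accumulator, so each Float16
# element is converted once and all arithmetic happens in Float32.
norm32(v::AbstractVector{Float16}) = sqrt(sum(x -> abs2(Float32(x)), v; init = 0.0f0))

# norm32(a16) should agree with norm(a16) up to rounding; it skips the overflow-guarding
# scaling that norm does, which is not an issue for Float16-range inputs.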

Again, no hardware support: Quadmath’s Float128 is implemented in software (via libquadmath), so it’s a slower software emulation as well. If you need more precision than Float64, you might want to look into double-double arithmetic, e.g. DoubleDouble.jl.
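
For a concrete feel of the double-double idea, here is a minimal sketch using DoubleFloats.jl and its Double64 type (a related package in the same spirit as DoubleDouble.jl; whether it fits your use case is something to check):

using DoubleFloats   # provides Double64, a double-double type with roughly twice Float64's precision
using LinearAlgebra

# Small vector just to show the type in action; Double64 arithmetic is still software,
# but it builds on native Float64 operations, so it is usually much faster than a full
# Float128 emulation.
a_dd = Double64.(rand(1_000))

norm(a_dd)   # generic LinearAlgebra norm, computed in double-double arithmetic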