When I benchmark the multiplication of two Float16 values, compared to the multiplication of two Float64 values, it seems to take the same time. Is this right?

No. Benchmarking nanosecond-scale operations is hard. On most CPUs, Float16 multiplication will be roughly 10x slower than Float64, since it is typically emulated by converting to Float32 and back. On an Apple M1 chip, Float16 should be a similar speed (but it depends on what you do with it, and I think it will only be fast on Julia 1.9).
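One way to benchmark the scalar case more reliably is the standard BenchmarkTools idiom of interpolating the operands through `Ref`, so the compiler cannot constant-fold the multiply away. A minimal sketch (variable names are illustrative):

```julia
using BenchmarkTools

a16 = Float16(1.5); b16 = Float16(2.5)
a64 = 1.5; b64 = 2.5

# Wrap operands in Ref and interpolate with $ so the
# multiplication is actually measured, not folded at compile time:
@btime $(Ref(a16))[] * $(Ref(b16))[]
@btime $(Ref(a64))[] * $(Ref(b64))[]
```

Without the `Ref` trick, both timings can collapse to the same sub-nanosecond value and tell you nothing about the actual cost of the operation.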


For a long vector of these floats, performance (here on 11th Gen Intel) is quite similar, but `LoopVectorization` can extract a bit more performance from the lower precision:

```
(v1.8) julia> using LoopVectorization, BenchmarkTools
(v1.8) julia> f(x,y,z) = z .= x .* y
f (generic function with 2 methods)
(v1.8) julia> f2(x,y,z) = @turbo z .= x .* y
f2 (generic function with 1 method)
(v1.8) julia> x = rand(Float16, 10^6); y = rand(Float16, 10^6); z = similar(x);
(v1.8) julia> @btime f($x,$y,$z);
1.090 ms (0 allocations: 0 bytes)
(v1.8) julia> @btime f2($x,$y,$z);
887.598 μs (0 allocations: 0 bytes)
(v1.8) julia> x = rand(Float64, 10^6); y = rand(Float64, 10^6); z = similar(x);
(v1.8) julia> @btime f($x,$y,$z);
1.048 ms (0 allocations: 0 bytes)
(v1.8) julia> @btime f2($x,$y,$z);
1.070 ms (0 allocations: 0 bytes)
(v1.8) julia> x = rand(Float32, 10^6); y = rand(Float32, 10^6); z = similar(x);
(v1.8) julia> @btime f($x,$y,$z);
274.767 μs (0 allocations: 0 bytes)
(v1.8) julia> @btime f2($x,$y,$z);
273.939 μs (0 allocations: 0 bytes)
```

Not much, though; I was expecting a significant speedup due to SIMD here, similar to or greater than what one gets with 32 bits.
