Interesting. The scalar versions are about equally fast per element, at 4 / (17–22 ns) vs 8 / (41–45 ns), but the SIMD version is over 2x slower per element, at 4 per 9–10 ns vs 8 per 5–7 ns.
I guess this is likely mostly a Zen1 problem, like when we had to special-case reduction handling for Zen1 when doing the `sum` benchmarks.
For anyone reading who is unfamiliar with the problem: Zen1 has half-rate AVX2 (it didn't actually have 256-bit units, but 128-bit units working together to emulate 256 bits).
Hence, while scalar performance is similar, Zen1 takes almost twice as long to evaluate the SIMD code (9–10 vs 5–7 ns). And compared to a computer with AVX512, each evaluation of that SIMD code also does only half as much work (4 elements instead of 8).
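To make that concrete, here is a rough elements-per-nanosecond comparison using the midpoints of the timing ranges quoted above (just arithmetic on the quoted numbers, nothing re-measured):

```julia
# Throughput from the quoted timings (midpoints of the ranges).
scalar_zen1   = 4 / 19.5   # ≈ 0.21 elements/ns (4 per 17–22 ns)
scalar_avx512 = 8 / 43.0   # ≈ 0.19 elements/ns (8 per 41–45 ns) — about the same
simd_zen1     = 4 / 9.5    # ≈ 0.42 elements/ns (4 per 9–10 ns)
simd_avx512   = 8 / 6.0    # ≈ 1.33 elements/ns (8 per 5–7 ns)
simd_avx512 / simd_zen1    # ≈ 3.2x: half the vector width at ~1.6x the time per batch
```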
I guess the inlined version is at least approaching 2x faster.
Out of curiosity, does `avxmax(f, x)` match `avxmax_f` when you define `@inline f(i) = cos(i) * log(i)`?
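For context, a minimal sketch of what an `avxmax` like the one called below could look like (this definition is my assumption, not taken from the original post), assuming LoopVectorization's `@avx`:

```julia
using LoopVectorization

# Hypothetical generic max-reduction over f.(x); mirrors avxmax_f below,
# but takes the mapped function as an argument. The ::F type parameter
# forces specialization on the function.
function avxmax(f::F, x) where {F}
    m = -Inf
    @avx for i ∈ eachindex(x)
        m = max(m, f(x[i]))
    end
    m
end
```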
EDIT: I was benchmarking Mose's `f` that used `sin` earlier. It seems `cos` is faster:
```julia
julia> @inline f_inline(i) = cos(i) * log(i);

julia> @inline f(i) = cos(i) * log(i);

julia> @btime avxmax(f, 1:1000)
  1.940 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> @btime avxmax(f_inline, 1:1000)
  1.939 μs (0 allocations: 0 bytes)
6.9043363990506466

julia> function avxmax_f(x)
           m = -Inf
           @avx for i ∈ eachindex(x)
               m = max(m, cos(x[i]) * log(x[i]))
           end
           m
       end
avxmax_f (generic function with 1 method)

julia> @btime avxmax_f(1:1000)
  1.894 μs (0 allocations: 0 bytes)
6.9043363990506466
```
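As a quick sanity check (my addition, not part of the original benchmarks), the `@avx` reductions can be compared against Base's `maximum`, which should agree up to floating-point rounding:

```julia
# Reference result from Base, no SIMD; should match the values above
# up to rounding differences from the reduction order.
maximum(i -> cos(i) * log(i), 1:1000)
```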