I am porting a few approximate math functions that I've used before and that vectorized well, but I noticed that the direct translation to Julia fails to vectorize. However, a trivial change (substituting `f*f` for the temporary `s = f * f` at its two use sites) makes Julia produce nicely vectorized LLVM IR.

Is Julia's optimizer just that fragile, or is there a reason why `fastlog1` is harder to optimize? As a comparison, Rust is able to optimize both versions. I've also tried explicit `muladd()`s (sketched below, after the two functions), but the results are the same.
```julia
@fastmath function fastlog1(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    # Range reduction: split x into m * 2^i with m in [2/3, 4/3).
    # 1059760811 is the bit pattern of 2f0/3f0; -8388608 (0xFF800000)
    # keeps only the sign and exponent bits.
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7  # e / 2^23: the exponent as a Float32
    f = m - 1f0            # f in [-1/3, 1/3)
    s = f * f
    # Degree-5 polynomial approximation of log(1 + f)
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * s + t
    r = r * s + f
    i * 0.693147182f0 + r  # i * log(2) + log(m)
end
```
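For context, the constants implement a standard range reduction: `1059760811` is the bit pattern of `Float32(2/3)`, so `m` lands in `[2/3, 4/3)`, the polynomial in `f = m - 1f0` approximates `log(1 + f)`, and the result is assembled as `i * log(2) + log(m)`. A quick REPL check of the magic numbers (my own verification, not part of the port):

```julia
julia> reinterpret(Int32, 2f0/3f0)  # the exponent bias
1059760811

julia> 1.19209290f-7 == Float32(2.0^-23)  # scales exponent bits to an integer-valued Float32
true

julia> 0.693147182f0 == Float32(log(2))  # ln(2) for the final scaling
true
```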
```julia
@fastmath function fastlog2(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    #s = f * f
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * (f*f) + t  # replaced s here
    r = r * (f*f) + f  # and here
    i * 0.693147182f0 + r
end
```
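For completeness, this is the kind of `muladd` rewrite I tried (the name `fastlog1_muladd` is mine; for me it generates the same scalar code as `fastlog1`):

```julia
@fastmath function fastlog1_muladd(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    s = f * f
    # muladd(a, b, c) == a*b + c, free to contract to an fma
    r = muladd(0.230836749f0, f, -0.279208571f0)
    t = muladd(0.331826031f0, f, -0.498910338f0)
    r = muladd(r, s, t)
    r = muladd(r, s, f)
    muladd(i, 0.693147182f0, r)
end
```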
```julia
@fastmath function test(f)
    s = 0.0
    for i in Int32(1):Int32(1_000_000_000)
        s += f(Float32(i))
    end
    s
end
```
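The difference shows up directly in the generated IR: with `fastlog2` the loop body contains vector operations (e.g. `<8 x float>` on an AVX2 machine), while with `fastlog1` it stays scalar. To reproduce the comparison (`@code_llvm` is in InteractiveUtils, which the REPL loads by default; `@btime` below is from BenchmarkTools):

```julia
using InteractiveUtils  # only needed outside the REPL

# Dump the optimized LLVM IR of both loops and look for
# <N x float> vector types in the loop body.
@code_llvm debuginfo=:none test(fastlog1)
@code_llvm debuginfo=:none test(fastlog2)
```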
```julia
julia> @btime test(fastlog1)
  2.366 s (0 allocations: 0 bytes)
1.9723269760895107e10

julia> @btime test(fastlog2)
  524.180 ms (0 allocations: 0 bytes)
1.9723269761215004e10
```