Trivial code change causes vectorization failure


#1

I am porting a few approximate math functions that I’ve used before and that vectorized well, but I noticed that in Julia the direct translation fails to vectorize. However, a trivial change (substituting f * f for s at both of its uses) makes Julia produce nicely vectorized LLVM IR.

Is Julia’s optimizer just that fragile, or is there a reason why fastlog1 is more difficult to optimize? For comparison, Rust is able to optimize both versions. I’ve tried using explicit muladd() calls, but the results are the same.

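@fastmath function fastlog1(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    # 1059760811 == 0x3f2aaaab (about 2/3); the mask -8388608 == 0xff800000 keeps sign and exponent bits
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)  # mantissa rescaled into [2/3, 4/3)
    i = e * 1.19209290f-7             # exponent as a float (1.19209290f-7 == 2^-23)
    f = m - 1f0
    s = f * f
    # polynomial approximation of log(1 + f), evaluated in two halves
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * s + t
    r = r * s + f
    i * 0.693147182f0 + r             # exponent * ln(2) + log(m)
end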

@fastmath function fastlog2(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    #s = f * f
    r = 0.230836749f0 * f + -0.279208571f0
    t = 0.331826031f0 * f + -0.498910338f0
    r = r * (f*f) + t  # replaced s here
    r = r * (f*f) + f  # and here
    i * 0.693147182f0 + r
end
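For reference, the explicit muladd() variant mentioned above could be written roughly as follows. This is only a sketch: the name fastlog_muladd is made up for illustration, and dropping @fastmath here is an assumption, not necessarily what was tried.

```julia
# Hypothetical muladd-based variant of fastlog1 (same constants as above).
function fastlog_muladd(x::Float32)::Float32
    xi = reinterpret(Int32, x)
    e = (xi - Int32(1059760811)) & Int32(-8388608)
    m = reinterpret(Float32, xi - e)
    i = e * 1.19209290f-7
    f = m - 1f0
    s = f * f
    # muladd(a, b, c) computes a*b + c and lets the compiler fuse it when profitable
    r = muladd(0.230836749f0, f, -0.279208571f0)
    t = muladd(0.331826031f0, f, -0.498910338f0)
    r = muladd(r, s, t)
    r = muladd(r, s, f)
    muladd(i, 0.693147182f0, r)
end

println(fastlog_muladd(1f0))         # log(1)
println(fastlog_muladd(Float32(ℯ)))  # close to 1
```

Whether this version vectorizes would still have to be checked the same way as the other two.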

@fastmath function test(f)
    s = 0.0
    for i in Int32(1):Int32(1_000_000_000)
        s += f(Float32(i))
    end
    s
end

julia> @btime test(fastlog1)
  2.366 s (0 allocations: 0 bytes)
1.9723269760895107e10

julia> @btime test(fastlog2)
  524.180 ms (0 allocations: 0 bytes)
1.9723269761215004e10
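One way to check whether a timing gap like this comes from vectorization is to capture the optimized LLVM IR and look for vector types. The sketch below is a rough heuristic, not a definitive test: kernel is a hypothetical stand-in (substitute fastlog1 or fastlog2), driver mirrors the test function with a configurable trip count, and llvm_ir and has_vector_ops are helper names invented here. The result depends on the CPU and the Julia version.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical stand-in kernel; replace with fastlog1 or fastlog2.
kernel(x::Float32) = x * x + 1f0

# Same shape as `test`, but with the trip count passed in.
@fastmath function driver(f, n::Int32)
    s = 0.0
    for i in Int32(1):n
        s += f(Float32(i))
    end
    s
end

# Capture the optimized LLVM IR as a string instead of printing it.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

# Heuristic: SIMD code shows up as LLVM vector types such as `<8 x float>`.
has_vector_ops(ir::AbstractString) = occursin(r"<\d+ x (float|double)>", ir)

ir = llvm_ir(driver, (typeof(kernel), Int32))
println("vector ops found: ", has_vector_ops(ir))
```

Running the same check with fastlog1 and fastlog2 in place of kernel should show whether only one of the two loops contains vector instructions.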

#2

Good question. If nothing else, it seems like dead code elimination may be run too early. Edit: I misunderstood the difference, thanks.


#3

Absent any more comments, I’d file an issue about this.


#4

Filed: