Fast floating point quantisation / rounding

jackf · May 14, 2020, 12:12am

I have a large number of Float32s in an array. I want to apply the transformation f(x) = round(Int16, x * a + b, RoundToZero) where a and b are constants. I also need to apply the inverse transform (but multiplication by 1/a is tolerable, rather than division by a). All of the input values are known to be bounded to within the range of valid Int16s after the linear transform, so overflow / erroring from unrepresentability / etc are not an issue (and I would like to use this knowledge to increase throughput).

Ignoring the linear transform for now, does anyone have ideas for how to do the rounding part super efficiently? Simple test code like:

function roundtest2(x, out)
    @inbounds @simd for i in eachindex(x)
        out[i] = round(Int16, x[i], RoundToZero)
    end
    out
end

works correctly, but when looking at the LLVM IR there doesn’t seem to be any SIMD going on. Does anyone know of an intrinsic that would speed up this kind of cast? Happy to code to a particular platform / get nasty - this code will only run on AVX512 Intel CPUs.

ffevotte · May 14, 2020, 7:29am

If you know your values to be representable, you can use unsafe_trunc, which elides the preliminary checks performed by round:

function unsafe_roundtest!(out, x)
    @inbounds @simd for i in eachindex(x)
        out[i] = unsafe_trunc(Int16, x[i])
    end
end
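
Since unsafe_trunc gives an undefined result for values that are not representable, it can be worth validating the data once (for instance in a test or a debug run) before relying on it. A minimal sketch, where `fits_int16` is a hypothetical helper name and not something from Base:

```julia
# Hypothetical helper (not part of Base): true if every element is finite
# and truncates toward zero to a value inside the Int16 range.
fits_int16(x) = all(v -> isfinite(v) && typemin(Int16) <= trunc(v) <= typemax(Int16), x)

x = Float32[-32768.9, -1.5, 0.0, 32767.9]

# Pay for the check once, then use the unchecked conversion.
fits_int16(x) || error("input contains values that do not fit in an Int16")
q = unsafe_trunc.(Int16, x)
```

The check costs an extra pass over the data, so it is probably only useful outside the hot path.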

Mandatory micro-benchmark:

function roundtest!(out, x)
    @inbounds @simd for i in eachindex(x)
        out[i] = round(Int16, x[i], RoundToZero)
    end
end

function unsafe_roundtest!(out, x)
    @inbounds @simd for i in eachindex(x)
        out[i] = unsafe_trunc(Int16, x[i])
    end
end 

using BenchmarkTools

x = 1000*(rand(Float32, 1000) .- 1);
@show length(x)

out1 = similar(x, Int16);
@info "Benchmarking roundtest!"
@btime roundtest!($out1, $x)

out2 = similar(x, Int16);
@info "Benchmarking unsafe_roundtest!"
@btime unsafe_roundtest!($out2, $x)

@assert out1 == out2

length(x) = 1000
[ Info: Benchmarking roundtest!
  1.095 μs (0 allocations: 0 bytes)
[ Info: Benchmarking unsafe_roundtest!
  96.677 ns (0 allocations: 0 bytes)
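
The same idea should extend to the full transform from the question, f(x) = round(Int16, x * a + b, RoundToZero), and to its inverse using a precomputed 1/a. The following is only a sketch: `quantise!`, `dequantise!` and `inv_a` are hypothetical names, and it assumes, as stated in the question, that x * a + b always lands inside the Int16 range.

```julia
# Hypothetical names; `a` and `b` follow the notation of the question.
function quantise!(out::AbstractVector{Int16}, x::AbstractVector{Float32},
                   a::Float32, b::Float32)
    @inbounds @simd for i in eachindex(out, x)
        # unsafe_trunc rounds toward zero and skips the range check, so the
        # result is undefined if x[i] * a + b does not fit in an Int16.
        out[i] = unsafe_trunc(Int16, x[i] * a + b)
    end
    return out
end

# Inverse transform: multiplies by a precomputed 1/a instead of dividing by a.
function dequantise!(x::AbstractVector{Float32}, q::AbstractVector{Int16},
                     inv_a::Float32, b::Float32)
    @inbounds @simd for i in eachindex(x, q)
        x[i] = (Float32(q[i]) - b) * inv_a
    end
    return x
end

x = Float32[-3.7, -0.2, 0.0, 1.5, 12.9]
a, b = 2.0f0, 1.0f0
q = quantise!(similar(x, Int16), x, a, b)
y = dequantise!(similar(x), q, inv(a), b)
```

The round trip is lossy by construction: each element of `y` differs from the matching element of `x` by less than 1/a. Whether the compiler vectorizes these loops as well as the plain conversion would need to be checked on the target machine.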

ffevotte · May 14, 2020, 7:49am

Note that IIUC, LLVM will efficiently vectorize unsafe_roundtest! above: looking for vcvttps2dq instructions on ymm registers (this is on an AVX2 machine), it looks like the loop was both vectorized and unrolled four times, so each iteration converts 32 elements:

x = rand(Float32, 10)
out = similar(x, Int16)
@code_native unsafe_roundtest!(out, x)

output

        .text
; ┌ @ essai.jl:8 within `unsafe_roundtest!'
        movq    %rsi, -8(%rsp)
        movq    8(%rsi), %rcx
; │┌ @ simdloop.jl:69 within `macro expansion'
; ││┌ @ abstractarray.jl:212 within `eachindex'
; │││┌ @ abstractarray.jl:95 within `axes1'
; ││││┌ @ abstractarray.jl:75 within `axes'
; │││││┌ @ array.jl:155 within `size'
        movq    24(%rcx), %rax
; │││││└
; │││││┌ @ tuple.jl:157 within `map'
; ││││││┌ @ range.jl:320 within `OneTo' @ range.jl:311
; │││││││┌ @ promotion.jl:409 within `max'
        testq   %rax, %rax
; ││└└└└└└
; ││ @ simdloop.jl:72 within `macro expansion'
        jle     L226
        movq    %rax, %rdx
        sarq    $63, %rdx
        andnq   %rax, %rdx, %rax
        movq    (%rsi), %rdx
        movq    (%rcx), %rcx
        movq    (%rdx), %rdx
; ││ @ simdloop.jl:75 within `macro expansion'
        cmpq    $32, %rax
        jae     L56
        xorl    %esi, %esi
        jmp     L208
; ││ @ simdloop.jl:75 within `macro expansion'
L56:
        leaq    (%rcx,%rax,4), %rsi
        cmpq    %rsi, %rdx
        jae     L81
        leaq    (%rdx,%rax,2), %rsi
; ││ @ simdloop.jl:75 within `macro expansion'
        cmpq    %rsi, %rcx
        jae     L81
        xorl    %esi, %esi
        jmp     L208
L81:
        movabsq $9223372036854775776, %rsi # imm = 0x7FFFFFFFFFFFFFE0
; ││ @ simdloop.jl:75 within `macro expansion'
        andq    %rax, %rsi
        xorl    %edi, %edi
; ││ @ simdloop.jl:77 within `macro expansion' @ essai.jl:9
; ││┌ @ float.jl:309 within `unsafe_trunc'
L96:
        vcvttps2dq      (%rcx,%rdi,4), %ymm0
        vextracti128    $1, %ymm0, %xmm1
        vpackssdw       %xmm1, %xmm0, %xmm0
        vcvttps2dq      32(%rcx,%rdi,4), %ymm1
        vextracti128    $1, %ymm1, %xmm2
        vpackssdw       %xmm2, %xmm1, %xmm1
        vcvttps2dq      64(%rcx,%rdi,4), %ymm2
        vextracti128    $1, %ymm2, %xmm3
        vpackssdw       %xmm3, %xmm2, %xmm2
        vcvttps2dq      96(%rcx,%rdi,4), %ymm3
        vextracti128    $1, %ymm3, %xmm4
        vpackssdw       %xmm4, %xmm3, %xmm3
; ││└
; ││┌ @ array.jl:826 within `setindex!'
        vmovdqu %xmm0, (%rdx,%rdi,2)
        vmovdqu %xmm1, 16(%rdx,%rdi,2)
        vmovdqu %xmm2, 32(%rdx,%rdi,2)
        vmovdqu %xmm3, 48(%rdx,%rdi,2)
; ││└
; ││ @ simdloop.jl:78 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
        addq    $32, %rdi
        cmpq    %rdi, %rsi
        jne     L96
; │└└
; │┌ @ int.jl within `macro expansion'
        cmpq    %rsi, %rax
; │└
; │┌ @ simdloop.jl:75 within `macro expansion'
        je      L226
        nopw    %cs:(%rax,%rax)
        nop
; ││ @ simdloop.jl:77 within `macro expansion' @ essai.jl:9
; ││┌ @ float.jl:309 within `unsafe_trunc'
L208:
        vcvttss2si      (%rcx,%rsi,4), %edi
; ││└
; ││┌ @ array.jl:826 within `setindex!'
        movw    %di, (%rdx,%rsi,2)
; ││└
; ││ @ simdloop.jl:78 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
        addq    $1, %rsi
; ││└
; ││ @ simdloop.jl:75 within `macro expansion'
; ││┌ @ int.jl:49 within `<'
        cmpq    %rax, %rsi
; ││└
        jb      L208
; └└
; ┌ @ simdloop.jl within `unsafe_roundtest!'
L226:
        movabsq $jl_system_image_data, %rax
; └
; ┌ @ essai.jl:8 within `unsafe_roundtest!'
        vzeroupper
        retq
; └
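
Rather than reading through the listing by eye, the same check can be scripted. A rough sketch, assuming an x86 machine (the mnemonics searched for are x86-specific, and the counts will depend on the CPU and on the Julia/LLVM version):

```julia
using InteractiveUtils  # provides code_native outside the REPL

function unsafe_roundtest!(out, x)
    @inbounds @simd for i in eachindex(x)
        out[i] = unsafe_trunc(Int16, x[i])
    end
end

# Capture the generated assembly as a string instead of printing it.
asm = sprint(io -> code_native(io, unsafe_roundtest!,
                               (Vector{Int16}, Vector{Float32});
                               debuginfo = :none))

# Count lines with the packed (vector) and the scalar truncating conversion.
asm_lines = split(asm, '\n')
n_packed = count(l -> occursin("vcvttps2dq", l), asm_lines)
n_scalar = count(l -> occursin("vcvttss2si", l), asm_lines)
println("packed conversions: ", n_packed, ", scalar conversions: ", n_scalar)
```

A non-zero packed count suggests the main loop was vectorized; the scalar conversion is expected to remain for the remainder loop.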

(I guess all I’m saying here is that I wouldn’t know how to get faster than that. But here on Discourse, you never know: someone might very well prove me wrong in the next post.)

jackf · May 17, 2020, 8:51am

Fantastic, thank you! I hadn’t noticed unsafe_trunc.
