# Fast floating point quantisation / rounding

I have a large number of Float32s in an array. I want to apply the transformation `f(x) = round(Int16, x * a + b, RoundToZero)` where a and b are constants. I also need to apply the inverse transform (but multiplication by 1/a is tolerable, rather than division by a). All of the input values are known to be bounded to within the range of valid Int16s after the linear transform, so overflow / erroring from unrepresentability / etc are not an issue (and I would like to use this knowledge to increase throughput).

Ignoring the linear transform for now, does anyone have ideas for how to do the rounding part super efficiently? Simple test code like:

``````function roundtest2(x, out)
@inbounds @simd for i in eachindex(x)
out[i] = round(Int16, x[i], RoundToZero)
end
out
end
``````

works correctly, but when looking at the llvm IR there doesnβt seem to be any SIMD or anything going on. Does anyone know of any intrinsic that would speed up this kind of casting? Happy to code to a particular platform / get nasty - this code will only run on AVX512 Intel cpus.

2 Likes

If you know your values to be representable, you can use `unsafe_trunc`, which elides the preliminary checks performed by `round`:

``````function unsafe_roundtest!(out, x)
@inbounds @simd for i in eachindex(x)
out[i] = unsafe_trunc(Int16, x[i])
end
end
``````
Mandatory micro-benchmark: (click to show the code)
``````function roundtest!(out, x)
@inbounds @simd for i in eachindex(x)
out[i] = round(Int16, x[i], RoundToZero)
end
end

function unsafe_roundtest!(out, x)
@inbounds @simd for i in eachindex(x)
out[i] = unsafe_trunc(Int16, x[i])
end
end

using BenchmarkTools

x = 1000*(rand(Float32, 1000) .- 1);
@show length(x)

out1 = similar(x, Int16);
@info "Benchmarking roundtest!"
@btime roundtest!(\$out1, \$x)

out2 = similar(x, Int16);
@info "Benchmarking unsafe_roundtest!"
@btime unsafe_roundtest!(\$out2, \$x)

@assert out1 == out2
``````
``````length(x) = 1000
[ Info: Benchmarking roundtest!
1.095 ΞΌs (0 allocations: 0 bytes)
[ Info: Benchmarking unsafe_roundtest!
96.677 ns (0 allocations: 0 bytes)
``````
2 Likes

Note that IIUC, LLVM will efficiently vectorize `unsafe_roundtest!` above: looking for `VCVTPS2DQ` instructions on `ymm` registers (this is on an AVX2 machine), it looks like the loop was both vectorized and unrolled twice:

``````x = rand(Float32, 10)
out = similar(x, Int16)
@code_native unsafe_roundtest!(out, x)
``````
output
``````        .text
; β @ essai.jl:8 within `unsafe_roundtest!'
movq    %rsi, -8(%rsp)
movq    8(%rsi), %rcx
; ββ @ simdloop.jl:69 within `macro expansion'
; βββ @ abstractarray.jl:212 within `eachindex'
; ββββ @ abstractarray.jl:95 within `axes1'
; βββββ @ abstractarray.jl:75 within `axes'
; ββββββ @ array.jl:155 within `size'
movq    24(%rcx), %rax
; ββββββ
; ββββββ @ tuple.jl:157 within `map'
; βββββββ @ range.jl:320 within `OneTo' @ range.jl:311
; ββββββββ @ promotion.jl:409 within `max'
testq   %rax, %rax
; ββββββββ
; ββ @ simdloop.jl:72 within `macro expansion'
jle     L226
movq    %rax, %rdx
sarq    \$63, %rdx
andnq   %rax, %rdx, %rax
movq    (%rsi), %rdx
movq    (%rcx), %rcx
movq    (%rdx), %rdx
; ββ @ simdloop.jl:75 within `macro expansion'
cmpq    \$32, %rax
jae     L56
xorl    %esi, %esi
jmp     L208
; ββ @ simdloop.jl:75 within `macro expansion'
L56:
leaq    (%rcx,%rax,4), %rsi
cmpq    %rsi, %rdx
jae     L81
leaq    (%rdx,%rax,2), %rsi
; ββ @ simdloop.jl:75 within `macro expansion'
cmpq    %rsi, %rcx
jae     L81
xorl    %esi, %esi
jmp     L208
L81:
movabsq \$9223372036854775776, %rsi # imm = 0x7FFFFFFFFFFFFFE0
; ββ @ simdloop.jl:75 within `macro expansion'
andq    %rax, %rsi
xorl    %edi, %edi
; ββ @ simdloop.jl:77 within `macro expansion' @ essai.jl:9
; βββ @ float.jl:309 within `unsafe_trunc'
L96:
vcvttps2dq      (%rcx,%rdi,4), %ymm0
vextracti128    \$1, %ymm0, %xmm1
vpackssdw       %xmm1, %xmm0, %xmm0
vcvttps2dq      32(%rcx,%rdi,4), %ymm1
vextracti128    \$1, %ymm1, %xmm2
vpackssdw       %xmm2, %xmm1, %xmm1
vcvttps2dq      64(%rcx,%rdi,4), %ymm2
vextracti128    \$1, %ymm2, %xmm3
vpackssdw       %xmm3, %xmm2, %xmm2
vcvttps2dq      96(%rcx,%rdi,4), %ymm3
vextracti128    \$1, %ymm3, %xmm4
vpackssdw       %xmm4, %xmm3, %xmm3
; βββ
; βββ @ array.jl:826 within `setindex!'
vmovdqu %xmm0, (%rdx,%rdi,2)
vmovdqu %xmm1, 16(%rdx,%rdi,2)
vmovdqu %xmm2, 32(%rdx,%rdi,2)
vmovdqu %xmm3, 48(%rdx,%rdi,2)
; βββ
; ββ @ simdloop.jl:78 within `macro expansion'
; βββ @ int.jl:53 within `+'
cmpq    %rdi, %rsi
jne     L96
; βββ
; ββ @ int.jl within `macro expansion'
cmpq    %rsi, %rax
; ββ
; ββ @ simdloop.jl:75 within `macro expansion'
je      L226
nopw    %cs:(%rax,%rax)
nop
; ββ @ simdloop.jl:77 within `macro expansion' @ essai.jl:9
; βββ @ float.jl:309 within `unsafe_trunc'
L208:
vcvttss2si      (%rcx,%rsi,4), %edi
; βββ
; βββ @ array.jl:826 within `setindex!'
movw    %di, (%rdx,%rsi,2)
; βββ
; ββ @ simdloop.jl:78 within `macro expansion'
; βββ @ int.jl:53 within `+'
; βββ
; ββ @ simdloop.jl:75 within `macro expansion'
; βββ @ int.jl:49 within `<'
cmpq    %rax, %rsi
; βββ
jb      L208
; ββ
; β @ simdloop.jl within `unsafe_roundtest!'
L226:
movabsq \$jl_system_image_data, %rax
; β
; β @ essai.jl:8 within `unsafe_roundtest!'
vzeroupper
retq
; β
``````

(I guess all Iβm saying here is that I wouldnβt know how to get faster than that. But here on discourse, you never know: someone might very well prove me wrong in the next post )

5 Likes

Fantastic, thank you! Hadnβt noticed unsafs_trunc