Could fill! be twice as fast?


#1

I discovered an interesting thing while trying out some code on Julia 0.6. On my system, I can now fill an array of Float64 with zeros in about half the time that A .= 0.0 or fill!(A, 0.0) does it.

The reason seems to be that when fill! is compiled with a constant fill value of zero, it becomes an llvm.memset instead of a loop.

But maybe someone who’s good with llvm voodoo could make the loop run as fast as a memset? This might speed up other code as well.

Here’s an example:

julia> versioninfo()
Julia Version 0.6.0-rc1.0
Commit 6bdb3950bd (2017-05-07 00:00 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> A = Array{Float64}(1_000_000);

julia> using BenchmarkTools

julia> @benchmark A .= 0.0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     442.393 μs (0.00% GC)
  median time:      452.308 μs (0.00% GC)
  mean time:        467.586 μs (0.00% GC)
  maximum time:     1.237 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> immutable Zero end

julia> Base.convert(::Type{Float64}, ::Zero) = 0.0

julia> @benchmark A .= Zero()
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     237.058 μs (0.00% GC)
  median time:      240.746 μs (0.00% GC)
  mean time:        244.114 μs (0.00% GC)
  maximum time:     548.111 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

(On Julia 0.5.2, both versions run at the slower speed.)


#2

Or what about just a fill! which uses a value type for the second argument? fill!(A, Val{0.0}). Could that do the same? You would only want to use this if the value is a constant, but I find that's the most likely case (just zeroing an array).
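A minimal sketch of that idea (the method name fillval! is made up for illustration; the point is that dispatching on a Val type makes the fill value a compile-time constant, so the loop body contains a literal that the compiler can turn into an llvm.memset):

```julia
# Hypothetical sketch: dispatch on Val so the fill value is known at
# compile time. The literal 0.0 in the loop body is then eligible for
# the llvm.memset transformation discussed in this thread.
function fillval!(A::Array{Float64}, ::Type{Val{0.0}})
    for i in eachindex(A)
        @inbounds A[i] = 0.0
    end
    return A
end

A = ones(1_000)        # allocate and then re-zero it
fillval!(A, Val{0.0})
```

This only helps when the caller writes the value as a literal; a runtime value passed through Val would defeat the purpose (and trigger recompilation per value).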


#3

The value type would have to consist of identical bytes for the memset magic to kick in, since that function takes a byte. There must be a way to make it work for words as well…
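One way to check whether a value qualifies (the helper name is invented for illustration): reinterpret the value's bits and compare them against its lowest byte repeated across the whole word.

```julia
# Illustrative helper (not in Base): true if every byte of x's bit
# pattern is identical, i.e. the value could be written with a memset.
function identical_bytes(x::Float64)
    bits = reinterpret(UInt64, x)
    b = bits % UInt8                           # lowest byte
    pattern = UInt64(b) * 0x0101010101010101   # that byte repeated 8x
    return bits == pattern
end

identical_bytes(0.0)   # true: all bytes are 0x00
identical_bytes(1.0)   # false: bit pattern is 0x3ff0000000000000
```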


#4

Perhaps something like:

immutable Zero{T} end

Base.convert{T}(::Type{T}, ::Zero{T}) = zero(T)

function fill2!{T <: Number}(A::Array, v::T)
    if v != zero(T)
        fill!(A, v)
    else
        fill!(A, Zero{T}())
    end
end

I don’t think specializing on a value other than zero is useful in practice. Perhaps one.


#5

Are you using the binary release? If so, https://github.com/JuliaLang/julia/pull/21849 will fix it.

It isn’t useful to specialize on any value in practice. There isn’t a magic instruction to zero out memory faster; memset is faster only because libc uses function multi-versioning to select wider SIMD instructions.


#6

Okay, I’m running with a compiled sysimg and still see a difference. Is that expected?


#7

The sysimg must be compiled with the native target. You can check with code_llvm whether vectorization kicks in.
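For example, something along these lines shows whether the loop was vectorized or replaced by a memset call (the exact IR varies by Julia and LLVM version):

```julia
# A plain zero-filling loop to inspect. With a constant literal in the
# body, the IR typically shows a call to the llvm.memset intrinsic
# instead of an explicit (possibly vectorized) loop.
function zerofill!(A::Array{Float64})
    for i in eachindex(A)
        @inbounds A[i] = 0.0
    end
    return A
end

@code_llvm zerofill!(zeros(8))   # prints the generated LLVM IR
```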


#8

I’ve tried build_sysimg(default_sysimg_path(), "native", nothing, force=true), and I’ve tried copying the function definition of Base.fill! into the REPL. Nothing changes. I’ve even tried writing explicitly vectorized loops using SIMD.jl, without being able to improve on the current performance.

Filling with a literal that consists of identical bytes causes the compiler to emit an llvm.memset instead of an explicit loop, and somehow this is faster than any of the above.

For example, the following function is also about twice as fast as fill! on my system:

function weird_fill!(A::Array{Float64})
    for i in eachindex(A)
        @inbounds A[i] = 7.748604185489348e-304
    end
end

but if I change any digit of the constant, then it runs at the speed of fill! instead.
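That particular constant appears to be exactly the Float64 whose bit pattern is the byte 0x01 repeated eight times, which is easy to verify with reinterpret:

```julia
# The literal from weird_fill! should reinterpret to eight identical
# bytes, which is what makes it eligible for llvm.memset; going the
# other way recovers the same decimal literal.
reinterpret(UInt64, 7.748604185489348e-304)   # 0x0101010101010101
reinterpret(Float64, 0x0101010101010101)      # 7.748604185489348e-304
```

Changing any digit of the literal gives a value whose bytes are no longer identical, so the memset transformation no longer applies.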


#9

Just a side question: how often do you need to fill big arrays with the same/null constant value? A doubling of performance looks impressive, but actions like this usually end up in the <1% part of my program profiles (Amdahl’s law applies…)


#10

For bits types, I think filling with 0x00 bytes or 0xff bytes to initialize an array when you create it (or reinitialize it) is common enough that it would be nice to have this perform better.
For other types, probably only zeroing out, or maybe setting a floating-point array to all NaN (which would need something other than just memset), would be that frequent.


#11

I don’t really care about fill! specifically, but if there’s a “trick” that memset uses and that will make code that’s limited by writes run twice as fast, then maybe that trick could speed up other things (like broadcast!) as well.


#12

According to the LLVM docs, “the ‘llvm.memset.*’ intrinsics fill a block of memory with a particular byte value.” So one of the “tricks” here is probably just that it only operates on bytes, not on quad words. I guess this allows using faster instructions, but it doesn’t work for an arbitrary non-UInt8 value.

Special-casing zero in fill! sounds like a good idea to me given that it’s a relatively frequent use, and that it’s by far the most common value which happens to be composed of a repeated 8-bit pattern.
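A special case along those lines could be sketched with a direct ccall to libc's memset (illustrative only; a real fix would live inside Base.fill!, and this uses the modern Ptr{Cvoid} spelling, which was Ptr{Void} on 0.6):

```julia
# Illustrative sketch: zero a Float64 array via libc memset.
# This is only valid because the bit pattern of 0.0 is all zero bytes.
function memset_zero!(A::Array{Float64})
    ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t),
          A, 0, sizeof(A))
    return A
end

memset_zero!(ones(1_000_000))
```

The same pattern would extend to any isbits element whose fill value is a repeated byte (e.g. 0xff), but not to arbitrary values.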


#13

Ref https://github.com/JuliaLang/julia/pull/21279#issuecomment-291930880 where basically the opposite effect was noticed: Generated code could be more efficient than calling system libc memset. (Although I don’t know whether llvm.memset calls out to the libc memset).


#14

That will certainly slow things down.

There should still be no “tricks” that memset can use but a generic fill! for an isbits-type array can’t (not on any target we care about, anyway), though LLVM is known to be not smart enough in some cases. If you want to understand this better, you’ll probably need to look at the LLVM IR and assembly that actually get generated, and compare that to the memset you are calling. Two things that I know LLVM doesn’t do are aligned SIMD instructions and REP STOS. According to the glibc comments, REP STOS with a wide register should also be fast and can handle more bit patterns.