Could fill! be twice as fast?


#1

I discovered an interesting thing while trying out some code on Julia 0.6. On my system, I can now fill an array of Float64 with zeros in about half the time that A .= 0.0 or fill!(A, 0.0) does it.

The reason seems to be that when fill! is compiled with a constant fill value of zero, it becomes an llvm.memset instead of a loop.

But maybe someone who’s good with llvm voodoo could make the loop run as fast as a memset? This might speed up other code as well.

Here’s an example:

julia> versioninfo()
Julia Version 0.6.0-rc1.0
Commit 6bdb3950bd (2017-05-07 00:00 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> A = Array{Float64}(1_000_000);

julia> using BenchmarkTools

julia> @benchmark A .= 0.0
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     442.393 μs (0.00% GC)
  median time:      452.308 μs (0.00% GC)
  mean time:        467.586 μs (0.00% GC)
  maximum time:     1.237 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> immutable Zero end

julia> Base.convert(::Type{Float64}, ::Zero) = 0.0

julia> @benchmark A .= Zero()
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     237.058 μs (0.00% GC)
  median time:      240.746 μs (0.00% GC)
  mean time:        244.114 μs (0.00% GC)
  maximum time:     548.111 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

(On Julia 0.5.2, both versions run at the slower speed.)


#2

Or what about just a fill! which uses a value type for the second argument? fill!(A, Val{0.0}). Could that do the same? You would only want to use this if the value is a constant, but I find that's the most likely case (just zeroing an array).
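A minimal sketch of that idea (the method name fillval! is made up for illustration; the point is that dispatching on a Val type makes the fill value a compile-time constant, so the loop body contains a literal that the compiler can turn into an llvm.memset):

```julia
# Hypothetical sketch: dispatch on Val so the fill value is known at
# compile time. The literal 0.0 in the loop body is then eligible for
# the llvm.memset transformation discussed in this thread.
function fillval!(A::Array{Float64}, ::Type{Val{0.0}})
    for i in eachindex(A)
        @inbounds A[i] = 0.0
    end
    return A
end

A = ones(1_000)        # allocate and then re-zero it
fillval!(A, Val{0.0})
```

This only helps when the caller writes the value as a literal; a runtime value passed through Val would defeat the purpose (and trigger recompilation per value).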


#3

The value type would have to consist of identical bytes for the memset magic to kick in, since that function takes a byte. There must be a way to make it work for words as well…
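One way to check whether a value qualifies (the helper name is invented for illustration): reinterpret the value's bits and compare them against its lowest byte repeated across the whole word.

```julia
# Illustrative helper (not in Base): true if every byte of x's bit
# pattern is identical, i.e. the value could be written with a memset.
function identical_bytes(x::Float64)
    bits = reinterpret(UInt64, x)
    b = bits % UInt8                           # lowest byte
    pattern = UInt64(b) * 0x0101010101010101   # that byte repeated 8x
    return bits == pattern
end

identical_bytes(0.0)   # true: all bytes are 0x00
identical_bytes(1.0)   # false: bit pattern is 0x3ff0000000000000
```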


#4

Perhaps something like:

immutable Zero{T} end

Base.convert{T}(::Type{T}, ::Zero{T}) = zero(T)

function fill2!{T <: Number}(A::Array, v::T)
    if v != zero(T)
        fill!(A, v)
    else
        fill!(A, Zero{T}())
    end
end

I don’t think specializing on a value other than zero is useful in practice. Perhaps one.


#5

Are you using the binary release? If so, https://github.com/JuliaLang/julia/pull/21849 will fix it.

It isn’t useful to specialize on any value in practice. There isn’t a magic instruction to zero out memory faster; memset is faster only because libc uses function multi-versioning to select wider SIMD instructions.


#6

Okay, I’m running with a compiled sysimg and still see a difference. Is that expected?


#7

The sysimg must be compiled with the native target. You can check with code_llvm whether vectorization kicks in.
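For example, something along these lines shows whether the loop was vectorized or replaced by a memset call (the exact IR varies by Julia and LLVM version):

```julia
# A plain zero-filling loop to inspect. With a constant literal in the
# body, the IR typically shows a call to the llvm.memset intrinsic
# instead of an explicit (possibly vectorized) loop.
function zerofill!(A::Array{Float64})
    for i in eachindex(A)
        @inbounds A[i] = 0.0
    end
    return A
end

@code_llvm zerofill!(zeros(8))   # prints the generated LLVM IR
```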


#8

I’ve tried build_sysimg(default_sysimg_path(), "native", nothing, force=true), and I’ve tried copying the function definition of Base.fill! into the REPL. Nothing changes. I’ve even tried writing explicitly vectorized loops using SIMD.jl, without being able to improve on the current performance.

Filling with a literal that consists of identical bytes causes the compiler to emit an llvm.memset instead of an explicit loop, and somehow this is faster than any of the above.

For example, the following function is also about twice as fast as fill! on my system:

function weird_fill!(A::Array{Float64})
    for i in eachindex(A)
        @inbounds A[i] = 7.748604185489348e-304
    end
end

but if I change any digit of the constant, then it runs at the speed of fill! instead.
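That particular constant appears to be exactly the Float64 whose bit pattern is the byte 0x01 repeated eight times, which is easy to verify with reinterpret:

```julia
# The literal from weird_fill! should reinterpret to eight identical
# bytes, which is what makes it eligible for llvm.memset; going the
# other way recovers the same decimal literal.
reinterpret(UInt64, 7.748604185489348e-304)   # 0x0101010101010101
reinterpret(Float64, 0x0101010101010101)      # 7.748604185489348e-304
```

Changing any digit of the literal gives a value whose bytes are no longer identical, so the memset transformation no longer applies.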


#9

Just a side question: how often do you need to fill big arrays with the same/null constant value? A doubling of performance looks impressive, but actions like this usually end up in the <1% part of my program profiles (Amdahl’s law applies…)


#10

For bits types, I think filling with 0x00 bytes or 0xff bytes to initialize an array when you create it (or reinitialize it) is common enough that it would be nice to have this perform better.
For other types, probably only zeroing out, or maybe setting a floating-point array to all NaN (which would need something other than just memset), would be that frequent.


#11

I don’t really care about fill! specifically, but if there’s a “trick” that memset uses and that will make code that’s limited by writes run twice as fast, then maybe that trick could speed up other things (like broadcast!) as well.


#12

According to the LLVM docs, “the ‘llvm.memset.*’ intrinsics fill a block of memory with a particular byte value.” So one of the “tricks” here is probably just that it only operates on bytes, not on quad words. I guess this allows using faster instructions, but it doesn’t work for an arbitrary non-UInt8 value.

Special-casing zero in fill! sounds like a good idea to me given that it’s a relatively frequent use, and that it’s by far the most common value which happens to be composed of a repeated 8-bit pattern.
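A special case along those lines could be sketched with a direct ccall to libc's memset (illustrative only; a real fix would live inside Base.fill!, and this uses the modern Ptr{Cvoid} spelling, which was Ptr{Void} on 0.6):

```julia
# Illustrative sketch: zero a Float64 array via libc memset.
# This is only valid because the bit pattern of 0.0 is all zero bytes.
function memset_zero!(A::Array{Float64})
    ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t),
          A, 0, sizeof(A))
    return A
end

memset_zero!(ones(1_000_000))
```

The same pattern would extend to any isbits element whose fill value is a repeated byte (e.g. 0xff), but not to arbitrary values.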


#13

Ref https://github.com/JuliaLang/julia/pull/21279#issuecomment-291930880 where basically the opposite effect was noticed: Generated code could be more efficient than calling system libc memset. (Although I don’t know whether llvm.memset calls out to the libc memset).


#14

That will certainly slow things down.

There should still be no “tricks” that memset can use but a generic fill! for an isbits-type array can’t (not on any target we care about, anyway), though LLVM is known to be not smart enough in some cases. If you want to understand this better, you’ll probably need to look at the LLVM IR and assembly that actually get generated, and compare that to the memset you are calling. Two things that I know LLVM doesn’t do are aligned SIMD instructions and REP STOS. According to the glibc comments, REP STOS with a wide register should also be fast and can handle more bit patterns.