Performance of filling an array

Here are several ways to fill an array, with benchmarks. It seems peculiar to me that f6! makes fewer allocations than the other allocating variants, and that f4! to f6! are faster than f1!.

  • Is there an “official” way to fill an array?
  • Does the performance difference come down to bounds checking?

All benchmarks are done with a self-compiled Julia 1.1 from the current release-1.1 branch on Ubuntu 16.04.

using BenchmarkTools

x = Array{Float64}(undef, 1000, 1000)
f1!(x) = fill!(x, 0.0)
f2!(x) = x .= 0.0;
f3!(x) = x .= 0;
f4!(x) = x[:] .= 0.0;
f5!(x) = x[:] .= 0;
f6!(x) = x[:,:] .= 0;
function f7!(x)
    @inbounds for i in eachindex(x)
        x[i] = 0
    end
end
function f8!(x)
    for i in eachindex(x)
        x[i] = 0
    end
end

julia> @btime f1!($x);
  655.608 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  659.358 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  662.742 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  386.123 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  378.328 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  377.334 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  377.607 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  664.502 μs (0 allocations: 0 bytes)

I do not get the same performance differences: f1! to f3! take ~350 μs, f4! to f7! take ~250 μs, and f8! takes ~650 μs.

I think fill! is intended as the ‘official’ way (if there can be one).

The difference is indeed related to bounds checking for f8!: running julia with --check-bounds=no speeds it up (but does not affect f1! to f3!).
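A minimal way to see this for yourself (function names here are mine, not from the thread) is to compare the generated code for a checked and an unchecked loop. With @inbounds, LLVM is free to vectorize the loop or lower it to a memset; without it, each iteration carries a bounds-check branch that can block those optimizations:

```julia
# Checked loop: each x[i] = 0 may carry a bounds-check branch.
zero_checked!(x) = (for i in eachindex(x); x[i] = 0; end; x)

# Unchecked loop: @inbounds removes the branch, enabling SIMD/memset.
zero_unchecked!(x) = (@inbounds for i in eachindex(x); x[i] = 0; end; x)

# Inspect the difference (look for the out-of-bounds error branch):
# @code_llvm zero_checked!(rand(4))
# @code_llvm zero_unchecked!(rand(4))
```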

Looks kind of similar to

Hah, that’s funny. Try it with any value other than 0:

julia> x = Array{Float64}(undef, 1000, 1000)
       f1!(x) = fill!(x, 1.0)
       f2!(x) = x .= 1.0;
       f3!(x) = x .= 1;
       f4!(x) = x[:] .= 1.0;
       f5!(x) = x[:] .= 1;
       f6!(x) = x[:,:] .= 1;
       function f7!(x)
           @inbounds for i in eachindex(x)
               x[i] = 1
           end
       end
       function f8!(x)
           for i in eachindex(x)
               x[i] = 1
           end
       end
f8! (generic function with 1 method)

julia> @btime f1!($x);
  430.771 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  429.985 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  431.457 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  430.093 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  430.087 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  432.691 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  431.716 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  465.570 μs (0 allocations: 0 bytes)

So what’s going on here? It’s that some of these cases allow for constant propagation of the 0 the “whole way down” to the inner loop — and if that 0 is available to LLVM at compile time, then LLVM can use special instructions to zero the entire chunk of memory.
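A sketch of this effect (function names are mine): when the fill value is the literal 0, the compiler can prove at compile time that every byte written is zero and emit a memset; when the value arrives as a runtime argument, it cannot, and it falls back to an ordinary store loop:

```julia
# The literal 0.0 is visible to the compiler: a memset-style fast path
# is possible.
fill_literal!(x) = (@inbounds for i in eachindex(x); x[i] = 0.0; end; x)

# The value v is only known at run time: no memset, just a SIMD store loop.
fill_runtime!(x, v) = (@inbounds for i in eachindex(x); x[i] = v; end; x)

# @btime fill_literal!($x)
# @btime fill_runtime!($x, 0.0)
```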

As far as why some of these forms allow for constant propagation and some don’t, it appears as though there was a strange edge case in the compiler back in the 0.7 timeframe that prompted a simple workaround. That’s no longer necessary.

That’s great!

But do you know why x[:] .= 0 is faster than x .= 0?

foo!(x, val) = (x .= val)
bar!(x, val) = (x[:] .= val)

julia> @btime foo!($x, 0);
  379.111 μs (0 allocations: 0 bytes)

julia> @btime bar!($x, 0);
  280.690 μs (1 allocation: 48 bytes)

Yes, it’s because we have a peephole “performance optimization” in broadcast to use fill! for simple cases — because that should be the fastest way to do it. But it backfired here…
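For context, `x .= 0` desugars to a broadcast materialization like the one below, and it is inside this machinery that the scalar fast path can dispatch to fill!:

```julia
using Base.Broadcast: broadcasted, materialize!

# `x .= 0.0` is lowered by the parser to roughly this call;
# the broadcast machinery can then route a scalar RHS to fill!.
x = ones(3)
materialize!(x, broadcasted(identity, 0.0))  # equivalent to x .= 0.0
```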

I can’t reproduce these differences on my system; I get 325.541 μs and 327.219 μs, respectively.

OK, so it should be fixed once fill! is fixed, then?

It’s likely you have an older processor that doesn’t have the AVX-512 (wide SIMD) instruction set.

Yup!
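One quick way to check which microarchitecture Julia targets on your machine (the exact string varies by CPU, so the example name is illustrative):

```julia
# Prints the LLVM target CPU name, e.g. "skylake-avx512" on AVX-512
# hardware, or an older name like "haswell" without it.
println(Sys.CPU_NAME)
```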