Performance of filling an array

Here are several ways to fill an array, with benchmarks. It seems peculiar to me that f6! makes fewer allocations than the other allocating variants, and that f4! to f6! are faster than f1!.

  • Is there an “official” way to fill an array?
  • Does the performance difference come down to bounds checking?

All benchmarks are done with a self-compiled Julia 1.1 from the current release-1.1 branch on Ubuntu 16.04.

using BenchmarkTools

x = Array{Float64}(undef, 1000, 1000)
f1!(x) = fill!(x, 0.0)
f2!(x) = x .= 0.0;
f3!(x) = x .= 0;
f4!(x) = x[:] .= 0.0;
f5!(x) = x[:] .= 0;
f6!(x) = x[:,:] .= 0;
function f7!(x)
    @inbounds for i in eachindex(x)
        x[i] = 0
    end
end
function f8!(x)
    for i in eachindex(x)
        x[i] = 0
    end
end

julia> @btime f1!($x);
  655.608 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  659.358 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  662.742 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  386.123 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  378.328 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  377.334 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  377.607 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  664.502 μs (0 allocations: 0 bytes)

I do not get the same performance differences: f1! to f3! take ~350 μs, f4! to f7! take ~250 μs, and f8! takes ~650 μs.

I think fill! is intended as the ‘official’ way (if there can be one).

The difference is indeed related to bounds checking for f8!: running julia with --check-bounds=no speeds it up (but does not affect f1! to f3!).
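A minimal way to see this for yourself (function names here are mine, not from the thread) is to compare the generated code for a checked and an unchecked loop. With @inbounds, LLVM is free to vectorize the loop or lower it to a memset; without it, each iteration carries a bounds-check branch that can block those optimizations:

```julia
# Checked loop: each x[i] = 0 may carry a bounds-check branch.
zero_checked!(x) = (for i in eachindex(x); x[i] = 0; end; x)

# Unchecked loop: @inbounds removes the branch, enabling SIMD/memset.
zero_unchecked!(x) = (@inbounds for i in eachindex(x); x[i] = 0; end; x)

# Inspect the difference (look for the out-of-bounds error branch):
# @code_llvm zero_checked!(rand(4))
# @code_llvm zero_unchecked!(rand(4))
```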

Looks kind of similar to

Hah, that’s funny. Try it with any value other than 0:

julia> x = Array{Float64}(undef, 1000, 1000)
       f1!(x) = fill!(x, 1.0)
       f2!(x) = x .= 1.0;
       f3!(x) = x .= 1;
       f4!(x) = x[:] .= 1.0;
       f5!(x) = x[:] .= 1;
       f6!(x) = x[:,:] .= 1;
       function f7!(x)
           @inbounds for i in eachindex(x)
               x[i] = 1
           end
       end
       function f8!(x)
           for i in eachindex(x)
               x[i] = 1
           end
       end
f8! (generic function with 1 method)

julia> @btime f1!($x);
  430.771 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  429.985 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  431.457 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  430.093 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  430.087 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  432.691 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  431.716 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  465.570 μs (0 allocations: 0 bytes)

So what’s going on here? It’s that some of these cases allow for constant propagation of the 0 the “whole way down” to the inner loop — and if that 0 is available to LLVM at compile time, then LLVM can use special instructions to zero the entire chunk of memory.
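A sketch of this effect (function names are mine): when the fill value is the literal 0, the compiler can prove at compile time that every byte written is zero and emit a memset; when the value arrives as a runtime argument, it cannot, and it falls back to an ordinary store loop:

```julia
# The literal 0.0 is visible to the compiler: a memset-style fast path
# is possible.
fill_literal!(x) = (@inbounds for i in eachindex(x); x[i] = 0.0; end; x)

# The value v is only known at run time: no memset, just a SIMD store loop.
fill_runtime!(x, v) = (@inbounds for i in eachindex(x); x[i] = v; end; x)

# @btime fill_literal!($x)
# @btime fill_runtime!($x, 0.0)
```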

As far as why some of these forms allow for constant propagation and some don’t, it appears as though there was a strange edge case in the compiler back in the 0.7 timeframe that prompted a simple workaround. That’s no longer necessary.

That’s great!

But do you know why x[:] .= 0 is faster than x .= 0?

foo!(x, val) = (x .= val)
bar!(x, val) = (x[:] .= val)

julia> @btime foo!($x, 0);
  379.111 μs (0 allocations: 0 bytes)

julia> @btime bar!($x, 0);
  280.690 μs (1 allocation: 48 bytes)

Yes, it’s because we have a peephole “performance optimization” in broadcast to use fill! for simple cases — because that should be the fastest way to do it. But it backfired here…
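For context, `x .= 0` desugars to a broadcast materialization like the one below, and it is inside this machinery that the scalar fast path can dispatch to fill!:

```julia
using Base.Broadcast: broadcasted, materialize!

# `x .= 0.0` is lowered by the parser to roughly this call;
# the broadcast machinery can then route a scalar RHS to fill!.
x = ones(3)
materialize!(x, broadcasted(identity, 0.0))  # equivalent to x .= 0.0
```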

I can’t reproduce these differences on my system; I get 325.541 μs and 327.219 μs, respectively.

OK, so it should be fixed once fill! is fixed, then?

It’s likely you have an older processor that doesn’t have the AVX-512 (wide SIMD) instruction set.

Yup!
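One quick way to check which microarchitecture Julia targets on your machine (the exact string varies by CPU, so the example name is illustrative):

```julia
# Prints the LLVM target CPU name, e.g. "skylake-avx512" on AVX-512
# hardware, or an older name like "haswell" without it.
println(Sys.CPU_NAME)
```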