Performance of filling an array

Here are several ways to fill an array, and their performance. It seems peculiar to me that f6! makes fewer allocations than f4! and f5!, and that f4! through f6! are faster than f1!.

  • Is there an “official” way to fill an array?
  • Does the performance difference come down to bounds checking?

All benchmarks were run with a self-compiled Julia 1.1 from the current release-1.1 branch on Ubuntu 16.04.

using BenchmarkTools

x = Array{Float64}(undef, 1000, 1000)
f1!(x) = fill!(x, 0.0)
f2!(x) = x .= 0.0;
f3!(x) = x .= 0;
f4!(x) = x[:] .= 0.0;
f5!(x) = x[:] .= 0;
f6!(x) = x[:,:] .= 0;
function f7!(x)
    @inbounds for i in eachindex(x)
        x[i] = 0
    end
end
function f8!(x)
    for i in eachindex(x)
        x[i] = 0
    end
end

julia> @btime f1!($x);
  655.608 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  659.358 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  662.742 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  386.123 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  378.328 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  377.334 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  377.607 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  664.502 μs (0 allocations: 0 bytes)

I do not get the same performance differences …
f1! to f3! take ~350 μs, f4! to f7! take ~250 μs, f8! takes ~650 μs.

I think fill! is intended as the ‘official’ way (if there can be one).

The differences are indeed related to bounds checking for f8!: running julia with --check-bounds=no speeds it up (but does not change f1! to f3!).
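
One way to confirm this (my own check, not something from the thread) is to compare the LLVM IR that Julia generates for the f7! and f8! defined above; with the bounds checks left in, the loop body should not vectorize as well.

using InteractiveUtils   # provides @code_llvm (already loaded in the REPL)

@code_llvm f7!(x)   # with @inbounds: expect a memset-like or SIMD loop body
@code_llvm f8!(x)   # without @inbounds: expect branches to the bounds-check error path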

Looks kind of similar to

Hah, that’s funny. Try it with any value other than 0:

julia> x = Array{Float64}(undef, 1000, 1000)
       f1!(x) = fill!(x, 1.0)
       f2!(x) = x .= 1.0;
       f3!(x) = x .= 1;
       f4!(x) = x[:] .= 1.0;
       f5!(x) = x[:] .= 1;
       f6!(x) = x[:,:] .= 1;
       function f7!(x)
           @inbounds for i in eachindex(x)
               x[i] = 1
           end
       end
       function f8!(x)
           for i in eachindex(x)
               x[i] = 1
           end
       end
f8! (generic function with 1 method)

julia> @btime f1!($x);
  430.771 μs (0 allocations: 0 bytes)

julia> @btime f2!($x);
  429.985 μs (0 allocations: 0 bytes)

julia> @btime f3!($x);
  431.457 μs (0 allocations: 0 bytes)

julia> @btime f4!($x);
  430.093 μs (3 allocations: 128 bytes)

julia> @btime f5!($x);
  430.087 μs (3 allocations: 128 bytes)

julia> @btime f6!($x);
  432.691 μs (1 allocation: 48 bytes)

julia> @btime f7!($x);
  431.716 μs (0 allocations: 0 bytes)

julia> @btime f8!($x);
  465.570 μs (0 allocations: 0 bytes)

So what’s going on here? It’s that some of these cases allow for constant propagation of the 0 the “whole way down” to the inner loop — and if that 0 is available to LLVM at compile time, then LLVM can use special instructions to zero the entire chunk of memory.
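
A minimal sketch of that claim (the function names below are mine, and x is the array from above): the two loops are identical except that only the first has the zero as a compile-time literal, so only the first can be lowered to a memset-style zeroing of the buffer.

function zero_literal!(x)
    @inbounds for i in eachindex(x)
        x[i] = 0        # literal zero, visible to LLVM at compile time
    end
    return x
end

function zero_runtime!(x, v)
    @inbounds for i in eachindex(x)
        x[i] = v        # value only known at run time
    end
    return x
end

@btime zero_literal!($x);        # should land near the fast ~380 μs group above
@btime zero_runtime!($x, 0.0);   # expected to behave like the fill-with-1 timings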


As for why some of these forms allow constant propagation and some don't: it appears there was a strange edge case in the compiler back in the 0.7 timeframe that prompted a simple workaround. That workaround is no longer necessary.

That’s great!

But do you know why x[:] .= 0 is faster than x .= 0?

foo!(x, val) = (x .= val)
bar!(x, val) = (x[:] .= val)

julia> @btime foo!($x, 0);
  379.111 μs (0 allocations: 0 bytes)

julia> @btime bar!($x, 0);
  280.690 μs (1 allocation: 48 bytes)

Yes, it’s because we have a peephole “performance optimization” in broadcast to use fill! for simple cases — because that should be the fastest way to do it. But it backfired here…
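
One way to sanity-check that from the outside, without reading the Base source (again, my own check): if x .= 0 really is rewritten to fill!, the two should benchmark the same, while x[:] .= 0 broadcasts into a view of x instead.

@btime fill!($x, 0);        # the shortcut x .= 0 is said to take
@btime $x .= 0;             # should match fill! if the shortcut fires
@btime view($x, :) .= 0;    # x[:] on the left of .= lowers to a view like this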

I can’t reproduce these differences on my system; I get 325.541 μs and 327.219 μs, respectively.

OK, so it should be fixed once fill! is fixed, then?

It’s likely you have an older processor that doesn’t have the AVX-512 (wide SIMD) instruction set.
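
For anyone who wants to check their own machine (my suggestion, not from the thread), Julia reports the detected CPU:

julia> Sys.CPU_NAME                  # e.g. "skylake-avx512" when AVX-512 is available

julia> versioninfo(verbose = true)   # prints the CPU model for each core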

Yup!
