Broadcast of .== slow performance, allocations

Why is .== so slow compared to broadcast in this example?

julia> test = [1]
1-element Array{Int64,1}:
 1

julia> @btime test .== 1
  3.304 μs (13 allocations: 4.72 KiB)
1-element BitArray{1}:
 true

julia> @btime broadcast(==,test,1)
  645.236 ns (3 allocations: 4.31 KiB)
1-element BitArray{1}:
 true

and why dooes there need to be 4kb of allocations for this?

Is there any more efficient way of doing this without such big allocations?

As for making a faster way, here is one example

function check(x::Vector)
    l = length(x)
    out = BitArray{1}(undef,l)
    for i ∈ 1:l
        out[i] = x[i] == 1
    end
    return out
end

and then

julia> @btime check([1])
  60.088 ns (3 allocations: 224 bytes)
1-element BitArray{1}:
 true

why can’t Julia have a .== function that works as fast as check does?

Use $ to flag variable arguments for accurate benchmarking:

julia> @btime $test .== 1
  946.680 ns (3 allocations: 4.31 KiB)
1-element BitArray{1}:
 true

julia> @btime broadcast(==,$test,1)
  956.500 ns (3 allocations: 4.31 KiB)
1-element BitArray{1}:
 true

(The difference is due to effective inlining with type-unstable global values)

7 Likes

As far as why it’s allocating and slower than doing something simple like check (although note that your check function isn’t doing what you think it’s doing), it’s because that simple thing is actually slower for larger BitArrays. We could add a heuristic to that optimization such that it only kicks in for broadcasts that are big enough to benefit.

I just did some playing around with a corrected check and it seems like that optimization requires a pretty big BitArray to pay off.

function check(x::Vector, y)
    l = length(x)
    out = BitArray{1}(undef, l)
    @inbounds for i ∈ 1:l
        out[i] = x[i] == y
    end
    out
end

julia> x = rand(1:10, 10);

julia> @btime $x .== 1;
384.750 ns (3 allocations: 4.31 KiB)

julia> @btime check($x, 1);
48.180 ns (2 allocations: 128 bytes)
julia> x = rand(1:10, 100);

julia> @btime $x .== 1;
446.205 ns (3 allocations: 4.31 KiB)

julia> @btime check($x, 1);
200.891 ns (2 allocations: 128 bytes)
julia> x = rand(1:10, 1000);

julia> @btime $x .== 1;
1.137 μs (3 allocations: 4.42 KiB)

julia> @btime check($x, 1);
1.709 μs (2 allocations: 240 bytes)
2 Likes

First, this is a benchmarking artifact.

Second, the base/broadcast variant appears to use some tricks. Before looking at the actual implementation, the two obvious things to do are: (1) branch-free implementation and (2) accumulate 64 values into a register and only commit it to memory afterwards. If necessary, then commit it to memory for every entry, but don’t read it back in between (writing may be necessary because the predicate might throw an exception). A write out[i]= ... involves a 64-bit read, some bit-wrangling, and a 64-bit write. Most of the time, the next index will read the just written 64-bit value again. This is a dependency chain mediated through memory, and requires load-store-forwarding. This is much slower than just keeping the value in register or even a normal load from L1 cache, cf eg here.

julia> check_bc(x, y)= x.==y;

julia> function check_naive(x::Vector, y)
           l = length(x)
           out = BitArray{1}(undef,l)
           for i ∈ 1:l
               out[i] = x[l] == 1
           end
           return out
       end
check_naive (generic function with 1 method)

julia> using BenchmarkTools

julia> y=1; xsmall=[1]; xlarge=rand(1:4, 100_003);

julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
  872.460 ns (3 allocations: 4.31 KiB)
  57.026 ns (2 allocations: 128 bytes)

julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
  154.259 μs (3 allocations: 16.59 KiB)
  295.649 μs (2 allocations: 12.41 KiB)

julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3

Using that knowledge about how modern CPU work, we can try to monkey-patch the base implementation. First with the “fast but wrong in view of exceptions” variant:

julia> @eval Base.Broadcast function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
                  axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
                  ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
                  destc = dest.chunks
                  ind = cind = 1
                  bcp = preprocess(dest, bc)
                  length(bcp)<=0 && return dest
                  @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
                      z = UInt64(0)
                      for j=0:63
                         z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
                      end
                      destc[i+1] = z
                  end
                  i = Base.num_bit_chunks(length(bcp))-1
                  z = UInt64(0)
                  @inbounds for j=0:(length(bcp)-1)&63
                       z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
                  end
                  @inbounds destc[i+1] = z
                  return dest
              end

julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
  80.744 ns (4 allocations: 192 bytes)
  56.908 ns (2 allocations: 128 bytes)

julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
  42.910 μs (4 allocations: 12.47 KiB)
  295.515 μs (2 allocations: 12.41 KiB)

julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3

Now with the correct variant:

julia> @eval Base.Broadcast function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
                  axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
                  ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
                  destc = dest.chunks
                  ind = cind = 1
                  bcp = preprocess(dest, bc)
                  length(bcp)<=0 && return dest
                  @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
                      z = UInt64(0)
                      for j=0:63
                         z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
                         destc[i+1] = z
                      end
                  end
                  i = Base.num_bit_chunks(length(bcp))-1
                  z = UInt64(0)
                  @inbounds for j=0:(length(bcp)-1)&63
                       z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
                       @inbounds destc[i+1] = z
                  end
                  return dest
                  end

julia> @btime check_bc($xsmall, $y); @btime check_naive($xsmall, $y);
  80.319 ns (4 allocations: 192 bytes)
  57.064 ns (2 allocations: 128 bytes)

julia> @btime check_bc($xlarge, $y); @btime check_naive($xlarge, $y);
  119.142 μs (4 allocations: 12.47 KiB)
  295.410 μs (2 allocations: 12.41 KiB)

julia> @show hash(check_bc(xlarge, y).chunks);
hash((check_bc(xlarge, y)).chunks) = 0xd427141e4a359ed3

I guess I’ll make a PR and hope that @mbauman can help me figure out whether this (much simpler!) algorithm can be made to work with arbitrary axes, and whether we can find a way to get the less correct variant working (the 3x speedup is definitely valid for broadcast, because we hold the only reference to the destination: In case of exceptions we don’t need to leave it in a valid state. I don’t really understand why the stores are so freakishly expensive.).

PS. When I look at the current code, then we can use the fast version: The current code makes no guarantees at all about the state of dest if an exception is triggered by the predicate (because there is no try/catch/rethrow block that dumps the current bitcache and then rethrows the exception). One might argue that this is a bug/quirk, but I am ok with that current slightly surprising behavior (semantic mismatch between broadcast! to BitArray and Array{Bool}), and setting up an exception handler would be ridiculous.

PPS. PR opened.

5 Likes