Allocations when overwriting array elements in-place

I have a relatively large array in which I want to set all values below a certain threshold to zero:

@time A = rand(1000,1000,1000)
@time bool = A .< 0.5
@time A[bool] .= 0.0

5.653554 seconds (2 allocations: 7.451 GiB, 2.64% gc time)
1.698547 seconds (6 allocations: 119.214 MiB, 20.14% gc time)
5.429685 seconds (5 allocations: 3.725 GiB, 1.63% gc time)

I would have expected the final step to allocate essentially no memory, since I am overwriting elements of an existing array in place. However, the amount of memory actually allocated equals that of creating an entirely new array containing just the overwritten values (Julia 1.7.2).

Interestingly, when I instead use a view, the allocations occur during the creation of the view rather than the overwriting step.

@time Av = @view A[bool]
@time Av .= 0.0

2.786945 seconds (5 allocations: 3.725 GiB, 9.23% gc time)
1.292666 seconds

Could anyone help me understand why these allocations occur and whether there is a way to avoid them?

Those are the allocations of the broadcast mechanism. Put it in a function and those should go away.

It doesn’t look like it, unfortunately - when wrapped in a function, the same amount of memory is allocated in the final step:

function func()
    @time A = rand(1000,1000,1000)
    @time bool = A .< 0.5
    @time A[bool] .= 0.0
end

func()

3.824082 seconds (2 allocations: 7.451 GiB, 2.76% gc time)
2.128477 seconds (4 allocations: 119.213 MiB, 18.11% gc time)
4.128135 seconds (4 allocations: 3.725 GiB, 0.22% gc time)

I’m pretty sure this has to do with how the view of A is created from bool:

julia> A = rand(100, 100, 100);

julia> bool = A .< 0.5;

julia> using BenchmarkTools

julia> @btime @view($A[$bool]);
  428.214 μs (4 allocations: 3.81 MiB)

(Note that I reduced all the dimensions by a factor of 10 to speed up testing).

Even using views here doesn’t help because creating the view of A ends up computing the indices of all the nonzero elements of bool, which is where the allocation comes from:

julia> C = @view A[bool];

julia> C.indices
([4, 5, 6, 9, 10, 11, 12, 14, 16, 17  …  999986, 999988, 999989, 999990, 999991, 999992, 999993, 999996, 999998, 1000000],)
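As a rough sanity check (assuming about half the entries fall below 0.5), the size of that index vector accounts for the reported allocation:

count(bool) * sizeof(Int)   # ~500_000 true entries * 8 bytes ≈ 3.8 MiB for the 100^3 case

The same arithmetic for the original 1000^3 array gives roughly 500_000_000 * 8 bytes ≈ 3.7 GiB, which is the allocation reported in the question.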

If you’re primarily interested in removing the allocations, then that’s really easy: Just write a loop:

julia> function set_using_mask!(A, bool)
         @assert axes(A) == axes(bool)
         @inbounds for i in eachindex(A)
           if bool[i]
             A[i] = 0
           end
         end
       end
set_using_mask! (generic function with 1 method)

julia> @btime ($A[$bool] .= 0);
  1.089 ms (4 allocations: 3.81 MiB)

julia> @btime set_using_mask!($A, $bool)
  3.799 ms (0 allocations: 0 bytes)

I was surprised to find that the loop is actually slower than the allocating broadcast. Clearly there’s some interesting benefit to the way A[bool] .= 0 computes all the nonzero indices first, then iterates over them. Indeed, we can replicate some of that effect:

julia> function set_using_mask_allocating!(A, bool)
         @assert axes(A) == axes(bool)
         indices = findall(bool)
         @inbounds for i in indices
           A[i] = 0
         end
       end
set_using_mask_allocating! (generic function with 1 method)

julia> @btime set_using_mask_allocating!($A, $bool)
  2.454 ms (2 allocations: 11.44 MiB)

I’m definitely surprised that this is faster.
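One possible explanation (just a guess, not verified here) is branch misprediction: bool is essentially random, so the if inside the plain loop is unpredictable, while iterating over a precomputed index list has no data-dependent branch. A branchless variant using ifelse might be worth benchmarking; this is only a sketch (the name set_using_mask_branchless! is mine), assuming A holds floating-point values:

function set_using_mask_branchless!(A, bool)
    @assert axes(A) == axes(bool)
    @inbounds for i in eachindex(A, bool)
        # ifelse evaluates both arguments, so there is no data-dependent branch here
        A[i] = ifelse(bool[i], zero(eltype(A)), A[i])
    end
    return A
end

It touches every element instead of only the masked ones, so whether it actually helps will depend on how dense the mask is.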


I think A[bool] creates a large temporary array (the index vector mentioned above); doing the whole thing with map! avoids it:

julia> @time map!(x -> x < 0.5 ? 0.0 : x, A, A);
 0.840729 seconds (14.05 k allocations: 823.300 KiB, 0.98% compilation time)

Looks like a job for replace!:

function foo!(A)
    bool = A .< 0.5
    A[bool] .= 0.0
    return A
end

bar!(A) = replace!(x->(x<0.5 ? zero(x) : x), A);

Benchmark:

1.7.2> @btime foo!(A) setup=(A=rand(100,100,100)) evals=1;
  2.003 ms (8 allocations: 3.92 MiB)

1.7.2> @btime bar!(A) setup=(A=rand(100,100,100)) evals=1;
  276.800 μs (0 allocations: 0 bytes)
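For completeness, a fused in-place broadcast is another mask-free option along the same lines; just a sketch, not benchmarked here:

# In-place fused broadcast: no bool mask and no index vector are materialized.
@. A = ifelse(A < 0.5, 0.0, A)

Like bar!, it rewrites every element rather than only the masked ones, so the relative speed will depend on the hardware and on how sparse the mask is.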