Understanding assignment without allocation

I am trying to understand inplace assignment in julia.

In the following example I want to update a and b. The following code works, and it updates a and b in place without any allocations.

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

@btime testalloc!(a, b, c, d, nsel)
  891.646 ns (0 allocations: 0 bytes)

However, if i have to do it for a subset there are 28 allocations and it is slower


function testalloc1!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

@btime testalloc1!(a, b, c, d, nsel)
  3.350 Ī¼s (28 allocations: 24.22 KiB)

and if I wanted to do for the complementary subset there are 40 allocationsā€¦ and it is even more slower


function testalloc2!(a, b, c, d, nsel)
    @. a[.!nsel] = c * a[.!nsel] + d * b[.!nsel]
    @. b[.!nsel] = d * a[.!nsel] + c * b[.!nsel]
end

@btime testalloc2!(a, b, c, d, nsel)
  3.775 Ī¼s (40 allocations: 27.13 KiB)

Could someone please help me understand the correct way to update subsets of arrays.

Thank You

2 Likes

Iā€™d say itā€™s not the in-place assignment, but subindexing what consumes the resources:

julia> @btime a;
  2.428 ns (0 allocations: 0 bytes)

julia> @btime a[nsel];
  1.122 Ī¼s (3 allocations: 4.17 KiB)

The penalty can be reduced (but Iā€™ve not been able to eliminate it completely) using less subindexing:

function testalloc3!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end

@btime testalloc3!($a, $b, $c, $d, $nsel);
  3.682 Ī¼s (20 allocations: 8.66 KiB)


testalloc4!(a, b, c, d, nsel) = testalloc3!(a, b, c, d, .!nsel)

@btime testalloc4!($a, $b, $c, $d, $nsel);
  3.789 Ī¼s (22 allocations: 8.89 KiB)

(Edit: use views for the functions to have some effect on the vectors.)

2 Likes

Slicing an array in Julia creates a copy of that slice. You can eliminate the copy by creating a view, with view(a, nsel) or @view[a, nsel]. You can also convert all slices to views within a given block using the @views macro. Try ?view or ?@view from the REPL to learn more about views in Julia.

2 Likes

Yeah, I noticed it just after clicking ā€œreplyā€ā€¦ Itā€™s edted now.

Thank you for the info about views. I missed it in the performance section. however it does not seem to help much in terms of performance in this case.

n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end
@btime talloca!($a, $b, $c, $d, $nsel)
  31.300 Ī¼s (34 allocations: 233.94 KiB)

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
@btime tallocb!($a, $b, $c, $d, $nsel)
  35.900 Ī¼s (44 allocations: 234.25 KiB)

function tallocc!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end
@btime tallocc!($a, $b, $c, $d, $nsel)
  16.100 Ī¼s (22 allocations: 78.31 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
@btime tallocd!($a, $b, $c, $d, $nsel)
  35.299 Ī¼s (44 allocations: 234.25 KiB)

function talloce!(a, b, c, d, nsel)
    for i in findall(nsel)
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
    end
end
@btime talloce!($a, $b, $c, $d, $nsel)
  8.900 Ī¼s (2 allocations: 38.89 KiB)

function tallocf!(a, b, c, d, nsel)
    for i in 1:nsel.len
        if nsel[i] == 1
            a[i] = c * a[i] + d * b[i]
            b[i] = d * a[i] + c * b[i]
        end
    end
end
@btime tallocf!($a, $b, $c, $d, $nsel)
  33.899 Ī¼s (0 allocations: 0 bytes)

talloce!() and tallocf!() look better than using views. I am very surprised that using findall is so much better.

Is broadcast in general inferior to writing loops?

Could you please share any comments on these approaches.

P.S. This very weird result faster than everything while performing more operations.

function tallocg!(a, b, c, d, nsel)
    for (i,j) in (findall(.!nsel),findall(nsel))
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
      a[j] = c * a[j] + d * b[j] + 1.0
      b[j] = d * a[j] + c * b[j] + 1.0
    end
end
@btime tallocg!($a, $b, $c, $d, $nsel)
  7.075 Ī¼s (7 allocations: 79.73 KiB)

I am confused now on understanding how to get performance in julia :confused: . Could someone please shed some light on what concepts I am missing.

Try to dot the assignment in tallocb! and tallocd!, that is, use .=

Hmmā€¦ That made it worse

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
  38.100 Ī¼s (44 allocations: 242.88 KiB)
function tallocb!(a, b, c, d, nsel)
    a[nsel] .= c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] .= d .* view(a,nsel) .+ c .* view(b,nsel)
end
  44.399 Ī¼s (70 allocations: 396.00 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
  36.801 Ī¼s (44 allocations: 242.88 KiB)
function tallocd!(a, b, c, d, nsel)
    @views a[nsel] .= c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] .= d .* a[nsel] .+ c .* b[nsel]
end
  47.601 Ī¼s (76 allocations: 474.97 KiB)

Using @profile on a looped a[nsel] .= 8:

x() = for i in 1:100000 a[nsel] .= 8 end
x() # compile
using Profile
@profile x()
Profile.print()

shows 1) it is using views internally, so an explicit view may not be necessary; and 2) most of the processing time is being spent within this array.jl line inside the collect(itr::Generator) function:

collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)

It appears that what this is doing is converting the logical bitarray into an array of indexes, which is used to construct a view on the original array based on those indexes. Stepping into this function with Juno.@run x() with the Juno debugger and setting a breakpoint in the Julia file allowed me see what was going on.

Presumably this could be optimized internally?

1 Like

UnsafeArrays.jl is useful for squeezing out fewer allocations here.

using UnsafeArrays

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

function testalloc3!(a, b, c, d, nsel)
    @uviews a b begin 
        asel = a[nsel]
        bsel = b[nsel]
        @. asel = c * asel + d * bsel
        @. bsel = d * asel + c * bsel
    end
end

julia> @btime testalloc!($a, $b, $c, $d, $nsel);
  1.149 Ī¼s (0 allocations: 0 bytes)

julia> @btime testalloc3!($a, $b, $c, $d, $nsel);
  1.339 Ī¼s (6 allocations: 7.97 KiB)

but it seems that a good olā€™ for loop is still the best if you really want to cut out all allocations and get the best speed you can.

function testalloce!(a, b, c, d, nsel)
    @assert length(a) == length(b) == length(nsel)
    @inbounds for i in Iterators.filter(i -> nsel[i], 1:length(nsel))
        a[i] = c * a[i] + d * b[i]
        b[i] = d * a[i] + c * b[i]
    end
end

julia> @btime testalloce!($a, $b, $c, $d, $nsel);
  942.433 ns (0 allocations: 0 bytes)
3 Likes

Hmmā€¦ so logical indexing is worse than integer indexing? I thought it was the oppositeā€¦ or at least same

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

function tallocb!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end
function ftest()
n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsela = (rand(n) .> 0.5) |> findall
nselb = (rand(n) .> 0.5) 

@btime talloca!($a, $b, $c, $d, $nsela)
@btime tallocb!($a, $b, $c, $d, $nselb)
end
 julia> ftest()
  29.600 Ī¼s (22 allocations: 156.69 KiB)
  31.700 Ī¼s (34 allocations: 232.44 KiB)

So this shows that integer indexing is betterā€¦ :confused:

Inferior at what? If you mean ā€œinferior for writing concise, readable, flexible codeā€ then Iā€™d say no, itā€™s not. If you mean ā€œis it generally possible to write a for loop that outperforms broadcast?ā€ then yes, thatā€™s often true.

Broadcast is just syntax for asking julia to manually create for loops for you under the hood. Anything you do with broadcast, should in theory be possible to do with a well written for loop just as fast or faster but not the other way around.

The real magic of broadcast is its flexibility. If you tried to take your for loop code and apply it to a GPU array, youā€™re in for a bad time, whereas the broadcast code should ā€˜just workā€™. But if you need every last ounce of performance and are willing to spend time and energy writing and maintaining for loops, yes you can get better performance with a for loop unless broadcast happened to already produce the most performant code possible.

7 Likes

Thank You! Got it.:+1:

Thatā€™s because youā€™re ignoring the cost of the findall ā€” Iā€™d expect that.

1 Like

I mean, if I have the choice to input integer or logical indices, then it seems integer indices are better. Is that statement correct?

It depends. As always, benchmark it to find out.

Logical indexing has the big advantage that its bounds check is a single size check. It has disadvantages in that it needs to count the number of trues ahead of time and then needs to iterate over all elements (including, of course, the falses).

If you can safely turn off bounds checks and can disregard the cost of creating them, then a sorted list of integers should always win.

3 Likes