Understanding assignment without allocation

I am trying to understand inplace assignment in julia.

In the following example I want to update a and b. The following code works, and it updates a and b in place without any allocations.

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

@btime testalloc!(a, b, c, d, nsel)
  891.646 ns (0 allocations: 0 bytes)

However, if i have to do it for a subset there are 28 allocations and it is slower


function testalloc1!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

@btime testalloc1!(a, b, c, d, nsel)
  3.350 μs (28 allocations: 24.22 KiB)

and if I wanted to do for the complementary subset there are 40 allocations… and it is even more slower


function testalloc2!(a, b, c, d, nsel)
    @. a[.!nsel] = c * a[.!nsel] + d * b[.!nsel]
    @. b[.!nsel] = d * a[.!nsel] + c * b[.!nsel]
end

@btime testalloc2!(a, b, c, d, nsel)
  3.775 μs (40 allocations: 27.13 KiB)

Could someone please help me understand the correct way to update subsets of arrays.

Thank You

1 Like

I’d say it’s not the in-place assignment, but subindexing what consumes the resources:

julia> @btime a;
  2.428 ns (0 allocations: 0 bytes)

julia> @btime a[nsel];
  1.122 μs (3 allocations: 4.17 KiB)

The penalty can be reduced (but I’ve not been able to eliminate it completely) using less subindexing:

function testalloc3!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end

@btime testalloc3!($a, $b, $c, $d, $nsel);
  3.682 μs (20 allocations: 8.66 KiB)


testalloc4!(a, b, c, d, nsel) = testalloc3!(a, b, c, d, .!nsel)

@btime testalloc4!($a, $b, $c, $d, $nsel);
  3.789 μs (22 allocations: 8.89 KiB)

(Edit: use views for the functions to have some effect on the vectors.)

2 Likes

Slicing an array in Julia creates a copy of that slice. You can eliminate the copy by creating a view, with view(a, nsel) or @view[a, nsel]. You can also convert all slices to views within a given block using the @views macro. Try ?view or ?@view from the REPL to learn more about views in Julia.

2 Likes

Yeah, I noticed it just after clicking “reply”… It’s edted now.

Thank you for the info about views. I missed it in the performance section. however it does not seem to help much in terms of performance in this case.

n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end
@btime talloca!($a, $b, $c, $d, $nsel)
  31.300 μs (34 allocations: 233.94 KiB)

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
@btime tallocb!($a, $b, $c, $d, $nsel)
  35.900 μs (44 allocations: 234.25 KiB)

function tallocc!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end
@btime tallocc!($a, $b, $c, $d, $nsel)
  16.100 μs (22 allocations: 78.31 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
@btime tallocd!($a, $b, $c, $d, $nsel)
  35.299 μs (44 allocations: 234.25 KiB)

function talloce!(a, b, c, d, nsel)
    for i in findall(nsel)
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
    end
end
@btime talloce!($a, $b, $c, $d, $nsel)
  8.900 μs (2 allocations: 38.89 KiB)

function tallocf!(a, b, c, d, nsel)
    for i in 1:nsel.len
        if nsel[i] == 1
            a[i] = c * a[i] + d * b[i]
            b[i] = d * a[i] + c * b[i]
        end
    end
end
@btime tallocf!($a, $b, $c, $d, $nsel)
  33.899 μs (0 allocations: 0 bytes)

talloce!() and tallocf!() look better than using views. I am very surprised that using findall is so much better.

Is broadcast in general inferior to writing loops?

Could you please share any comments on these approaches.

P.S. This very weird result faster than everything while performing more operations.

function tallocg!(a, b, c, d, nsel)
    for (i,j) in (findall(.!nsel),findall(nsel))
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
      a[j] = c * a[j] + d * b[j] + 1.0
      b[j] = d * a[j] + c * b[j] + 1.0
    end
end
@btime tallocg!($a, $b, $c, $d, $nsel)
  7.075 μs (7 allocations: 79.73 KiB)

I am confused now on understanding how to get performance in julia :confused: . Could someone please shed some light on what concepts I am missing.

Try to dot the assignment in tallocb! and tallocd!, that is, use .=

Hmm… That made it worse

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
  38.100 μs (44 allocations: 242.88 KiB)
function tallocb!(a, b, c, d, nsel)
    a[nsel] .= c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] .= d .* view(a,nsel) .+ c .* view(b,nsel)
end
  44.399 μs (70 allocations: 396.00 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
  36.801 μs (44 allocations: 242.88 KiB)
function tallocd!(a, b, c, d, nsel)
    @views a[nsel] .= c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] .= d .* a[nsel] .+ c .* b[nsel]
end
  47.601 μs (76 allocations: 474.97 KiB)

Using @profile on a looped a[nsel] .= 8:

x() = for i in 1:100000 a[nsel] .= 8 end
x() # compile
using Profile
@profile x()
Profile.print()

shows 1) it is using views internally, so an explicit view may not be necessary; and 2) most of the processing time is being spent within this array.jl line inside the collect(itr::Generator) function:

collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)

It appears that what this is doing is converting the logical bitarray into an array of indexes, which is used to construct a view on the original array based on those indexes. Stepping into this function with Juno.@run x() with the Juno debugger and setting a breakpoint in the Julia file allowed me see what was going on.

Presumably this could be optimized internally?

1 Like

UnsafeArrays.jl is useful for squeezing out fewer allocations here.

using UnsafeArrays

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

function testalloc3!(a, b, c, d, nsel)
    @uviews a b begin 
        asel = a[nsel]
        bsel = b[nsel]
        @. asel = c * asel + d * bsel
        @. bsel = d * asel + c * bsel
    end
end

julia> @btime testalloc!($a, $b, $c, $d, $nsel);
  1.149 μs (0 allocations: 0 bytes)

julia> @btime testalloc3!($a, $b, $c, $d, $nsel);
  1.339 μs (6 allocations: 7.97 KiB)

but it seems that a good ol’ for loop is still the best if you really want to cut out all allocations and get the best speed you can.

function testalloce!(a, b, c, d, nsel)
    @assert length(a) == length(b) == length(nsel)
    @inbounds for i in Iterators.filter(i -> nsel[i], 1:length(nsel))
        a[i] = c * a[i] + d * b[i]
        b[i] = d * a[i] + c * b[i]
    end
end

julia> @btime testalloce!($a, $b, $c, $d, $nsel);
  942.433 ns (0 allocations: 0 bytes)
3 Likes

Hmm… so logical indexing is worse than integer indexing? I thought it was the opposite… or at least same

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

function tallocb!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end
function ftest()
n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsela = (rand(n) .> 0.5) |> findall
nselb = (rand(n) .> 0.5) 

@btime talloca!($a, $b, $c, $d, $nsela)
@btime tallocb!($a, $b, $c, $d, $nselb)
end
 julia> ftest()
  29.600 μs (22 allocations: 156.69 KiB)
  31.700 μs (34 allocations: 232.44 KiB)

So this shows that integer indexing is better… :confused:

Inferior at what? If you mean “inferior for writing concise, readable, flexible code” then I’d say no, it’s not. If you mean “is it generally possible to write a for loop that outperforms broadcast?” then yes, that’s often true.

Broadcast is just syntax for asking julia to manually create for loops for you under the hood. Anything you do with broadcast, should in theory be possible to do with a well written for loop just as fast or faster but not the other way around.

The real magic of broadcast is its flexibility. If you tried to take your for loop code and apply it to a GPU array, you’re in for a bad time, whereas the broadcast code should ‘just work’. But if you need every last ounce of performance and are willing to spend time and energy writing and maintaining for loops, yes you can get better performance with a for loop unless broadcast happened to already produce the most performant code possible.

6 Likes

Thank You! Got it.:+1:

That’s because you’re ignoring the cost of the findall — I’d expect that.

1 Like

I mean, if I have the choice to input integer or logical indices, then it seems integer indices are better. Is that statement correct?

It depends. As always, benchmark it to find out.

Logical indexing has the big advantage that its bounds check is a single size check. It has disadvantages in that it needs to count the number of trues ahead of time and then needs to iterate over all elements (including, of course, the falses).

If you can safely turn off bounds checks and can disregard the cost of creating them, then a sorted list of integers should always win.

3 Likes