Understanding assignment without allocation

Rajesh_Talluri · November 11, 2019, 6:29pm

I am trying to understand inplace assignment in julia.

In the following example I want to update a and b. The following code works, and it updates a and b in place without any allocations.

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

@btime testalloc!(a, b, c, d, nsel)
  891.646 ns (0 allocations: 0 bytes)

However, if i have to do it for a subset there are 28 allocations and it is slower


function testalloc1!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

@btime testalloc1!(a, b, c, d, nsel)
  3.350 μs (28 allocations: 24.22 KiB)

and if I wanted to do for the complementary subset there are 40 allocations… and it is even more slower


function testalloc2!(a, b, c, d, nsel)
    @. a[.!nsel] = c * a[.!nsel] + d * b[.!nsel]
    @. b[.!nsel] = d * a[.!nsel] + c * b[.!nsel]
end

@btime testalloc2!(a, b, c, d, nsel)
  3.775 μs (40 allocations: 27.13 KiB)

Could someone please help me understand the correct way to update subsets of arrays.

Thank You

heliosdrm · November 11, 2019, 10:24pm

I’d say it’s not the in-place assignment, but subindexing what consumes the resources:

julia> @btime a;
  2.428 ns (0 allocations: 0 bytes)

julia> @btime a[nsel];
  1.122 μs (3 allocations: 4.17 KiB)

The penalty can be reduced (but I’ve not been able to eliminate it completely) using less subindexing:

function testalloc3!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end

@btime testalloc3!($a, $b, $c, $d, $nsel);
  3.682 μs (20 allocations: 8.66 KiB)


testalloc4!(a, b, c, d, nsel) = testalloc3!(a, b, c, d, .!nsel)

@btime testalloc4!($a, $b, $c, $d, $nsel);
  3.789 μs (22 allocations: 8.89 KiB)

(Edit: use views for the functions to have some effect on the vectors.)

rdeits · November 11, 2019, 10:35pm

Slicing an array in Julia creates a copy of that slice. You can eliminate the copy by creating a view, with view(a, nsel) or @view[a, nsel]. You can also convert all slices to views within a given block using the @views macro. Try ?view or ?@view from the REPL to learn more about views in Julia.

heliosdrm · November 11, 2019, 10:36pm

Yeah, I noticed it just after clicking “reply”… It’s edted now.

Rajesh_Talluri · November 12, 2019, 4:27pm

Thank you for the info about views. I missed it in the performance section. however it does not seem to help much in terms of performance in this case.

n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end
@btime talloca!($a, $b, $c, $d, $nsel)
  31.300 μs (34 allocations: 233.94 KiB)

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
@btime tallocb!($a, $b, $c, $d, $nsel)
  35.900 μs (44 allocations: 234.25 KiB)

function tallocc!(a, b, c, d, nsel)
    asel = view(a, nsel)
    bsel = view(b, nsel)
    @. asel = c * asel + d * bsel
    @. bsel = d * asel + c * bsel
end
@btime tallocc!($a, $b, $c, $d, $nsel)
  16.100 μs (22 allocations: 78.31 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
@btime tallocd!($a, $b, $c, $d, $nsel)
  35.299 μs (44 allocations: 234.25 KiB)

function talloce!(a, b, c, d, nsel)
    for i in findall(nsel)
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
    end
end
@btime talloce!($a, $b, $c, $d, $nsel)
  8.900 μs (2 allocations: 38.89 KiB)

function tallocf!(a, b, c, d, nsel)
    for i in 1:nsel.len
        if nsel[i] == 1
            a[i] = c * a[i] + d * b[i]
            b[i] = d * a[i] + c * b[i]
        end
    end
end
@btime tallocf!($a, $b, $c, $d, $nsel)
  33.899 μs (0 allocations: 0 bytes)

talloce!() and tallocf!() look better than using views. I am very surprised that using findall is so much better.

Is broadcast in general inferior to writing loops?

Could you please share any comments on these approaches.

P.S. This very weird result faster than everything while performing more operations.

function tallocg!(a, b, c, d, nsel)
    for (i,j) in (findall(.!nsel),findall(nsel))
      a[i] = c * a[i] + d * b[i]
      b[i] = d * a[i] + c * b[i]
      a[j] = c * a[j] + d * b[j] + 1.0
      b[j] = d * a[j] + c * b[j] + 1.0
    end
end
@btime tallocg!($a, $b, $c, $d, $nsel)
  7.075 μs (7 allocations: 79.73 KiB)

I am confused now on understanding how to get performance in julia . Could someone please shed some light on what concepts I am missing.

DNF · November 12, 2019, 5:47pm

Try to dot the assignment in tallocb! and tallocd!, that is, use .=

Rajesh_Talluri · November 12, 2019, 6:19pm

Hmm… That made it worse

function tallocb!(a, b, c, d, nsel)
    a[nsel] = c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] = d .* view(a,nsel) .+ c .* view(b,nsel)
end
  38.100 μs (44 allocations: 242.88 KiB)

function tallocb!(a, b, c, d, nsel)
    a[nsel] .= c .* view(a,nsel) .+ d .* view(b,nsel)
    b[nsel] .= d .* view(a,nsel) .+ c .* view(b,nsel)
end
  44.399 μs (70 allocations: 396.00 KiB)


function tallocd!(a, b, c, d, nsel)
    @views a[nsel] = c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] = d .* a[nsel] .+ c .* b[nsel]
end
  36.801 μs (44 allocations: 242.88 KiB)

function tallocd!(a, b, c, d, nsel)
    @views a[nsel] .= c .* a[nsel] .+ d .* a[nsel]
    @views b[nsel] .= d .* a[nsel] .+ c .* b[nsel]
end
  47.601 μs (76 allocations: 474.97 KiB)

BioTurboNick · November 12, 2019, 6:50pm

Using @profile on a looped a[nsel] .= 8:

x() = for i in 1:100000 a[nsel] .= 8 end
x() # compile
using Profile
@profile x()
Profile.print()

shows 1) it is using views internally, so an explicit view may not be necessary; and 2) most of the processing time is being spent within this array.jl line inside the collect(itr::Generator) function:

collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)

It appears that what this is doing is converting the logical bitarray into an array of indexes, which is used to construct a view on the original array based on those indexes. Stepping into this function with Juno.@run x() with the Juno debugger and setting a breakpoint in the Julia file allowed me see what was going on.

Presumably this could be optimized internally?

Mason · November 12, 2019, 7:37pm

UnsafeArrays.jl is useful for squeezing out fewer allocations here.

using UnsafeArrays

n = 1000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsel = rand(n) .> 0.5

function testalloc!(a, b, c, d, nsel)
    @. a = c * a + d * b
    @. b = d * a + c * b
end

function testalloc3!(a, b, c, d, nsel)
    @uviews a b begin 
        asel = a[nsel]
        bsel = b[nsel]
        @. asel = c * asel + d * bsel
        @. bsel = d * asel + c * bsel
    end
end

julia> @btime testalloc!($a, $b, $c, $d, $nsel);
  1.149 μs (0 allocations: 0 bytes)

julia> @btime testalloc3!($a, $b, $c, $d, $nsel);
  1.339 μs (6 allocations: 7.97 KiB)

but it seems that a good ol’ for loop is still the best if you really want to cut out all allocations and get the best speed you can.

function testalloce!(a, b, c, d, nsel)
    @assert length(a) == length(b) == length(nsel)
    @inbounds for i in Iterators.filter(i -> nsel[i], 1:length(nsel))
        a[i] = c * a[i] + d * b[i]
        b[i] = d * a[i] + c * b[i]
    end
end

julia> @btime testalloce!($a, $b, $c, $d, $nsel);
  942.433 ns (0 allocations: 0 bytes)

Rajesh_Talluri · November 12, 2019, 7:40pm

Hmm… so logical indexing is worse than integer indexing? I thought it was the opposite… or at least same

function talloca!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

function tallocb!(a, b, c, d, nsel)
    @. a[nsel] = c * a[nsel] + d * b[nsel]
    @. b[nsel] = d * a[nsel] + c * b[nsel]
end

function ftest()
n = 10000
a = ones(n,1)
b = ones(n,1)
c = 1.0
d = 2.0
nsela = (rand(n) .> 0.5) |> findall
nselb = (rand(n) .> 0.5) 

@btime talloca!($a, $b, $c, $d, $nsela)
@btime tallocb!($a, $b, $c, $d, $nselb)
end

 julia> ftest()
  29.600 μs (22 allocations: 156.69 KiB)
  31.700 μs (34 allocations: 232.44 KiB)

So this shows that integer indexing is better…

Mason · November 12, 2019, 7:45pm

Inferior at what? If you mean “inferior for writing concise, readable, flexible code” then I’d say no, it’s not. If you mean “is it generally possible to write a for loop that outperforms broadcast?” then yes, that’s often true.

Broadcast is just syntax for asking julia to manually create for loops for you under the hood. Anything you do with broadcast, should in theory be possible to do with a well written for loop just as fast or faster but not the other way around.

The real magic of broadcast is its flexibility. If you tried to take your for loop code and apply it to a GPU array, you’re in for a bad time, whereas the broadcast code should ‘just work’. But if you need every last ounce of performance and are willing to spend time and energy writing and maintaining for loops, yes you can get better performance with a for loop unless broadcast happened to already produce the most performant code possible.

Rajesh_Talluri · November 12, 2019, 7:49pm

Thank You! Got it.

mbauman · November 12, 2019, 7:53pm

That’s because you’re ignoring the cost of the findall — I’d expect that.

Rajesh_Talluri · November 12, 2019, 7:56pm

I mean, if I have the choice to input integer or logical indices, then it seems integer indices are better. Is that statement correct?

mbauman · November 12, 2019, 8:03pm

It depends. As always, benchmark it to find out.

Logical indexing has the big advantage that its bounds check is a single size check. It has disadvantages in that it needs to count the number of trues ahead of time and then needs to iterate over all elements (including, of course, the falses).

If you can safely turn off bounds checks and can disregard the cost of creating them, then a sorted list of integers should always win.

Topic		Replies	Views
Broadcasted assignment slower than for loop when indices are arbitrary New to Julia question , broadcast	15	1441	August 23, 2021
Migrating code from Matlab: Accessing arrays without allocating Performance	20	1947	November 19, 2018
Understanding major order performance when broadcasting in column vs row operations Performance question , array , benchmark	9	1005	June 21, 2021
Array broadcasting slower than numpy? Performance	20	722	June 4, 2022
Too many allocations when indexing with slices Performance indexing , memory-allocation	16	2749	August 17, 2018

Understanding assignment without allocation

Related topics