Performance of array broadcasting

Julia v0.6.0. I have the following snippets. I wonder why “snippet_1” is so much more efficient than the others. Is it possible to make the others as efficient as “snippet_1”? Thanks!

i_ = 40
j_ = 16
a1 = rand(i_,j_,2);


# snippet_1: 0.0005 seconds (5.20 k allocations: 302.531 KiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = rand(2) .* [1,1]
    end
end

# snippet_2: 0.032269 seconds (25.43 k allocations: 1.150 MiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = rand(2) .* 1
    end
end

# snippet_3: 0.030712 seconds (44.24 k allocations: 1.946 MiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = a1ij .* [1,1]
    end
end

# snippet_4: 0.032377 seconds (24.79 k allocations: 1.091 MiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = a1ij .* 1
    end
end

You’re creating temporaries by slicing. Use @view A[i:j] etc.

Thanks for the suggesion about @view, but I’m not sure if I’m using it the right way? Using the code below, @view seems to be less efficient.


julia> @time x = a1[1,1,:]
  0.000005 seconds (6 allocations: 272 bytes)
2-element Array{Float64,1}:
 0.228546
 0.995806

julia> @time x = @view a1[1,1,:]
  0.000074 seconds (37 allocations: 896 bytes)
2-element SubArray{Float64,1,Array{Float64,3},Tuple{Int64,Int64,Base.Slice{Base.OneTo{Int64}}},true}:
 0.228546
 0.995806

Well the other thing is that your indexing is backwards. In a column-major language (Julia), you want to be slicing by columns, i.e. the first index. You want @view a1[:,1,1]. All of your loops are backwards too.

(Don’t benchmark in global scope. Use .= for in-place assignment that fuses with operations like .*.)

Note that for initializing only two components, the overhead even of creating a view, or calling broadcast, or allocating a temporary array like [1,1] will be very significant. For such a smll array in an inner loop, the fastest things are to write out your own loop or to use something like the StaticArrays package.

2 Likes

That is a good point. But my code has to slice it in all dimensions. Am I supposed to use permutedims to set the dimension I want to slice the 1st?

Thanks for the suggestion, but in this case below .= seems less efficient than =

# snippet_1: 0.0005 seconds (5.20 k allocations: 302.531 KiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = rand(2) .* [1,1]
    end
end

# snippet_5: 0.006824 seconds (28.24 k allocations: 762.531 KiB)
@time begin
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] .= rand(2) .* [1,1]
    end
end

Again, you’re timing globals and compilation times. Put it in a function. And “vectorization” only makes sense when the vectors are sufficiently large (even then, it won’t be as efficient as a loop, just close).

I’m not sure what you mean by put it in a function. This is what I did, below = still seems more efficient than .=. I ran the last two lines a couple times to let it compile before write down the time elapsed.

function f1()
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] = rand(2) .* [1,1]
    end
end

function f5()
    for i in 1:i_, j in 1:j_
        a1ij = a1[i,j,:]
        a1[i,j,:] .= rand(2) .* [1,1]
    end
end

@time f1() # 0.001070 seconds (5.21 k allocations: 302.688 KiB)
@time f5() # 0.008428 seconds (28.25 k allocations: 762.688 KiB)

I guess you mean creating a vector by slicing an array isn’t as efficient as a loop? That’s a good point and I’ll remember it for future use. But for my code, I slice it because I need to use the vector for matrix multiplication later, I haven’t thought of a way to avoid slicing it into a vector.

Read the performance tips. In particular you are having problems with global variables, which is what the “put it in a function comment” was meant to address. Your attempt to put it in a function didn’t address the problem, because you were still referring to global variables, rather than passing them as arguments.

Consider the following, the only difference between the two function is that f1 takes an argument, and thus can compile to specialized code based on the type of a. Whereas f2 refers to a global variable, and since that variable can be changed at will, julia won’t specialize the code of f2 based on the type of a. Thus f2 is slow.

a=rand(40,16,2)
function f1(a)
    for i = 1:size(a,1), j=1:size(a,2)
        a[i,j,:] = rand(2)
    end
end
function f2()
    for i = 1:size(a,1), j=1:size(a,2)
        a[i,j,:] = rand(2)
    end
end
@time f1(a) #0.000263 seconds
@time f2()  #0.001927 seconds
3 Likes

ook. Now I get it. Thank everyone for the great advice!