Comprehensions and vectorization

I am evaluating the performance of array construction in Julia using comprehensions versus explicit loops by comparing the following sample codes.

Code 1:

function foo()
    s = 500
    n = 9000
    m = zeros(s, s)
    tic()
    for k = 1:n
        for i in 1:s, j in 1:s
            m[i, j] = j
        end
    end
    toc()
end

Code 2:

function foo()
    s = 500
    n = 9000
    m = zeros(s, s)
    tic()
    for k = 1:n
        m = [j for i in 1:s, j in 1:s]
    end
    toc()
end

Code 3:

function foo()
    s = 500
    n = 9000
    m = zeros(s, s)
    tic()
    for k = 1:n
        m .= [j for i in 1:s, j in 1:s]
    end
    toc()
end

On my PC, code 1 runs in about 5.4 s, code 2 in about 8.3 s, and code 3 in about 12.9 s.

I guess that code 2 is slightly slower than code 1 due to the cost of allocating the temporary array, but I would have expected code 3 to be roughly on par with code 1 thanks to vectorization.

Similarly I verified that the following code:

m = rand(s, s)

is faster than the following code, when tested as before:

m .= rand(s, s)

What is Julia doing in these cases when using .=?

Thanks

Julia in these cases is essentially doing the loop

for i in eachindex(m)
  m[i] = ...
end

i.e. it’s an in-place update of the existing array m, instead of rebinding m to a new array.

Code 1 is an in-place update which allocates no arrays.

Code 2 creates a new array via the comprehension and rebinds m to it (so you don’t need to pre-allocate here).

Code 3 keeps making a new array in the comprehension on every iteration, then fills m with it element by element in a loop.

With that understanding it should be pretty obvious why code 1 < code 2 < code 3. Code 2 < code 3 since code 3 has an additional loop, whereas code 2 just changes a pointer. Vectorization in the form that creates temporary arrays is always slower, but it’s okay in most scripting languages because their loops and the language itself are slower than the vectorized functions, which are written in C. Since Julia’s standard constructs compile to code that is as fast as C (and the standard library is written in Julia, so vectorized functions are just calling other Julia code), contorting code to fit inside a giant vectorized function call is not better in any a priori sense, so it’s really just the most efficient calculation which matters. Here, that means simply reducing the number of temporary arrays and the number of loops.
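To make the three patterns concrete, here is a minimal sketch (the smaller size and the helper name `fill_loop!` are my own choices, not from the original code):

```julia
s = 100
m = zeros(s, s)

# Code 1 pattern: in-place loop, no temporary arrays.
function fill_loop!(m, s)
    for i in 1:s, j in 1:s
        m[i, j] = j
    end
    return m
end
fill_loop!(m, s)

# Code 2 pattern: the comprehension allocates a fresh array; `=` just rebinds m.
m = [j for i in 1:s, j in 1:s]

# Code 3 pattern: the comprehension still allocates a fresh array, and `.=`
# then copies every element into the existing m — a loop on top of the allocation.
m .= [j for i in 1:s, j in 1:s]
```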

The one that you did miss is:

for k in 1:n
  m .= j
end

which is the looping-free version of your first code. Broadcasting can get as fast as the loop in an asymptotic sense, but vectorization like this isn’t strictly better in any sense.

Also, you should look into BenchmarkTools.jl for doing this kind of benchmarking.
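For reference, a typical BenchmarkTools.jl call looks something like this (assuming the package is installed; `fill_cols!` is a made-up name for the fill loop, and `$` interpolates the global so the measurement reflects the kernel itself):

```julia
using BenchmarkTools

s = 500
m = zeros(s, s)

# Column-major fill loop to benchmark.
function fill_cols!(m)
    for j in axes(m, 2), i in axes(m, 1)
        m[i, j] = j
    end
    return m
end

# @btime runs the expression many times and reports the minimum time
# and the number of allocations.
@btime fill_cols!($m)
```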

“vectorization” does not make code fast, and what you are doing is not really vectorization anyway.
In general, .= is slower than =. It can only be faster by removing allocations on the RHS through fusion. In all the code above you are always allocating an array on the RHS, so with = you are not doing anything extra, but with .= you copy the data and then discard the temporary array, and that extra copy makes it slower.

On a (probably) unrelated note, I’m puzzled as to why m in function 2 is an array of Int64s, yet in function 3 it’s an array of Float64s.

By the way, you are making a classic mistake in Code 1: looping over the dimensions in a suboptimal order. Julia arrays are stored in column-major order. If you swap i and j and change your code to

function foo()
    s = 500
    n = 9000
    m = zeros(s, s)
    for k in 1:n
        for j in 1:s, i in 1:s 
            m[i, j] = j
        end
    end
    return m
end

it’s more than three times as fast on my computer.
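One quick way to see the column-major layout (and hence why the j-outer order is faster):

```julia
A = [1 2;
     3 4]

# vec walks the array in memory order: columns are contiguous.
vec(A)  # [1, 3, 2, 4]
```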

Oh, and check out BenchmarkTools.jl for timing of code. It’s basically the standard way of timing in Julia.

Thank you for your answers, but sorry, now I have more questions…

@ChrisRackauckas

If I now understand what .= does correctly, then I think the following code should trigger an error:

m .= [j for i in 1:s, j in 1:s]

like the following code does:

for i in eachindex(m)
    m[i] = [j for i in 1:s, j in 1:s]
end

Moreover, I cannot make the following code work:

for k in 1:n
  m .= j
end

could you please give me a hint on how to add it to my code examples?

@cormullion

I also wonder why m is an array of Float64 in code 1 and 3, while it is an array of Int64 in code 2. I think it should be an array of Int64 in all three codes.

@DNF

Thank you for your hint. I confirm that looping first over columns and then over rows is much faster. So is there a reason why neither eachindex nor comprehensions loop in that order?

On my PC your code runs in about 1.8 s. In my opinion it would be very useful to have a compact vectorized way to construct an array that reaches similar performance.
Maybe something like:

m .= .[j for i in 1:s, j in 1:s]

or:

m .= j for i in 1:s, j in 1:s

or:

m .= collect.(j for i in 1:s, j in 1:s)

(and they should all loop in the correct order).

No, because m is an array of size (s,s) and [j for i in 1:s, j in 1:s] is an array of size (s,s). Note that the comprehension is syntax for creating a new array. It will always create an array, so don’t use it if you don’t want the array. Broadcasts will “fuse”, but this will not turn into a loop: it will turn into two separate stages, first building a new array and then looping to save the new array element-wise into m (which is by far the worst way to perform this!).

The best answer is @DNF’s loop. If you want to vectorize it a bit, you’d do something like

function foo()
    s = 500
    n = 9000
    m = zeros(s, s)
    for k in 1:n
        for j in 1:s 
          m[:, j] .= j
        end   
    end
    return m
end

but again, don’t think that vectorizing is “good”. It can sometimes make code cleaner, but it has nothing to do with optimizing code (if anything it can add extra operations).

zeros(s,s) by default creates an array of Float64. You’re looking for zeros(Int,s,s). But here’s what happened. The first and third codes write into the array m so it stays a Float64. The second code created a new array (of Int) and replaced m with this new array, so it changed types when you did the =.
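A small check of the element types involved (Int64 assumes a 64-bit machine):

```julia
eltype(zeros(2, 2))                 # Float64: the default for zeros
eltype(zeros(Int, 2, 2))            # Int64 (on a 64-bit machine)
eltype([j for i in 1:2, j in 1:2])  # Int64: the comprehension infers Int
```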

eachindex is in that order. Comprehensions are in the order you chose, which was outer then inner.

For some reason you seem really confused still. The issue is that you are mixing two concepts: pre-allocation and vectorization. There are two ways you can build the array of your dreams. One is to create an array with m = zeros(s,s) and then fill it with the values. That is like code 1. The other is to simply create the array using a comprehension, collect on a generator, etc. In this second case, don’t do m = zeros(s,s)! The generator or whatnot is all that is needed, since you are creating the array in those expressions. In case 2, m = [j for i in 1:s, j in 1:s] is perfectly fine if you drop m = zeros(s,s), which is simply not used at all.

No, it’s perfectly well defined so it must be allowed. It’s not really the parser’s job to figure out what you want to do, and there are also cases where this might not have any performance issue at all (and can even be faster than a naively fused loop).

.= is NOT an assignment; it’s a mutation of the object, which cannot change its type. OTOH, = is assignment, and the type of the LHS after the assignment is completely unrelated to its type before this point.
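A short sketch of the difference:

```julia
m = zeros(2, 2)   # Float64 array
m .= [1 2; 3 4]   # mutation: the Int values are converted to Float64
eltype(m)         # Float64 — the array's element type cannot change

m = [1 2; 3 4]    # assignment: the name m is rebound to a new Int array
eltype(m)         # Int64 (on a 64-bit machine)
```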

eachindex does. Comprehensions loop in whatever order you write them, so it’s up to you to write the right order.

I think you’ve got how .= works wrong. The above is basically the same as this:

temp = [j for i in 1:s, j in 1:s]  # first, create array
m .= temp  # then, copy it into `m` element-wise.

This causes redundant work compared to just

m = [j for i in 1:s, j in 1:s]

We could potentially change this and make m .= comprehension fuse by definition.

On this general subject as a whole, Julia really does have all the tools in this area that one could want, but I think it’s still a bit confusing to the newcomer – a clear and definitive guide to vectorization and looping in Julia remains to be written. We’ve got some great blog posts on the subject, but one article that lays it all out doesn’t seem to exist.

Also the idea of allowing this syntax with a (non-allocating) generator was discussed in Broadcast had one job (e.g. broadcasting over iterators and generator) · Issue #18618 · JuliaLang/julia · GitHub, i.e. writing m .= (j for i in 1:s, j in 1:s), which seems natural.

Although this is not directly related to the OP’s question: regarding performance, I would add that @inbounds (e.g. in front of m[i,j]) should increase the performance further (about 7% on my computer).

Sure, although I don’t think the lack of allocation would be observable in the array case, so it would be valid even as a pure optimization to overwrite m in place.

IMHO the clearest notation would be:

m .= .[j for i in 1:s, j in 1:s]

which I find in line with other loop fusion notation, such as .+, .*, etc.
