Improving performance of a nested for loop

Hi, sorry this is kind of a beginner/performance question. I want to generate a matrix of all of the pairwise differences among the rows of a matrix. The code below works, but gets slow as matrices get large (e.g., 1000 x 200 )

 c = [1 2 3 ; 4 5 6 ; 7 8 9 ; 10 11 12];
 cdist = [0 0 0];
 
for j = 1:3:
    for h = (j+1):4 
        global cdist
        cdist = vcat(cdist, abs.(c[j, :] - c[h, :])')
        end
end
 
cdist = cdist[2:(size(cdist)[1]), :];

I read in the performance tips that it’s preferable to avoid using global variables, but I don’t understand how to output the matrix at the end without doing this…
Thanks in advance!

Welcome! Since I saw you edited your post a couple of times to try to fix the formatting, you might want to take a look at PSA: how to quote code with backticks which explains how to quote your code properly.

3 Likes

yes, thank you very much. you read my mind

To answer your actual question, there are a lot of things we can do to improve this code. The first one is to, as you mentioned, avoid using global variables. This is easy if you put your code into a function, which is always a good idea:

julia> function distance_matrix(c)
           cdist = [0 0 0]

           for j = 1:3
               for h = (j+1):4
                   cdist = vcat(cdist, abs.(c[j, :] - c[h, :])')
               end
           end

           cdist = cdist[2:(size(cdist)[1]), :];
           return cdist
       end
distance_matrix (generic function with 1 method)

julia> c = [1 2 3 ; 4 5 6 ; 7 8 9 ; 10 11 12];

julia> distance_matrix(c)
6×3 Array{Int64,2}:
 3  3  3
 6  6  6
 9  9  9
 3  3  3
 6  6  6
 3  3  3

Before we work on performance of the code, the first thing to do is to make sure it gives the correct output. Is this what you expect to get?

8 Likes

yes! this is the result I’m looking for

Ok, then we should measure the performance so we know if we’re having an effect:

julia> using Pkg

julia> Pkg.add("BenchmarkTools")

julia> using BenchmarkTools

julia> @btime distance_matrix($c)
  1.134 μs (44 allocations: 4.36 KiB)

1 microsecond is indeed pretty slow for an operation on such a small matrix. There are several things in this code that are likely to be slow:

  1. It creates a lot of intermediate vectors and copies: When you do c[j, :], you create a copy of that row of the matrix, which is wasteful in this case. You can do @view(c[j, :]) to avoid that copy. That won’t make a big difference when each row is three elements, but it’s important for larger slices.
  2. It repeatedly calls vcat to build up a matrix, which is going to be pretty slow because vcat-int a matrix requires allocating a whole new matrix to hold the result, copying the data over, and then adding the new row.

One easy way to get really good performance in this case is to pre-allocate the entire result matrix and then write out the loop:

# Returns the nth triangle number, i.e. the sum of
# 1 + 2 + ... + n
#
# (I learned this algorithm from a wonderful book called "The Number Devil"
# when I was a kid).
function triangle_number(n)
    if iseven(n)
        (n ÷ 2) * (n + 1)
    else
        ((n + 1) ÷ 2) * n
    end
end


function distance_matrix2(c)
    # Allocate a matrix to hold the result. The number of rows
    # depends on the `n - 1`th triangle number, where n is the number
    # of rows of `c`. Hooray Math!
    cdist = zeros(eltype(c), triangle_number(size(c, 1) - 1), size(c, 2))
    row = 1
    for j in 1:(size(c, 1) - 1)
        for h in (j + 1):size(c, 1)
            for k in 1:size(c, 2)
                cdist[row, k] = abs(c[j, k] - c[h, k])
            end
            row += 1
        end
    end
    return cdist
end

Let’s check the result:

julia> distance_matrix2(c) == distance_matrix(c)
true

And see how it performs:

julia> @btime distance_matrix2($c)
  61.270 ns (1 allocation: 224 bytes)

Okay, almost 20x faster, and using 20x less memory. That’s a good start.

14 Likes

Of course we can’t just stop here, because it wouldn’t be Julia if we didn’t try to make it perfect. The next issue I see with the code is that the inner-most loop is over each row of the matrix c. Iterating over the rows of a matrix in Julia is, in general, slower than iterating over the columns. That’s because matrices in Julia are stored in column-major layout, so each column is contiguous in memory. Note that this is similar to Matlab and Fortran but different than Numpy and C for boring historical reasons.

We can make the operations a bit faster, especially on large matrices, if you’re willing to decide to operate over columns instead of rows:

function distance_matrix3(c)
    cdist = zeros(eltype(c), size(c, 1), triangle_number(size(c, 2) - 1))
    col = 1
    for j in 1:(size(c, 2) - 1)
        for h in (j + 1):size(c, 2)
            for k in 1:size(c, 1)
                cdist[k, col] = abs(c[k, j] - c[k, h])
            end
            col += 1
        end
    end
    return cdist
end
julia> c_cols = copy(c')
3×4 Array{Int64,2}:
 1  4  7  10
 2  5  8  11
 3  6  9  12

julia> distance_matrix3(c_cols)
3×6 Array{Int64,2}:
 3  6  9  3  6  3
 3  6  9  3  6  3
 3  6  9  3  6  3

julia> @btime distance_matrix3($c_cols)
  58.142 ns (1 allocation: 224 bytes)

That’s a few percent faster. The difference will be larger for big matrices due to cache locality.

12 Likes

Next, you might notice that we’re creating cdist as a matrix of zeros and then setting every element. That means we’re wasting time filling the initial array with zeros. We can avoid that by constructing a matrix with undef, which tells Julia to just allocate memory but not fill it with anything:

function distance_matrix4(c)
    cdist = Matrix{eltype(c)}(undef, size(c, 1), triangle_number(size(c, 2) - 1))
    col = 1
    for j in 1:(size(c, 2) - 1)
        for h in (j + 1):size(c, 2)
            for k in 1:size(c, 1)
                cdist[k, col] = abs(c[k, j] - c[k, h])
            end
            col += 1
        end
    end
    return cdist
end
julia> @btime distance_matrix4($c_cols)
  55.658 ns (1 allocation: 224 bytes)

Very slightly faster, but the return on effort is diminishing at this point.

There’s more you can do, like using @inbounds to avoid bounds checks, using @simd to vectorize your innermost loop, and using StaticArrays.jl to represent your data as a collection of small vectors rather than a matrix. I won’t go into detail on that here because it’s been covered in a lot of other posts on this forum, but you can dig as deep as you’re interested.

Overall, though, I think the message is that if you:

  • Pre-allocate your outputs when possible
  • Avoid accidentally creating lots of copies
  • Don’t be afraid to write out a loop

then you can write Julia which is pretty close to optimal in terms of performance without going too far down the rabbit hole.

19 Likes

Wow, thank you so much for all of your work on this! This is a huge help!

This final code brought things from not finishing after > 1hr to finishing in about a second for a 132 x 533 matrix (on a laptop from 2012). I’m amazed.

6 Likes

you lucky bastard.

what a quality performance debug walk through (on a fortunately concise enough problem)

7 Likes

I know, right? Within minutes of posting, I got a masterclass

2 Likes

What a great answer!

Tiny remark regarding the ‘triangle number’

For integers n, n * (n+1) is always an even number, it therefore simplifies to

triangle_number(n) = (n * (n+1)) ÷ 2
4 Likes

Technically the first form is slightly more overflow resistant. Not sure if there is a way to get the benefits of both easily.

4 Likes

You can actually use sum(1:n), which is fast due to the nature of ranges, and has similar overflow characteristics to the first form.

(slightly faster than that, on my pc at least, is sum(Base.OneTo(n)))

4 Likes