The columns and the rows - puzzled about performance

Dear all,

I have been using Julia for about three weeks now and, to my understanding, it uses column-major storage for matrices. If I understand correctly, that would mean that loops whose inner-most loop runs over columns (the second index, right?) should be more efficient than the other way around.

However, that is not what I observe in my numerical experiments. This code fragment shows a typical construct from my Julia program:

NC = 40000000

bu = ones(3,NC);  fc = ones(3,NC);  fu = ones(3,NC)
γ1 = 0.73;  γ2 = 1.0 - γ1

function F(bu::Matrix{Float64},
           fc::Matrix{Float64},
           fu::Matrix{Float64},
           γ1::Float64,
           γ2::Float64,
           NT::Int64)
  @time for i = 1 : NT   # outer loop over columns (second index)
    for j = 1 : 3        # inner loop over rows (first index)
      bu[j,i] = bu[j,i] + γ1 * fc[j,i] + γ2 * fu[j,i]
    end
  end
end

println("Invoke with: F(bu, fc, fu, γ1, γ2, NC)")

This gives me times around 0.24 seconds. The fragment above uses the inner-most loop j to run over rows (the first index), so it should be inefficient. However, if I rewrite this piece of code so that the inner-most loop runs over columns (the second index), like this:

NC = 40000000

bu = ones(NC,3);  fc = ones(NC,3);  fu = ones(NC,3)
γ1 = 0.73;  γ2 = 1.0 - γ1

function F(bu::Matrix{Float64},
           fc::Matrix{Float64},
           fu::Matrix{Float64},
           γ1::Float64,
           γ2::Float64,
           NT::Int64)
  @time for i = 1 : NT   # outer loop over rows (first index)
    for j = 1 : 3        # inner loop over columns (second index)
      bu[i,j] = bu[i,j] + γ1 * fc[i,j] + γ2 * fu[i,j]
    end
  end
end

println("Invoke with: F(bu, fc, fu, γ1, γ2, NC)")

it is consistently some 20% slower, i.e. the execution times are on the order of 0.3 s.

What is going on here? Am I confusing columns with rows?

Please let me know what you think.

Cheers

Linking the corresponding manual section.

1 Like

Thanks. So, if the construct:

  for col = inds, row = inds
    out[row, col] = x[row]
  end

is equivalent to:

  for col = inds
    for row = inds
      out[row, col] = x[row]
    end
  end

which is the more efficient way to traverse a matrix (and is also consistent with what I observe), then the inner-most loop should go over rows, not over columns, right?

Unless I have always misunderstood it, that’s what column-major implies: a column is stored contiguously in memory, so you want subsequent iterations of your loop to access elements in the same column as much as possible. You do this by going row-by-row down the first column, then moving to the second column (which follows the first column in memory) and going through all elements of that column row-by-row, and so on.
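To see this in action, here is a minimal sketch (the function names and the matrix size are just illustrative, not taken from the thread): summing a matrix with the row index innermost versus the column index innermost.

function sum_cols_first(A)                 # inner loop walks down each column
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)   # i (row index) is the innermost loop
        s += A[i, j]
    end
    return s
end

function sum_rows_first(A)                 # inner loop walks along each row
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)   # j (column index) is the innermost loop
        s += A[i, j]
    end
    return s
end

A = rand(5_000, 5_000)
sum_cols_first(A); sum_rows_first(A)       # warm up: the first calls include compilation
@time sum_cols_first(A)                    # contiguous memory access
@time sum_rows_first(A)                    # strided access, typically slower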

3 Likes

Neatly explained, nilshg, now I get a clear picture 🙂

Notice that the type annotations do not help performance (while @inbounds may):

NC = 40000000

bu = ones(3, NC);  fc = ones(3, NC);  fu = ones(3, NC)
γ1 = 0.73;  γ2 = 1.0 - γ1

function F(bu, fc, fu, γ1, γ2)
    nr, nc = size(bu)            # rows first, then columns (bu is 3 × NC)
    @inbounds for j = 1:nc       # outer loop over columns
        for i = 1:nr             # inner loop over rows, contiguous in memory
            bu[i,j] = bu[i,j] + γ1 * fc[i,j] + γ2 * fu[i,j]
        end
    end
end


@time  F(bu, fc, fu, γ1, γ2)   # first call includes compilation time
@time  F(bu, fc, fu, γ1, γ2)
@time  F(bu, fc, fu, γ1, γ2)

println("Invoke with: F(bu, fc, fu, γ1, γ2)")
1 Like

Thanks for the tip!

1 Like

Although it seems already sufficiently clarified, you may perhaps still want to have a look at my take in Why column major? - #36 by zdenek_hurak.

2 Likes

For the record, the Wikipedia article Row- and column-major order is a valuable resource on this topic too.
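As a quick, self-contained illustration of what that layout means in Julia (the small matrix here is just an example):

A = [1 2 3;
     4 5 6]        # a 2×3 matrix
vec(A)             # memory order: [1, 4, 2, 5, 3, 6]; columns are stored back to back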

3 Likes