Speed up column looping in matrix

I am working on a project where one of the functions is the bottleneck of speed. The problem seems to do with accessing a large matrix. I’d appreciate any advice to improve the code.

The problem is represented by the following MWE. test3() is what I need to do in the project and I found it slow. test1() and test2() are for comparisons.

using Statistics, BenchmarkTools

function test1()
    c  = rand(1000)
    M  = rand(256, 1000)
    a  = zero(eltype(c))
    @inbounds for i in 1:1000
       @views a += mean(c[i] .+ M[:,1]) # a fixed column of M
    end
    return a
end     

function test2()
    c = rand(1000)
    M = rand(256, 1000)
    a = zero(eltype(c))
    @inbounds for i in 1:1000
        @views a += mean(c[i] .+ M[:, 1 + i % 10]) # cycling through the first 10 columns of M
    end
    return a
end

function test3()
    c = rand(1000)
    M = rand(256, 1000)
    a = zero(eltype(c))
    @inbounds for i in 1:1000
       @views a += mean(c[i] .+ M[:, i]) # using each column of M
    end
    return a
end

The execution time and memory allocation is:

julia> @btime test1();
  789.300 μs (1003 allocations: 4.04 MiB)

julia> @btime test2();
  801.600 μs (1003 allocations: 4.04 MiB)

julia> @btime test3();
  985.400 μs (1003 allocations: 4.04 MiB)

test3() is more than 20% slower than test1() and test2(), though memory allocation is the same. If there a way to improve test3()'s performance?

Many thanks.

This is allocating a new array at every iteration. Preallocating the array of results, evaluating it and then computing the mean will be faster. (I’m assuming here that this is not exactly the computation you need, the other options posted below can be faster for this specific calculation).

julia> @btime test3() #original
  661.438 μs (1003 allocations: 4.04 MiB)
1007.7474721530921

julia> function test3()
           c = rand(1000)
           M = rand(256, 1000)
           a = zero(eltype(c))
           d = zeros(256)
           @inbounds for i in 1:1000
              @views d .= c[i] .+ M[:,i]
              a += mean(d) # using each column of M
           end
           return a
       end
test3 (generic function with 1 method)

julia> @btime test3()
  310.763 μs (4 allocations: 1.96 MiB)
991.5350998679983

1 Like

I just used algebra. But the slowest part is actually about generating the random numbers

using Random
function test3()
    Random.seed!(1)
    c = rand(1000)
    M = rand(256, 1000)
    a = zero(eltype(c))
    @inbounds for i in 1:1000
       @views a += mean(c[i] .+ M[:, i]) # using each column of M
    end
    return a
end

using BenchmarkTools

test3()
@benchmark test3()

function test4()
    Random.seed!(1)
    c = rand(1000)
    M = rand(256, 1000)
    a = sum(c) + mean(M)*1000
end

test4()
@benchmark test4()

test3() ≈ test4() #true
2 Likes
function test3()
           c = rand(1000)
           M = rand(256, 1000)
           a = zero(eltype(c))
           @inbounds for i in 1:1000
               s1 = 0.0
               for j  in 1:size(M, 1)
                   s1 += c[i] + M[j,i]
               end
               a += s1/size(M,1)
           end
           return a
       end

If that is the actual calculation you are performing @DataFrames approach is probably the best. But let us move the random number generation out of the function, and checking the loop speed:

julia> function test3(M,c)
                  #c = rand(1000)
                  #M = rand(256, 1000)
                  a = zero(eltype(c))
                  @inbounds for i in 1:1000
                      s1 = 0.0
                      for j  in 1:size(M, 1)
                          s1 += c[i] + M[j,i]
                      end
                      a += s1/size(M,1)
                  end
                  return a
              end
test3 (generic function with 3 methods)

julia> @btime test3($M,$c) # DataFrames version above
  230.734 μs (0 allocations: 0 bytes)
1020.9477509086946

This is probably close to the fastest you can get without multi-threading:

julia> using LoopVectorization 

julia> function test3(M,c)
         a = zero(eltype(c))
         for i in axes(M,2)
           s1 = 0.0
           @turbo for j in axes(M,1)
             s1 += c[i] + M[j,i]
           end
           a += s1
         end
         return a/size(M,1)
       end
test3 (generic function with 2 methods)

julia> @btime test3($M,$c)
  36.215 μs (0 allocations: 0 bytes)
1020.9477509086943

You can get some speedup with muliti-threading:

julia> using Base.Threads # 4 cores here

julia> function test3(M,c)
         a = zeros(nthreads())
         @threads for i in axes(M,2)
           s = 0.
           @turbo for j in axes(M,1)
             s += c[i] + M[j,i]
           end
           a[threadid()] += s
         end
         return sum(a)/size(M,1)
       end
test3 (generic function with 2 methods)

julia> @btime test3($M,$c)
  18.697 μs (42 allocations: 3.80 KiB)
1020.9477509086946

Using the cheap threads from Polyester is even better:

julia> using Polyester

julia> function test3(M,c)
         a = zeros(nthreads())
         @batch for i in axes(M,2)
           s = 0.
           @turbo for j in axes(M,1)
             s += c[i] + M[j,i]
           end
           a[threadid()] += s
         end
         return sum(a)/size(M,1)
       end
test3 (generic function with 3 methods)

julia> @btime test3($M,$c)
  12.903 μs (1 allocation: 144 bytes)
1020.947750908695

3 Likes

As @lmiq also prefaces, this advice may not be relevant unless this is literally your function, but I think re-arranging some of your operations can really benefit a lot. Here is your function as a one-liner that is pretty close to the speed of the @turbo version Leandro gives you:

test4(M,c) = sum(z->z[1]+mean(z[2]), zip(c, eachcol(M)))

Still pretty readable, in my opinion, and I think for operations like this the Julia compiler is pretty good at applying SIMD-like optimizations. This of course depends a lot on the fact that you’re using + everywhere, though, so that we can rearrange things so aggressively. This can be reduced even more to

test5(M,c) = sum(c) + sum(M)/size(M,1)

which is even a little bit faster, kind of like @xiaodai is saying.

In general, I think the point that people are making here is that it might make sense to try and optimize the actual math operations a bit more before trying to optimize the code, because there’s a lot of performance on the table just by doing that, even if you don’t want to think about SIMD/threads/eliding bounds checking/etc at all.

4 Likes

I thank all of you for the information and suggestions; they are all very helpful to me! Though my real task is not the literal sum() job, the suggestions pointed me to promising directions.

@lmiq showed the importance of preallocating array and the power of multi-threading; @xiaodai proved that a little math goes a long way; @DataFrames showed the strength of the for-loop in Julia; and @cgeoga summarized it all and the one-liner is amazing!

A good lesson to me about Julia as well as programming in general.

5 Likes