Loops are just as fast in Julia as they are in Fortran, and a well-written loop is often the single fastest way to implement an algorithm.

What is your code actually supposed to do? Your inner loop over k in 1:p just does the same operation over and over. Removing that loop would obviously speed up your code, but I suspect that isn’t what you want.

Also, have you read the Julia performance tips? That’s the first place to look when learning how to make your Julia code faster.

```julia
n = 5000
p = 5000
m = 5000
A = rand(n, p);
B = rand(p, m);
C = zeros(Float64, n, m)
for i in 1:n
    for j in 1:m
        C[i, j] = 0.0
        for k in 1:p
            C[i, j] = C[i, j] + A[i, j] * B[i, j]  # note: k never appears in the body
        end
    end
end
```

I would add: are you timing your code exactly as shown, or do you have it inside a function?

```julia
function f()
    n = 5
    p = 5
    m = 5
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end
@time f()
```

You should try switching the order of the loops. In Julia, matrices are stored column-major, which means that to access A[i,j] linearly in memory (which is fastest), i must be the variable that changes fastest.
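A quick way to see the column-major layout for yourself (a minimal sketch):

```julia
# Columns are contiguous in memory: vec(A) walks down each column in turn.
A = [1 3;
     2 4]
@assert vec(A) == [1, 2, 3, 4]   # column 1 first, then column 2
```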

Therefore, instead of

```julia
for i in 1:n
    for j in 1:m
        C[i, j] = 0.0
        ....
    end
end
```

try this:

```julia
for j in 1:m
    for i in 1:n
        C[i, j] = 0.0
        ....
    end
end
```

(the only change is that i is traversed in the inner loop now).
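To see the effect in isolation, here is a hypothetical micro-benchmark (the function names are made up for illustration) that sums a matrix with each loop order:

```julia
function sum_rows_first(A)           # i in the outer loop: strided access
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

function sum_cols_first(A)           # j in the outer loop: contiguous access
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

A = rand(2000, 2000)
@time sum_rows_first(A)
@time sum_cols_first(A)   # typically noticeably faster on large matrices
```

Both return the same sum; only the memory-access pattern differs.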

```julia
function f(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end
```

```julia
julia> @time f(1000,1000,1000)
  1.756890 seconds (10 allocations: 22.889 MiB, 2.62% gc time)

julia> @time f(2000,2000,2000)
 27.841935 seconds (10 allocations: 91.553 MiB, 0.36% gc time)

julia> @time f(5000,5000,5000)
429.574190 seconds (10 allocations: 572.205 MiB, 0.02% gc time)
```

You may also want to try multithreading. Just adding Threads.@threads to the outermost loop significantly improved performance.

```julia
function f2(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    Threads.@threads for j in 1:m
        for i in 1:n
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end
```

```julia
julia> @time f2(5000,5000,5000); # with JULIA_NUM_THREADS=12
 52.803032 seconds (11 allocations: 572.205 MiB, 0.21% gc time)
```

FYI: you need to set the environment variable JULIA_NUM_THREADS before you start Julia.
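For example (assuming a bash-like shell; since Julia 1.5 you can also pass the thread count as a command-line flag):

```shell
# Set the thread count before launching Julia
export JULIA_NUM_THREADS=12
julia

# or, equivalently, via the command-line flag
julia --threads 12
```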

Even without multithreading, if you avoid indexing into the arrays 125 billion times, you can improve the performance a lot:

```julia
function f3(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            aux = 0.0
            A_i_j = A[i, j]   # hoisted out of the k loop
            B_i_j = B[i, j]
            for k in 1:p
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end
```

However, the toy example is starting to outlive its usefulness. Nothing is faster than loops, since a loop is just a jump back to an earlier instruction. Instead, one should talk about the operation to be performed and how it can be carried out as efficiently as possible.

```julia
function f3(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    @inbounds for i in 1:n        # skip bounds checks
        for j in 1:m
            aux = 0.0
            A_i_j = A[i, j]
            B_i_j = B[i, j]
            @simd for k in 1:p    # allow vectorization of the reduction
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end
```

Remember that you can get a nice speed boost if you don’t need Float64 (down to about 7 seconds with Float32), and an even bigger one if your CPU has AVX-512.

```julia
function f5(n, p, m)
    A = rand(Float32, n, p)
    B = rand(Float32, p, m)
    C = zeros(Float32, n, m)
    @inbounds for i in 1:n
        for j in 1:m
            aux = 0.0f0   # Float32 literal
            A_i_j = A[i, j]
            B_i_j = B[i, j]
            @simd for k in 1:p
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end
```

```julia
julia> @time f5(5000,5000,5000)
  7.271317 seconds (10 allocations: 286.103 MiB, 1.38% gc time)
```

Knowing what multiplication means is indeed useful. But I guess the question was about how to use Julia syntax, not how to “rethink” the code to go from O(n^3) to O(n^2).
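For completeness: since k never appears in the body of the inner loop, the p identical additions collapse into a single multiplication. That is the O(n^2) rewrite of what the posted loops actually compute (a sketch of the computation as written, not necessarily what was intended; the function name is made up):

```julia
# Each C[i,j] in the posted code is A[i,j]*B[i,j] added p times,
# so one O(n^2) pass over the matrices suffices.
function f_collapsed(A, B, p)
    C = similar(A)
    for j in axes(A, 2), i in axes(A, 1)   # column-major friendly order
        C[i, j] = p * A[i, j] * B[i, j]
    end
    return C
end
```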

In that case, @Aquaman: just use a BLAS library; it will be many times faster than the loops.

```julia
# If C has not already been allocated
C = A * B

# If you pre-allocated C
using LinearAlgebra
mul!(C, A, B)
```
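A small usage sketch showing that mul! fills a pre-allocated output in place (sizes here are arbitrary for illustration):

```julia
using LinearAlgebra

A = rand(3, 4)
B = rand(4, 2)
C = Matrix{Float64}(undef, 3, 2)   # pre-allocated output buffer

mul!(C, A, B)        # overwrites C with A*B, no fresh allocation for the result
@assert C ≈ A * B
```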

The BLAS libraries also use loops, but there are actually five of them, not three!*
They loop over an optimized microkernel. The kernel is architecture-dependent, so OpenBLAS (and other optimized BLAS libraries) will choose the right one for your CPU.
They’re also multithreaded.

*Although two of the loops are probably going to be completely unrolled.
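To illustrate, here is a minimal sketch of the blocking (tiling) idea behind those extra loops. The block size `bs = 64` is an arbitrary illustrative choice, not a tuned value, and the function name is made up; note it also uses the correct `A[i,k] * B[k,j]` indexing for an actual matrix product:

```julia
function blocked_mul!(C, A, B; bs = 64)
    n, p = size(A)
    m = size(B, 2)
    fill!(C, zero(eltype(C)))
    # Outer loops walk over blocks; inner loops work within one block,
    # so the active tiles of A, B, and C stay in cache.
    @inbounds for jb in 1:bs:m, kb in 1:bs:p, ib in 1:bs:n
        for j in jb:min(jb + bs - 1, m)
            for k in kb:min(kb + bs - 1, p)
                b = B[k, j]
                @simd for i in ib:min(ib + bs - 1, n)
                    C[i, j] += A[i, k] * b   # i innermost: column-major friendly
                end
            end
        end
    end
    return C
end
```

A real BLAS goes much further (packed buffers, architecture-specific microkernels, threading), but the loop structure above is the basic shape.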