# Something faster than for loops

Dear All,

I am an “old” programmer used to Fortran and Matlab. I can’t get rid of For and While loops, but I know Julia can do things much faster!

I am using the following code

```julia
n = 5
p = 5
m = 5

A = rand(n, p)
B = rand(p, m)
C = zeros(Float64, n, m)
for i in 1:n
    for j in 1:m
        C[i, j] = 0.0
        for k in 1:p
            C[i, j] = C[i, j] + A[i, j] * B[i, j]
        end
    end
end
```

If n is big, e.g. n > 10000, it slows down a lot.

How can the code be made faster in Julia?

Loops are just as fast in Julia as they are in Fortran, and a well-written loop is often the single fastest way to implement an algorithm.

What is your code actually supposed to do? Your inner loop over `k in 1:p` just does the same operation over and over. Removing that loop would obviously speed up your code, but I suspect that isn’t what you want.

Also, have you read the Julia performance tips? That’s the first place to start to learn how to make your Julia code faster.


Hi,

It is just an example.

If n = m = p = 5000:

```julia
n = 5000
p = 5000
m = 5000

A = rand(n, p);
B = rand(p, m);
C = zeros(Float64, n, m)
for i in 1:n
    for j in 1:m
        C[i, j] = 0.0
        for k in 1:p
            C[i, j] = C[i, j] + A[i, j] * B[i, j]
        end
    end
end
```

The simulation then takes very long.

I would add: are you timing your code as shown (in global scope), or do you have it inside a function?

```julia
function f()
    n = 5
    p = 5
    m = 5

    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end

@time f()
```

This is what I get

```julia
julia> @time f()
  0.033329 seconds (59.47 k allocations: 3.133 MiB)

julia> @time f()
  0.000006 seconds (7 allocations: 1024 bytes)
```
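The big gap between the two calls is compilation: the first `@time` measures Julia's JIT compiling `f` plus running it, the second measures only the already-compiled code. A minimal sketch using only Base (function name is mine) that shows the same effect:

```julia
# First call to a function triggers JIT compilation; later calls reuse
# the compiled code, so the second timing is typically far smaller.
double_plus_one(x) = x .* 2 .+ 1

t1 = @elapsed double_plus_one(rand(1000))  # includes compilation
t2 = @elapsed double_plus_one(rand(1000))  # compiled code only
println(t1, "  ", t2)
```

For steadier numbers, the `@btime` macro from the BenchmarkTools package runs many samples and reports a minimum that excludes compilation.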


You should try switching the order of the loops. In Julia, matrices are stored column-major, which means that to access `A[i,j]` linearly in memory (which is fastest), `i` should be the variable that changes fastest.

```julia
for i in 1:n
    for j in 1:m
        C[i, j] = 0.0
        ...
    end
end
```

Try this:

```julia
for j in 1:m
    for i in 1:n
        C[i, j] = 0.0
        ...
    end
end
```

(the only change is that `i` is traversed in the inner loop now).
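To see the effect in isolation, here is a sketch (function names are mine) that sums a matrix with each loop order; on large matrices the column-major traversal is usually several times faster:

```julia
# Column-major storage: A[1,1], A[2,1], A[3,1], ... are adjacent in memory,
# so the row index i should change fastest (innermost loop).
function sum_colmajor(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)  # i innermost: sequential access
        s += A[i, j]
    end
    return s
end

function sum_rowmajor(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)  # j innermost: strided access
        s += A[i, j]
    end
    return s
end
```

Both return the same sum; only the memory access pattern differs.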

```julia
function f(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end
```

```julia
julia> @time f(1000, 1000, 1000)
  1.756890 seconds (10 allocations: 22.889 MiB, 2.62% gc time)

julia> @time f(2000, 2000, 2000)
 27.841935 seconds (10 allocations: 91.553 MiB, 0.36% gc time)

julia> @time f(5000, 5000, 5000)
429.574190 seconds (10 allocations: 572.205 MiB, 0.02% gc time)
```

In the case of n = 5000 …

I just gave you the result!

```julia
429.574190 seconds (10 allocations: 572.205 MiB, 0.02% gc time)
```

Thanks. You may also want to try multithreading. Just adding `Threads.@threads` to the outermost loop significantly improved performance.

```julia
function f2(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    Threads.@threads for i in 1:n
        for j in 1:m
            C[i, j] = 0.0
            for k in 1:p
                C[i, j] = C[i, j] + A[i, j] * B[i, j]
            end
        end
    end
end
```

```julia
julia> @time f2(5000, 5000, 5000); # with JULIA_NUM_THREADS=12
 52.803032 seconds (11 allocations: 572.205 MiB, 0.21% gc time)
```

F.Y.I.: you need to set the environment variable `JULIA_NUM_THREADS` before you start Julia.
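For example (the thread count 12 here is just an illustration):

```julia
# Threads cannot be added to a running session; start Julia with either
#   JULIA_NUM_THREADS=12 julia     (environment variable)
# or
#   julia --threads 12             (command-line flag, Julia 1.5+)
# and then check how many threads the session actually got:
println("running with ", Threads.nthreads(), " thread(s)")
```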


Even without multithreading, if you avoid indexing into the arrays 125 billion times, you can improve the performance a lot:

```julia
function f3(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    for i in 1:n
        for j in 1:m
            aux = 0.0
            A_i_j = A[i, j]
            B_i_j = B[i, j]
            for k in 1:p
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end
```

```julia
@time f3(5000, 5000, 5000)
127.116506 seconds (10 allocations: 572.205 MiB, 0.10% gc time)
```

And `@inbounds` and `@simd`.

However, the toy example is starting to wear out its usefulness. There is nothing faster than loops since loops are just a way of jumping back to an earlier instruction. Instead one should talk about the operation that should be performed and how that can be done as efficiently as possible.


Just for fun

```julia
function f3(n, p, m)
    A = rand(n, p)
    B = rand(p, m)
    C = zeros(Float64, n, m)
    @inbounds for i in 1:n
        for j in 1:m
            aux = 0.0
            A_i_j = A[i, j]
            B_i_j = B[i, j]
            @simd for k in 1:p
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end
```

Single-threaded code on my 2013 laptop:

```julia
@time f3(5000, 5000, 5000)
 14.279610 seconds (10 allocations: 572.205 MiB, 0.69% gc time)
```

Great, this is the answer that I need!

Remember that you can get a nice speed boost if you don’t need Float64 (going down to 7 seconds with Float32). The gain is even bigger if your CPU has AVX-512.

```julia
function f5(n, p, m)
    A = rand(Float32, n, p)
    B = rand(Float32, p, m)
    C = zeros(Float32, n, m)
    @inbounds for i in 1:n
        for j in 1:m
            aux = 0.0f0
            A_i_j = A[i, j]
            B_i_j = B[i, j]
            @simd for k in 1:p
                aux += A_i_j * B_i_j
            end
            C[i, j] = aux
        end
    end
end

@time f5(5000, 5000, 5000)
  7.271317 seconds (10 allocations: 286.103 MiB, 1.38% gc time)
```

Why not `aux = p * A_i_j * B_i_j` instead?

But I think some things are wrong with this code. I don’t know what it’s supposed to do.
It looks like the code is simply doing:

```julia
C .= p .* A .* B
```

But the dimensions don’t match up correctly; we only happen to have `n == m == p` in the examples. If they differ, the `@inbounds` versions could segfault, etc.

Knowing what multiplication means is indeed useful. I guess the question was about “Julia syntax”, not about rethinking the code to go from O(n^3) to O(n^2).

Is this example trying to do matrix multiplication? That’s what the sizes of the input and output matrices would suggest…


I think so. That would explain the three loops.

In that case, @Aquaman:
Just use the BLAS library; it will be many times faster than the loops.

```julia
using LinearAlgebra

# If C has not already been allocated
C = A * B

# If you pre-allocated C
mul!(C, A, B)
```

The BLAS libraries also use loops – but there are actually five of them, not three! They loop over an optimized microkernel. That kernel is architecture-dependent, so OpenBLAS (and other optimized BLAS libraries) will choose the correct one based on your CPU.
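For completeness, if matrix multiplication is indeed the goal, a corrected loop version would actually use the `k` index to pair a row of `A` with a column of `B`. A sketch (name is mine, loops ordered for column-major access; `mul!` will still beat it by a wide margin):

```julia
function matmul_loops(A, B)
    n, p = size(A)
    p2, m = size(B)
    p == p2 || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(promote_type(eltype(A), eltype(B)), n, m)
    for j in 1:m          # columns of C and B in the outer loop
        for k in 1:p
            b_kj = B[k, j]
            for i in 1:n  # i innermost: contiguous access into A and C
                C[i, j] += A[i, k] * b_kj
            end
        end
    end
    return C
end
```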