Can the compiler automatically optimize this code?


#1

The code below is very slow:

w=rand(300,300);x=rand(300,300);
r=vec(rand(300,1));b=rand(300,1);
j=12;N=300;

function orig()    
    sAll=0.0;k=1;
    while  k<=1000000
        sAll=sAll+sum(w[:,j].*(r + b[j]*x[:,j]));
        k=k+1;
    end
    return sAll;
end

And I write a helper function to speed it up:

function w_r_b_x_n(w,r,b,x,j,N)
    sAll=0.0;k=1;
    while  k<=1000000
        s=0.0;i=1;bj=b[j];
        while  i<=N
            @inbounds s+=(w[i,j]*(r[i] + bj*x[i,j])); 
            i=i+1;
        end
        k=k+1;sAll=sAll+s;
    end
    return sAll;
end

I wrote a bunch of similar magical helper functions. Can the compiler automatically optimize this code?


#2

Without information about the variables you’re using, it’s impossible to know. Can you produce a reproducible example?


#3

Please take a look at PSA: make it easier to help you otherwise it will be very hard to provide meaningful help.


#4

Thanks a lot. This is the test code:

w=rand(300,300);x=rand(300,300);
r=vec(rand(300,1));b=rand(300,1);
j=12;N=300;

function orig()    
    sAll=0.0;k=1;
    while  k<=1000000
        sAll=sAll+sum(w[:,j].*(r + b[j]*x[:,j]));
        k=k+1;
    end
    return sAll;
end

function w_r_b_x_n(w,r,b,x,j,N)
    sAll=0.0;k=1;
    while  k<=1000000
        s=0.0;i=1;bj=b[j];
        while  i<=N
            @inbounds s+=(w[i,j]*(r[i] + bj*x[i,j])); 
            i=i+1;
        end
        k=k+1;sAll=sAll+s;
    end
    return sAll;
end

@time orig()
@time w_r_b_x_n(w,r,b,x,j,N)

#5

Please add three backtics like this ``` before and after your code.

You needn’t write all your code in loops to achieve performance. You can use the sum intrinsic which is very fast and more accurate, just take care of unnecessary allocations. Write the sum like this:

sum( @. @views w[:,j] * (r + b[j]*d[:,j]) )

and it will be almost as fast as the loop.

Curerntly, @views still allocates, but this may change in the future and you will not need to write loops for performance.


#6

You can also do sum(w[i, j] * (r + b[j] * x[i, j]) for i in 1:N) to avoid all allocations.

(BTW, instead of while and the manual handling of i, better do for i in 1:N.)


#7

If I get it correctly, I think it will not give the exact same result as sum(array), see this for example.


#8

Indeed currently it won’t, but depending on the situation it may or may not matter.


#9

Seems like making all your globals const would fix performance more easily.