Sparse matrix-vector product: much more slow than Matlab

matlab
optimization

#1

Hi guys,

I’m very new in Julia and recently I’m doing some sparse matrix-vector calculations and want to transfer to Julia from Matlab. It turns out for the same codes, it takes Matlab about 2 seconds, but about 98 seconds for Julia. I know I must be doing something wrong and hope to get some advise here. The codes are as follows:

for time = 1:200 
        tmp = 0.5 * (sparseR + reshape(q' * sparseS, 199, 199));
        rhs = (sparseM + tmp) * q;
        next = (sparseM - tmp) \ rhs;
end

where q is vector with length 199, sparseS is a 199 X 199 X 199 sparse matrix, sparseR is a 199 X 199 sparse matrix, sparseM is a 199 X 199 sparse matrix.

These codes run very fast in Matlab and I don’t know how to optimize it in Julia. Let me know if any part is confusing for you. I really appreciate any comments or suggestions. Thanks!


#2

Are you timing in global scope?


#3

Please post enough code so it can actually be run (so including the creation of the matrices).


#4

Hi @StefanKarpinski, thanks for your reply! Yes, I’m using @time in global score and it shows 90.948970 seconds (4.01 k allocations: 182.124 MB, 0.11% gc time)


#5

Hi @kristoffer.carlsson, thanks for your suggestion!

Here are the complete codes you can run. (The creation of the matrices is ugly, but it works. Also I’m not timing the creation part. I only care about the for loop part inside the testFun function.)


# The creation of sparse matrices. You can ignore this ugly part.

numGrids = 799;
q = rand(numGrids);
rows1 = zeros(Int64,numGrids*numGrids);
cols1 = zeros(Int64,numGrids*numGrids);
vals1 = zeros(numGrids*numGrids);
rows2 = zeros(Int64,numGrids*numGrids);
cols2 = zeros(Int64,numGrids*numGrids);
vals2 = zeros(numGrids*numGrids);
rows3 = zeros(Int64,numGrids*numGrids*numGrids);
cols3 = zeros(Int64,numGrids*numGrids*numGrids);
vals3 = zeros(numGrids*numGrids*numGrids);

count1 = 0;
count2 = 0;
count3 = 0;

for i = 1:numGrids
    for j = 1:numGrids
        r = rand();
        if (r > 0.965)
            count1 += 1;
            rows1[count1] = i;
            cols1[count1] = j;
            vals1[count1] = r;
        end
        
        r = rand();
        if (r > 0.965)
            count2 += 1;
            rows2[count2] = i;
            cols2[count2] = j;
            vals2[count2] = r;
        end        
    end
    
    for k = 1:(numGrids*numGrids)
        r = rand();
        if (r > 0.977)
            count3 += 1;
            rows3[count3] = i;
            cols3[count3] = k;
            vals3[count3] = r;
        end
    end
end

rows1 = rows1[1:count1];
cols1 = cols1[1:count1];
vals1 = vals1[1:count1];
rows2 = rows2[1:count2];
cols2 = cols2[1:count2];
vals2 = vals2[1:count2];
rows3 = rows3[1:count3];
cols3 = cols3[1:count3];
vals3 = vals3[1:count3];

sparseM = sparse(rows1,cols1,vals1,numGrids,numGrids);
sparseR = sparse(rows2,cols2,vals2,numGrids,numGrids);
sparseS = sparse(rows3,cols3,vals3,numGrids,numGrids*numGrids);

# I want to optimize this testFun function

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    for time = 1:numIters 
        tmp = 0.5 * 0.05 * (sparseR + reshape(q' * sparseS, numGrids, numGrids));
        rhs = (sparseM + tmp) * q;
        forwardQ = (sparseM - tmp) \ rhs;
        q = forwardQ;
    end
end

# Here is how I'm timing the codes.
@time testFun(sparseM,sparseR,sparseS,q,100,799)

I find that my original vector q has type 799 X 1 Array{Float64,2} and the speed of Julia is very slow. When I change q to be 799-element Array{Float64,1}, the speed is much much faster. I don’t understand why, but hope it can help future readers and hope someone can explain this.

Here are the updated timings:

For the same codes, Matlab takes Elapsed time is 4.509614 seconds.
Julia shows 6.867342 seconds (3.00 k allocations: 4.283 GB, 15.98% gc time)


#6
  1. Please quote your code so that it’s easy to read: PSA: how to quote code with backticks
  2. semicolons at the end of each line are not necessary
  3. When benchmarking code, you will not get accurate results when timing in global scope. That’s why @StefanKarpinski asked you earlier. Instead, put the code you’re timing in a function. Running your code in a function, I see 3.699408 seconds (41.60 k allocations: 3.787 GiB, 5.39% gc time) which is already quite close to what you reported MATLAB as giving.

You are seeing a lot of allocations because your code really does allocate a lot of memory. In particular, you are constructing new matrices to hold a lot of intermediate quantities. Modifying your code to pre-allocate those matrices may help a lot. One easy improvement is to broadcast the first line in your loop to avoid allocating a matrix for (sparseR + reshape(q' * sparseS, 199, 199)) and then another one for 0.5 * 0.05 * (sparseR + reshape(q' * sparseS, 199, 199)):

tmp = 0.5 .* 0.05 .* (sparseR .+ reshape(q' * sparseS, 199, 199))

#7

Whenever you benchmark any Julia code against MATLAB’s, you should use the same libraries for fair comparison. MATLAB uses MKL which is available within Juliapro. I timed your code without any modifications in Juliapro 6.1 against Julia 6.1 and it went down from 5.6 sec to 2.738225 seconds (41.60 k allocations: 3.787 GiB, 13.08% gc time). So, without a doubt, if you further follow some simple performance tips, it will beat MATLAB’s version.


#8

Hi @rdeits, thank you so much for your detailed suggestions! I have quoted my codes and made it more readable now. Also I have updated the timings. I noticed that when I change vector q from 799 X 1 Array{Float64,2} to 799-element Array{Float64,1}, the speed is much much faster.

The updated timings are:

For the same codes, Matlab takes Elapsed time is 4.509614 seconds.
Julia shows 6.867342 seconds (3.00 k allocations: 4.283 GB, 15.98% gc time)

As you have suggested, I should pre-allocate those matrices to avoid memory allocations. You mentioned One easy improvement is to broadcast the first line in your loop to avoid allocating a matrix for (sparseR + reshape(q’ * sparseS, 199, 199)) and then another one for 0.5 * 0.05 * (sparseR + reshape(q’ * sparseS, 199, 199)), what do you mean here by “broadcast the first line in your loop”? Thanks again for your kind help!!


#9

Hi @Seif_Shebl, thanks a lot for your suggestions! I actually plan to run these Julia codes in Cluster, so it may make more sense to run these directly rather than using Juliapro. I’m actually quite happy with Julia’s speed now, but the only issue is too much memory allocations. @rdeits gave me some suggestions about this, but I’m not sure how to do that. I know pre-allocation is a good idea, but should I declare a tmp variable before the for-loop instead of using it directly during the for loop?


#10

Cluster support should definitely be able to compile a version of Julia which uses MKL which will make it much faster in this case. It’s not something that requires JuliaPro. You should just ask them to follow these instructions when building Julia:


#11

The difference between q being a 799x1 matrix vs. a 799 element vector is probably due to the fact that the transpose operator ' works differently on matrices and vectors in Julia v0.6. Transposing a vector (an Array{T, 1} for some type T) creates a RowVector, which just stores a pointer to the original data, but transposing a matrix (an Array{T, 2}) creates a new copy of that matrix. For example:

julia> q_vector = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> q_matrix = reshape(q_vector, 3, 1)
3×1 Array{Int64,2}:
 1
 2
 3

julia> q_vector'
1×3 RowVector{Int64,Array{Int64,1}}:
 1  2  3

julia> q_matrix'
1×3 Array{Int64,2}:
 1  2  3

However, please note that this situation will improve in the upcoming Julia v0.7 and I would expect you to get good performance with 799x1 in the future.

As for broadcasting, I mean using the .-calls like f.(x) or x .+ y to avoid allocating lots of temporary arrays. You can learn more about why this is a good idea here: https://julialang.org/blog/2017/01/moredots


#12

Hi @rdeits, thanks for your explanations! I just realized that in my codes, I have something like reshape(q’ * sparseS, 199, 199) and it will transpose a matrix, so it will create a new copy of that matrix in each loop. Maybe this is the reason I need to allocate a lot of memory? Are there some good ideas to avoid this issue?

For the broadcast part, thanks for sharing this! I’m not sure whether it will solve the allocation problem but I will try.


#13

It’s unclear to me why you have the for loop inside testFun. Is it for timing purposes? It just appears to do the same calculation over and over. If it’s for timing, please consider removing the loop, and use @btime from BenchmarkTools.

It’s hard to suggest performance improvements if we don’t know which parts of the code can be modified. I mean, I could just suggest the following improvement:

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    for time in 1:1 
        tmp = 0.5 * 0.05 * (sparseR + reshape(q' * sparseS, numGrids, numGrids))
        rhs = (sparseM + tmp) * q
        forwardQ = (sparseM - tmp) \ rhs
    end
end

Which returns the same answer – nothing.

Or, in fact, you can replace it with this, which also does nothing at all:

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    return nothing
end

What are the necessary minimum things that your function needs to accomplish? Is it to calculate forwardQ exactly once?


#14

Hi @DNF, sorry for the confusion! Inside the testFun, it’s actually iterative algorithm. So the codes should be

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    for time = 1:numIters 
        tmp = 0.5 * 0.05 * (sparseR + reshape(q' * sparseS, numGrids, numGrids));
        rhs = (sparseM + tmp) * q;
        forwardQ = (sparseM - tmp) \ rhs;
        q = forwardQ;
    end
    return q;
end

Now the main issue is memory allocation 6.867342 seconds (3.00 k allocations: 4.283 GB, 15.98% gc time)

Do you have any idea about how to reduce the memory allocation problem? Thanks!


#15

If you pre-allocate as per @rdeits comment, you can reduce memory allocations substantially:

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    tmp = Array{Float64}(numGrids, numGrids);
    for time = 1:numIters 
        tmp .= 0.5 .* 0.05 .* (sparseR .+ reshape(q'*sparseS, numGrids, numGrids));
        rhs = (sparseM + tmp) * q;
        forwardQ = (sparseM - tmp) \ rhs;
    end
end

# Here is how I'm timing the codes.
@time testFun(sparseM,sparseR,sparseS,q,100,799)

4.076009 seconds (2.20 k allocations: 2.861 GiB, 5.76% gc time)


#16

Hey @ChrisRackauckas, how can I build Julia with MKL using the Julia command in the final gotcha in your 7 gotchas?


#17

You have to re-build Julia to use it with MKL. See https://github.com/JuliaLang/julia#intel-compilers-and-math-kernel-library-mkl


#18

You’ll want to keep going though. Pre-allocate a tmp for sparseM + tmp (or update tmp inplace for that), and also use the same cache for sparseM - tmp. Then, instead of *, use At_mul_B! for q'*sparseS, A_mul_B! for (...) * q, and then choose a factorization method and do an inplace factorization and A_ldiv_B! for (...) \ rhs. That should get rid of almost all allocations.


#19

Hi @Seif_Shebl, thank you very much for your instructions. Is this some new feature in Julia 0.6? I’m using Julia 0.5 and when I run this codes, it didn’t reduce the memory allocations…


#20

Hi @ChrisRackauckas, thank you very much for these instructions. I didn’t know these at all before. By your suggestions, I modified the codes as follows:

function testFun(sparseM,sparseR,sparseS,q,numIters,numGrids)
    longVector = Array{Float64}(1,numGrids*numGrids);
    forwardQ = Array{Float64}(numGrids,1);
    tmp = Array{Float64}(numGrids, numGrids);
    tmp2 = Array{Float64}(numGrids, numGrids);
    tmp3 = Array{Float64}(numGrids, numGrids);
    for time = 1:numIters 
        At_mul_B!(longVector,q,sparseS);
        tmp .= 0.5 .* 0.05 .* (sparseR .+ reshape(longVector, numGrids, numGrids));
        tmp2 .= sparseM .+ tmp;
        tmp3 .= sparseM .- tmp;
        A_mul_B!(forwardQ,tmp2,q);
        A_ldiv_B!(lufact!(tmp3),forwardQ);
    end
end

But I became very very slow (it will take about 14 minutes to finish @time testFun(sparseM,sparseR,sparseS,q,100,799)) Did I make some stupid mistakes here?