Adding to sparse matrix efficiently and IterativeSolvers.gmres speed vs MATLAB

I am trying to convert one of my MATLAB scripts to Julia. I can get it to the same speed but not faster at the moment.
I know where the bottlenecks are, filling my sparse matrix with new entries and zeroing this in a loop. Example below:

using LinearAlgebra
using SparseArrays

#Init vars for loop  -------------->
N=300;
#Sparse Matrix (to be filled)
J = spzeros(N,N); #zeros 

#Matrix of computed values
A=rand(N,N);

#Precomputed vectors
y=rand(N);
x=ones(N);
dx=zeros(N);
zers=zeros(N);

I=zeros(Int64, N)
S=copy(I);
I=convert(Array{Bool,1},I) #convert to bool

#From now on inside loop -------------->

#Values of y change... 

#Compute a flag of parts to fill
S.= y.>0.5;  
I.= (S.==false);                     		
Indx=findall(I)

#Function I would like to speed up
@time WorkOnEntries(J,A,x,y,Indx)
function WorkOnEntries(J,A,x,y,Indx)

	for i=eachindex(Indx)
		local idxi=Indx[i];
		cons=(((y[idxi]^2) +(x[idxi]^2))^0.5);
		p=(x[idxi]/cons)-1.0;
		q=(y[idxi]/cons)-1.0;
		for j=eachindex(Indx)
			local idxj=Indx[j];
			J[idxi,idxj]=A[idxi,idxj]*q; #can use J (or A)[CartesianIndex(idxi,idxi)]
		end
		J[idxi,idxi] +=p; #can use J[CartesianIndex(idxi,idxi)]
	end

end

It is the function (WorkOnEntries.jl) that takes the time. Is there a simple way to speed this up and remove the allocations?
Secondly I then pass the matrix ‘J’ into IterativeSolvers.gmres!, i.e.

#Then passed to gmres 
restart=30;
using IterativeSolvers
@time IterativeSolvers.gmres!(dx,J,y,initially_zero=true,restart=restart,tol=1e-6);

If I compare the result to MATLAB (code at bottom of file) the solver is sometimes slower (esp for the close to dense matrices I have). Secondly, should the output vector of IterativeSolvers.gmres match MATLAB’s gmres?

N=300:
Julia: 0.013295
MATLAB: 0.116265

N=1000:
Julia: 0.424527
MATLAB: 0.274945

Is there a better way to do this?

MATLAB code I used for comparison:

close;clear all

%Init vars for loop  -------------->
N=300;
%Sparse Matrix (to be filled)
J = zeros(N,N); %zeros 

%Matrix of computed values
A=rand(N,N);

%Precomputed vectors
y=rand(N,1);
x=ones(N,1);
dx=zeros(N,1);
zers=zeros(N,1);

%From now on inside loop -------------->

%Values of y change... 

%Compute a flag of parts to fill
S= y>0.5;  
I       = find(S==0);    
p       = (x(I)./((y(I).^2+x(I).^2).^0.5))-1; 
q       = (y(I)./((y(I).^2+x(I).^2).^0.5))-1; 
J(I,I) = diag(kron(p,1)) + bsxfun(@times,(q),A(I,I));  
J=sparse(J);             

restart=30;
tic
[dx(I), ~] = gmres     ( J(I,I), y(I),restart);
toc

Supposedly faster version, didn’t benchmark. Notice that I flipped the 2 loops to make it more cache friendly, so you should pass copy(A') in instead of A, or just work with the transpose all the time.

using LinearAlgebra
using SparseArrays

N = 300;
A = rand(N,N);
y = rand(N);
x = ones(N);
dx = zeros(N);
zers = zeros(N);

S = y .> 0.5;  
I = S .== false;                     		
Indx = findall(I)

function WorkOnEntries(A, x, y, Indx)
	n = length(Indx) * (length(Indx) + 1)
	I = Int[]
	J = Int[]
	V = eltype(A)[]
	sizehint!(I, n)
	sizehint!(J, n)
	sizehint!(V, n)

	for idxj in Indx
		cons = sqrt(y[idxj]^2 + x[idxj]^2)
		p = x[idxj] / cons - 1.0;
		q = y[idxj] / cons - 1.0;
		for idxi in Indx
			push!(I, idxi)
			push!(J, idxj)
			push!(V, A[idxi, idxj] * q)
		end
		push!(I, idxj)
		push!(J, idxj)
		push!(V, p)
	end

	return sparse(I, J, V)
end

Perhaps you are timing incorrectly, you should exclude the first run when timing a Julia code, this includes compiling time. That said, here are the results on my machine which indicate that Julia is actually 19X faster. Further optimizing the Julia code can bring in an order of magnitude speedup.

  Elapsed time is 0.145361 seconds.   MATLAB
  7.652 ms (0 allocations: 0 bytes)   Julia

This is how to benchmark:


function WorkOnEntries(J,A,x,y,Indx)

    for i=eachindex(Indx)
        local idxi=Indx[i];
        cons=(((y[idxi]^2) +(x[idxi]^2))^0.5);
        p=(x[idxi]/cons)-1.0;
        q=(y[idxi]/cons)-1.0;
        for j=eachindex(Indx)
            local idxj=Indx[j];
            J[idxi,idxj]=A[idxi,idxj]*q; #can use J (or A)[CartesianIndex(idxi,idxi)]
        end
        J[idxi,idxi] +=p; #can use J[CartesianIndex(idxi,idxi)]
    end

end

using LinearAlgebra, SparseArrays, BenchmarkTools

function test()
    #Init vars for loop  -------------->
    N = 1000;
    #Sparse Matrix (to be filled)
    J = spzeros(N,N); #zeros 

    #Matrix of computed values
    A=rand(N,N);

    #Precomputed vectors
    y=rand(N);
    x=ones(N);
    dx=zeros(N);
    zers=zeros(N);

    I=zeros(Int64, N)
    S=copy(I);
    I=convert(Array{Bool,1},I) #convert to bool

    #From now on inside loop -------------->
    #Values of y change... 
    #Compute a flag of parts to fill
    S.= y.>0.5;  
    I.= (S.==false);                            
    Indx=findall(I)

    #Function I would like to speed up
    @btime WorkOnEntries($J,$A,$x,$y,$Indx)
end 

test()

Great, I will try your solution Mohammed. It looks good. I am only concerned with Julia vs matlab in regards to the speed of gmres. Not the whole block of code I posted.
Tim