# Making efficient some small subroutines

Hi, I am trying to convert some Python+numpy subroutines to Julia. These subroutines are most important part of my main progam and take much time. In the current version these codes are run faster in non-optimized Python than Julia.
I have tried to improve the Julia codes to the best of my capabilities. `@code_warntype` does not show a type instability. For the sake of minimal working example, I have simplified the codes as much as possible and gave them in below. I will appreciate if someone can help on improving these codes.

``````using SparseArrays
using Printf
using BenchmarkTools
using LoopVectorization
using NearestNeighbors
using LinearAlgebra

function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}

phi = similar(r)
@avx for i in eachindex(r)
phi[i] =  r[i]^od*γ + exp(-(r[i]*e)^2)
end
return phi
end

function D(r::Float64,rx::Float64,n::Int64,γ::Float64,e::Float64)

dp = zero(r);
if n == 1
dp = 7.0*γ*rx*r^5 - 2.0*e^2*rx*exp(-e^2*(r^2))#
elseif n == 2
dp = 4.0*(e^2)^2*rx^2 *exp(-e^2*(r^2)) - 2.0*e^2*exp(-e^2*(r^2)) + 7.0*γ*r^5 + (35.0*γ*rx^2)*r^3
else
println("enough")
end

return dp
end

function H(i::Int64,n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2},γ::Float64,sh::Float64)::NTuple{2,Array{Float64,2}}
h1,h2 = zeros(size(p)),zeros(size(p))

@inbounds for j in 1:n
a = (X[i]-X[p[j]]); b=(Y[i]-Y[p[j]]); c=sqrt(a^2 + b^2);
h1[j] = D(c, a, 2,γ,sh)
h2[j] = D(c, b, 2,γ,sh)
end
return h1,h2
end

function R(n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2})::Array{Float64,2}
r =zeros(n,n)
@inbounds for j in 1:n, k in 1:n
r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
end
return r
end

function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
N = length(X)
D = spzeros(N,N)
p = zeros(Int,n,1);
for i in ip
@inbounds for j in 1:n
p[j] = st[i,j];
end
r = R(n,X,Y,p)
h1,h2 = H(i,n,X,Y,p,γ,sh)
B =  K(r,od,γ,sh)
Dl =  h1+h2;
D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

end
return D
end

const x1, x2, y1, y2 ,sh, od, γ, N, n = -64, 64, -64, 64, 1.0, 7, 2.0, 200, 50
a,b = range(x1,stop=x2,length=N),range(y1,stop=y2,length=N);

x,y = repeat(a,1,length(a)),repeat(b',length(b),1);
PT = reduce(vcat, [x y] for y in b for x in a) #

ip = filter(i -> PT[i,1]>x1 && PT[i,1]<x2 && PT[i,2]>y1 && PT[i,2]<y2, axes(PT, 1));

X,Y =  PT[:,1], PT[:,2]; data = [X';Y'];

kd = KDTree(data)
SS, _ = knn(kd, data, n, true);
SS1 = Int.(vcat(SS'...));
``````

Here are timings:

``````@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial:
memory estimate:  1.55 GiB
allocs estimate:  392087
--------------
minimum time:     16.661 s (0.18% GC)
median time:      16.661 s (0.18% GC)
mean time:        16.661 s (0.18% GC)
maximum time:     16.661 s (0.18% GC)

--------------
samples:          1
evals/sample:     1
``````

As I said previously, the Python codes are faster and if I increase `N` and `n` than the speed of Python become more dominant than speed of Julia.

Can you profile it to see where it spends its time? https://docs.julialang.org/en/v1/manual/profile/#Profiling-1

Note that there’s no need to provide exact types like this for functions, it won’t change what gets compiled at all. You can write `::Vector` on a few to remind yourself, and of course if you have several methods of `H` and want to dispatch to the right one.

Also, all these small functions create & return arrays. If they are what’s expensive, then you could make these arrays before the `for i in ip` loop, and re-use them.

1 Like

Thank you for suggestion on types. The output of @profile is very long, I could not extract a useful tip from the output. And the site does not allow to paste the output here.

then you could make these arrays before the `for i in ip` loop, and re-use them.

Actually, i’m not sure how to do this and will it bring performance?

I mean you replace this:

``````function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}
phi = similar(r)
``````

with this:

``````function K!(phi:: Martrix, r::Martrix, od, γ, e)
``````

and define `phi` or rather `B` just once. Then it gets filled with new numbers on every loop.

2 Likes

I’m not a mathematician and I don’t really understand what your code does. But the simplest research shows that the whole problem is in this line: `D[i,p] = ldiv!(lu!(B), Dl)` . If you comment it, the execution time is reduced by 20 times

Directly. Because memory allocation is a rather expensive operation. But in your case it seems that this will not give a win noticeable against the background of the execution time of the line from my previous comment

`D[i,p] = ldiv!(lu!(B), Dl)`

this line is crucial, if I use `D[i,p] =B\D1` . I got following timings. I do not think of another alternative about this line. If you have suggestions I am opened to them.

``````
@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial:
memory estimate:  2.30 GiB
allocs estimate:  509699
--------------
minimum time:     17.367 s (0.28% GC)
median time:      17.367 s (0.28% GC)
mean time:        17.367 s (0.28% GC)
maximum time:     17.367 s (0.28% GC)
--------------
samples:          1
evals/sample:     1
``````

Oh right, commenting that out is a great idea.

MKL can be quite a bit faster on factorization things, are you using MKL.jl? On my computer (not sure if these are realistic sizes):

``````julia> B = rand(100,100); D = rand(100,100);

julia> @btime ldiv!(lu!(\$B), \$D);
175.970 μs (1 allocation: 896 bytes)

julia> 0.000175 * length(ip) # not quite 16s
6.8607

julia> BLAS.vendor()
:openblas64
``````

vs:

``````julia> @btime ldiv!(lu!(\$B), \$D);
89.549 μs (2 allocations: 928 bytes)

julia> BLAS.vendor()
:mkl
``````

Once again, I am not a mathematician, the last time I encountered linear algebra at University was 18 years ago. Therefore, I can hardly offer a solution in this case. But I have a lot of experience in optimization and finding performance bottlenecks and experience shows that commenting out lines is one of the most effective methods 1 Like

Thank you for your answer. Actually, in full version of my program, I have to solve several linear systems for each `i`.

I am using mkl.

``````julia> BLAS.vendor()
:mkl
``````

And I as it turned out used `openblas64`, and this distorted the picture. After enabling mkl, the task became more interesting.
With mkl, this `ldiv!(lu!(B), Dl)` calculation is no longer a problem. But the problem remains the `D[i,p] = ...` assignment itself

I can offer you this solution:

``````........
function pre_make_sparse(ip, st, n)
size = length(ip) * n
I = zeros(Int, size)
J = zeros(Int, size)
V = zeros(Float64, size)

p = zeros(Int,n,1);
cursor = 1
for i in ip
@inbounds for j in 1:n
I[cursor] = i
J[cursor] = st[i, j]
cursor += 1
end
end
return sparse(I,J,V)

end
function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
N = length(X)
#D = spzeros(N,N)
D = pre_make_sparse(ip, st, n)
p = zeros(Int,n,1);
for i in ip
@inbounds for j in 1:n
p[j] = st[i,j];
end
r = R(n,X,Y,p)
h1,h2 = H(i,n,X,Y,p,γ,sh)
B =  K(r,od,γ,sh)
Dl =  h1+h2;
D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

end
return D
end
.........
@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
``````
``````BenchmarkTools.Trial:
memory estimate:  1.65 GiB
allocs estimate:  392069
--------------
minimum time:     2.539 s (5.53% GC)
median time:      2.644 s (7.92% GC)
mean time:        2.644 s (7.92% GC)
maximum time:     2.749 s (10.12% GC)
--------------
samples:          2
evals/sample:     1
``````

From the documentation:

So I just create the sparse matrix `D` immediately with existing (and equal to 0) elements that will be used in the loop. It is good that the indexes of these elements are known in advance. I hope that in the full version this is also true.

1 Like

Another option, perhaps more universal , is not to create a matrix immediately, but to fill the same arrays I, J, V in the main calculation loop and create a sparse matrix from them after the loop. Just in case I, J, V are arrays such that `D[I[k], J[k]] = V[k]`

1 Like

How can I obtain exactly same `D` as in my `main()` by using your `main()` function? Because their size are different.

Did you mean the sizes that are n and m? If so, just add N variable from main function as argument to pre_make_sparse and replace `return sparse(I,J,V)` with `return sparse(I,J,V, N, N)`

1 Like

@waralex’s suggestion to precompute `I` and `J` will significantly reduce the number of allocations in `main()`.

`r` and `B` appear to by symmetric in your example. You could try changing `R(n,X,Y,p)` to something like:

``````"""

symmetrize!(a)

Making a matrix symmetric, in-place.

Code by Petr Krysl.
https://discourse.julialang.org/t/in-place-operations-on-matrices/30908/4

"""

symmetrize!(a) = begin
@inbounds for c in 1:size(a, 2)
for r in c:size(a, 1)
a[r, c] += a[c, r]
a[c, r] =  a[r, c]
end
end
a
end

function R(n::Int64, X::Vector, Y::Vector, p::Matrix)::Matrix{Float64}
r = zeros(n,n)
@inbounds for j in 1:(n-1), k in (j+1):n
r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
end
symmetrize!(r)
end
``````

You could try using the LAPACK functions, like sysv!.

2 Likes

Adding the size arguments in `sparse` should be enough in `pre_make_sparse`.

If there’s a risk of having duplicate indices, choosing the `combine` function may be important when `I`,`J`,`V` are built incrementally in the main calculation.
So maybe return `sparse(I,J,V,N,N,combine)`?

In the case when it is simply pre-created with elements equal to 0, `combine` is not needed, I think

I haven’t worked with SparseMatrixCSC before. But I have a feeling that the problem is not so much in the allocations, but in the numerous copying of internal vectors that occurs with random-access inserts.

If I understand the documentation correctly, then the algorithmic complexity of inserting an element that has not yet been in the matrix should be O(N) where N is the number of elements that actually exist in the matrix

1 Like

@waralex Thank you very much for your effort. Your approach greatly increase speed of minimal working example. I hope, I can apply this approach properly to my full program. Thank you again.

Thank you, I will try this.