Making efficient some small subroutines

Hi, I am trying to convert some Python+numpy subroutines to Julia. These subroutines are most important part of my main progam and take much time. In the current version these codes are run faster in non-optimized Python than Julia.
I have tried to improve the Julia codes to the best of my capabilities. @code_warntype does not show a type instability. For the sake of minimal working example, I have simplified the codes as much as possible and gave them in below. I will appreciate if someone can help on improving these codes.

using SparseArrays
using Printf
using BenchmarkTools
using LoopVectorization
using NearestNeighbors
using LinearAlgebra


function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}

    phi = similar(r)
    @avx for i in eachindex(r)
        phi[i] =  r[i]^od*γ + exp(-(r[i]*e)^2)
    end
    return phi
end

function D(r::Float64,rx::Float64,n::Int64,γ::Float64,e::Float64)

    dp = zero(r);
    if n == 1
        dp = 7.0*γ*rx*r^5 - 2.0*e^2*rx*exp(-e^2*(r^2))#
    elseif n == 2
        dp = 4.0*(e^2)^2*rx^2 *exp(-e^2*(r^2)) - 2.0*e^2*exp(-e^2*(r^2)) + 7.0*γ*r^5 + (35.0*γ*rx^2)*r^3
    else
        println("enough")
    end

    return dp
end

function H(i::Int64,n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2},γ::Float64,sh::Float64)::NTuple{2,Array{Float64,2}}
    h1,h2 = zeros(size(p)),zeros(size(p))

    @inbounds for j in 1:n
        a = (X[i]-X[p[j]]); b=(Y[i]-Y[p[j]]); c=sqrt(a^2 + b^2);
        h1[j] = D(c, a, 2,γ,sh)
        h2[j] = D(c, b, 2,γ,sh)
    end
    return h1,h2
end

function R(n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2})::Array{Float64,2}
    r =zeros(n,n)
    @inbounds for j in 1:n, k in 1:n
            r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
    end
    return r
end

function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
    N = length(X)
    D = spzeros(N,N)
    p = zeros(Int,n,1);
    for i in ip
        @inbounds for j in 1:n
            p[j] = st[i,j];
        end
        r = R(n,X,Y,p)
        h1,h2 = H(i,n,X,Y,p,γ,sh)
        B =  K(r,od,γ,sh)
        Dl =  h1+h2;
        D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

    end
    return D
end


const x1, x2, y1, y2 ,sh, od, γ, N, n = -64, 64, -64, 64, 1.0, 7, 2.0, 200, 50
a,b = range(x1,stop=x2,length=N),range(y1,stop=y2,length=N);

x,y = repeat(a,1,length(a)),repeat(b',length(b),1);
PT = reduce(vcat, [x y] for y in b for x in a) #

ip = filter(i -> PT[i,1]>x1 && PT[i,1]<x2 && PT[i,2]>y1 && PT[i,2]<y2, axes(PT, 1));

X,Y =  PT[:,1], PT[:,2]; data = [X';Y'];

kd = KDTree(data)
SS, _ = knn(kd, data, n, true);
SS1 = Int.(vcat(SS'...));

Here are timings:

@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial: 
  memory estimate:  1.55 GiB
  allocs estimate:  392087
  --------------
  minimum time:     16.661 s (0.18% GC)
  median time:      16.661 s (0.18% GC)
  mean time:        16.661 s (0.18% GC)
  maximum time:     16.661 s (0.18% GC)


  --------------
  samples:          1
  evals/sample:     1

As I said previously, the Python codes are faster and if I increase N and n than the speed of Python become more dominant than speed of Julia.

Can you profile it to see where it spends its time? https://docs.julialang.org/en/v1/manual/profile/#Profiling-1

Note that there’s no need to provide exact types like this for functions, it won’t change what gets compiled at all. You can write ::Vector on a few to remind yourself, and of course if you have several methods of H and want to dispatch to the right one.

Also, all these small functions create & return arrays. If they are what’s expensive, then you could make these arrays before the for i in ip loop, and re-use them.

1 Like

Thank you for suggestion on types. The output of @profile is very long, I could not extract a useful tip from the output. And the site does not allow to paste the output here.

then you could make these arrays before the for i in ip loop, and re-use them.

Actually, i’m not sure how to do this and will it bring performance?

I mean you replace this:

function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}
    phi = similar(r)

with this:

function K!(phi:: Martrix, r::Martrix, od, γ, e)

and define phi or rather B just once. Then it gets filled with new numbers on every loop.

2 Likes

I’m not a mathematician and I don’t really understand what your code does. But the simplest research shows that the whole problem is in this line: D[i,p] = ldiv!(lu!(B), Dl) . If you comment it, the execution time is reduced by 20 times

Directly. Because memory allocation is a rather expensive operation. But in your case it seems that this will not give a win noticeable against the background of the execution time of the line from my previous comment

D[i,p] = ldiv!(lu!(B), Dl)

this line is crucial, if I use D[i,p] =B\D1 . I got following timings. I do not think of another alternative about this line. If you have suggestions I am opened to them.


@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial: 
  memory estimate:  2.30 GiB
  allocs estimate:  509699
  --------------
  minimum time:     17.367 s (0.28% GC)
  median time:      17.367 s (0.28% GC)
  mean time:        17.367 s (0.28% GC)
  maximum time:     17.367 s (0.28% GC)
  --------------
  samples:          1
  evals/sample:     1

Oh right, commenting that out is a great idea.

MKL can be quite a bit faster on factorization things, are you using MKL.jl? On my computer (not sure if these are realistic sizes):

julia> B = rand(100,100); D = rand(100,100);

julia> @btime ldiv!(lu!($B), $D);
  175.970 μs (1 allocation: 896 bytes)

julia> 0.000175 * length(ip) # not quite 16s
6.8607

julia> BLAS.vendor()
:openblas64

vs:

julia> @btime ldiv!(lu!($B), $D);
  89.549 μs (2 allocations: 928 bytes)

julia> BLAS.vendor()
:mkl

Once again, I am not a mathematician, the last time I encountered linear algebra at University was 18 years ago. Therefore, I can hardly offer a solution in this case. But I have a lot of experience in optimization and finding performance bottlenecks and experience shows that commenting out lines is one of the most effective methods :slight_smile:
Perhaps the solution from @mcabbott will help you

1 Like

Thank you for your answer. Actually, in full version of my program, I have to solve several linear systems for each i.

I am using mkl.

julia> BLAS.vendor()
:mkl

And I as it turned out used openblas64, and this distorted the picture. After enabling mkl, the task became more interesting.
With mkl, this ldiv!(lu!(B), Dl) calculation is no longer a problem. But the problem remains the D[i,p] = ... assignment itself

I can offer you this solution:

........
function pre_make_sparse(ip, st, n)
    size = length(ip) * n
    I = zeros(Int, size)
    J = zeros(Int, size)
    V = zeros(Float64, size)

    p = zeros(Int,n,1);
    cursor = 1
    for i in ip
        @inbounds for j in 1:n
            I[cursor] = i
            J[cursor] = st[i, j]
            cursor += 1
        end
    end
    return sparse(I,J,V)
    
end
function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
    N = length(X)
    #D = spzeros(N,N)
    D = pre_make_sparse(ip, st, n)
    p = zeros(Int,n,1);
    for i in ip
        @inbounds for j in 1:n
            p[j] = st[i,j];
        end
        r = R(n,X,Y,p)
        h1,h2 = H(i,n,X,Y,p,γ,sh)
        B =  K(r,od,γ,sh)
        Dl =  h1+h2;
        D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

    end
    return D
end
.........
@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial:
  memory estimate:  1.65 GiB
  allocs estimate:  392069
  --------------
  minimum time:     2.539 s (5.53% GC)
  median time:      2.644 s (7.92% GC)
  mean time:        2.644 s (7.92% GC)
  maximum time:     2.749 s (10.12% GC)
  --------------
  samples:          2
  evals/sample:     1

From the documentation:

So I just create the sparse matrix D immediately with existing (and equal to 0) elements that will be used in the loop. It is good that the indexes of these elements are known in advance. I hope that in the full version this is also true.

1 Like

Another option, perhaps more universal , is not to create a matrix immediately, but to fill the same arrays I, J, V in the main calculation loop and create a sparse matrix from them after the loop. Just in case I, J, V are arrays such that D[I[k], J[k]] = V[k]

1 Like

How can I obtain exactly same D as in my main() by using your main() function? Because their size are different.

Did you mean the sizes that are n and m? If so, just add N variable from main function as argument to pre_make_sparse and replace return sparse(I,J,V) with return sparse(I,J,V, N, N)

1 Like

@waralex’s suggestion to precompute I and J will significantly reduce the number of allocations in main().

r and B appear to by symmetric in your example. You could try changing R(n,X,Y,p) to something like:

"""

    symmetrize!(a)

Making a matrix symmetric, in-place.

Code by Petr Krysl.
https://discourse.julialang.org/t/in-place-operations-on-matrices/30908/4

"""

symmetrize!(a) = begin
    @inbounds for c in 1:size(a, 2)
    	for r in c:size(a, 1)
    		a[r, c] += a[c, r]
    		a[c, r] =  a[r, c]
    	end
    end
    a
end

function R(n::Int64, X::Vector, Y::Vector, p::Matrix)::Matrix{Float64}
    r = zeros(n,n)
    @inbounds for j in 1:(n-1), k in (j+1):n
            r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
    end
    symmetrize!(r)
end

You could try using the LAPACK functions, like sysv!.

2 Likes

Adding the size arguments in sparse should be enough in pre_make_sparse.

If there’s a risk of having duplicate indices, choosing the combine function may be important when I,J,V are built incrementally in the main calculation.
So maybe return sparse(I,J,V,N,N,combine)?

In the case when it is simply pre-created with elements equal to 0, combine is not needed, I think

I haven’t worked with SparseMatrixCSC before. But I have a feeling that the problem is not so much in the allocations, but in the numerous copying of internal vectors that occurs with random-access inserts.

If I understand the documentation correctly, then the algorithmic complexity of inserting an element that has not yet been in the matrix should be O(N) where N is the number of elements that actually exist in the matrix

1 Like

@waralex Thank you very much for your effort. Your approach greatly increase speed of minimal working example. I hope, I can apply this approach properly to my full program. Thank you again.

Thank you, I will try this.