Making efficient some small subroutines

O_Julia · May 6, 2020, 10:41pm

Hi, I am trying to convert some Python+numpy subroutines to Julia. These subroutines are most important part of my main progam and take much time. In the current version these codes are run faster in non-optimized Python than Julia.
I have tried to improve the Julia codes to the best of my capabilities. @code_warntype does not show a type instability. For the sake of minimal working example, I have simplified the codes as much as possible and gave them in below. I will appreciate if someone can help on improving these codes.

using SparseArrays
using Printf
using BenchmarkTools
using LoopVectorization
using NearestNeighbors
using LinearAlgebra


function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}

    phi = similar(r)
    @avx for i in eachindex(r)
        phi[i] =  r[i]^od*γ + exp(-(r[i]*e)^2)
    end
    return phi
end

function D(r::Float64,rx::Float64,n::Int64,γ::Float64,e::Float64)

    dp = zero(r);
    if n == 1
        dp = 7.0*γ*rx*r^5 - 2.0*e^2*rx*exp(-e^2*(r^2))#
    elseif n == 2
        dp = 4.0*(e^2)^2*rx^2 *exp(-e^2*(r^2)) - 2.0*e^2*exp(-e^2*(r^2)) + 7.0*γ*r^5 + (35.0*γ*rx^2)*r^3
    else
        println("enough")
    end

    return dp
end

function H(i::Int64,n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2},γ::Float64,sh::Float64)::NTuple{2,Array{Float64,2}}
    h1,h2 = zeros(size(p)),zeros(size(p))

    @inbounds for j in 1:n
        a = (X[i]-X[p[j]]); b=(Y[i]-Y[p[j]]); c=sqrt(a^2 + b^2);
        h1[j] = D(c, a, 2,γ,sh)
        h2[j] = D(c, b, 2,γ,sh)
    end
    return h1,h2
end

function R(n::Int64,X::Array{Float64,1},Y::Array{Float64,1},p::Array{Int64,2})::Array{Float64,2}
    r =zeros(n,n)
    @inbounds for j in 1:n, k in 1:n
            r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
    end
    return r
end

function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
    N = length(X)
    D = spzeros(N,N)
    p = zeros(Int,n,1);
    for i in ip
        @inbounds for j in 1:n
            p[j] = st[i,j];
        end
        r = R(n,X,Y,p)
        h1,h2 = H(i,n,X,Y,p,γ,sh)
        B =  K(r,od,γ,sh)
        Dl =  h1+h2;
        D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

    end
    return D
end


const x1, x2, y1, y2 ,sh, od, γ, N, n = -64, 64, -64, 64, 1.0, 7, 2.0, 200, 50
a,b = range(x1,stop=x2,length=N),range(y1,stop=y2,length=N);

x,y = repeat(a,1,length(a)),repeat(b',length(b),1);
PT = reduce(vcat, [x y] for y in b for x in a) #

ip = filter(i -> PT[i,1]>x1 && PT[i,1]<x2 && PT[i,2]>y1 && PT[i,2]<y2, axes(PT, 1));

X,Y =  PT[:,1], PT[:,2]; data = [X';Y'];

kd = KDTree(data)
SS, _ = knn(kd, data, n, true);
SS1 = Int.(vcat(SS'...));

Here are timings:

@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial: 
  memory estimate:  1.55 GiB
  allocs estimate:  392087
  --------------
  minimum time:     16.661 s (0.18% GC)
  median time:      16.661 s (0.18% GC)
  mean time:        16.661 s (0.18% GC)
  maximum time:     16.661 s (0.18% GC)


  --------------
  samples:          1
  evals/sample:     1

As I said previously, the Python codes are faster and if I increase N and n than the speed of Python become more dominant than speed of Julia.

mcabbott · May 6, 2020, 11:07pm

Can you profile it to see where it spends its time? Profiling · The Julia Language

Note that there’s no need to provide exact types like this for functions, it won’t change what gets compiled at all. You can write ::Vector on a few to remind yourself, and of course if you have several methods of H and want to dispatch to the right one.

Also, all these small functions create & return arrays. If they are what’s expensive, then you could make these arrays before the for i in ip loop, and re-use them.

O_Julia · May 6, 2020, 11:25pm

Thank you for suggestion on types. The output of @profile is very long, I could not extract a useful tip from the output. And the site does not allow to paste the output here.

then you could make these arrays before the for i in ip loop, and re-use them.

Actually, i’m not sure how to do this and will it bring performance?

mcabbott · May 6, 2020, 11:32pm

I mean you replace this:

function K(r::Matrix,od::Int64,γ::Float64,e::Float64)::Array{Float64,2}
    phi = similar(r)

with this:

function K!(phi:: Martrix, r::Martrix, od, γ, e)

and define phi or rather B just once. Then it gets filled with new numbers on every loop.

waralex · May 6, 2020, 11:35pm

I’m not a mathematician and I don’t really understand what your code does. But the simplest research shows that the whole problem is in this line: D[i,p] = ldiv!(lu!(B), Dl) . If you comment it, the execution time is reduced by 20 times

waralex · May 6, 2020, 11:39pm

Directly. Because memory allocation is a rather expensive operation. But in your case it seems that this will not give a win noticeable against the background of the execution time of the line from my previous comment

O_Julia · May 6, 2020, 11:42pm

D[i,p] = ldiv!(lu!(B), Dl)

this line is crucial, if I use D[i,p] =B\D1 . I got following timings. I do not think of another alternative about this line. If you have suggestions I am opened to them.


@benchmark main(SS1,X,Y,ip,od,γ,sh,n)
BenchmarkTools.Trial: 
  memory estimate:  2.30 GiB
  allocs estimate:  509699
  --------------
  minimum time:     17.367 s (0.28% GC)
  median time:      17.367 s (0.28% GC)
  mean time:        17.367 s (0.28% GC)
  maximum time:     17.367 s (0.28% GC)
  --------------
  samples:          1
  evals/sample:     1

mcabbott · May 6, 2020, 11:52pm

Oh right, commenting that out is a great idea.

MKL can be quite a bit faster on factorization things, are you using MKL.jl? On my computer (not sure if these are realistic sizes):

julia> B = rand(100,100); D = rand(100,100);

julia> @btime ldiv!(lu!($B), $D);
  175.970 μs (1 allocation: 896 bytes)

julia> 0.000175 * length(ip) # not quite 16s
6.8607

julia> BLAS.vendor()
:openblas64

vs:

julia> @btime ldiv!(lu!($B), $D);
  89.549 μs (2 allocations: 928 bytes)

julia> BLAS.vendor()
:mkl

waralex · May 7, 2020, 12:02am

Once again, I am not a mathematician, the last time I encountered linear algebra at University was 18 years ago. Therefore, I can hardly offer a solution in this case. But I have a lot of experience in optimization and finding performance bottlenecks and experience shows that commenting out lines is one of the most effective methods
Perhaps the solution from @mcabbott will help you

O_Julia · May 7, 2020, 12:37am

Thank you for your answer. Actually, in full version of my program, I have to solve several linear systems for each i.

O_Julia · May 7, 2020, 12:38am

I am using mkl.

julia> BLAS.vendor()
:mkl

waralex · May 7, 2020, 5:56pm

And I as it turned out used openblas64, and this distorted the picture. After enabling mkl, the task became more interesting.
With mkl, this ldiv!(lu!(B), Dl) calculation is no longer a problem. But the problem remains the D[i,p] = ... assignment itself

I can offer you this solution:

........
function pre_make_sparse(ip, st, n)
    size = length(ip) * n
    I = zeros(Int, size)
    J = zeros(Int, size)
    V = zeros(Float64, size)

    p = zeros(Int,n,1);
    cursor = 1
    for i in ip
        @inbounds for j in 1:n
            I[cursor] = i
            J[cursor] = st[i, j]
            cursor += 1
        end
    end
    return sparse(I,J,V)
    
end
function main(st::Array{Int64,2},X::Array{Float64,1},Y::Array{Float64,1},ip::Array{Int64,1},od::Int64,γ::Float64,sh::Float64,n::Int64)
    N = length(X)
    #D = spzeros(N,N)
    D = pre_make_sparse(ip, st, n)
    p = zeros(Int,n,1);
    for i in ip
        @inbounds for j in 1:n
            p[j] = st[i,j];
        end
        r = R(n,X,Y,p)
        h1,h2 = H(i,n,X,Y,p,γ,sh)
        B =  K(r,od,γ,sh)
        Dl =  h1+h2;
        D[i,p] = ldiv!(lu!(B), Dl);# ; #instead of (B\Dl)

    end
    return D
end
.........
@benchmark main(SS1,X,Y,ip,od,γ,sh,n)

BenchmarkTools.Trial:
  memory estimate:  1.65 GiB
  allocs estimate:  392069
  --------------
  minimum time:     2.539 s (5.53% GC)
  median time:      2.644 s (7.92% GC)
  mean time:        2.644 s (7.92% GC)
  maximum time:     2.749 s (10.12% GC)
  --------------
  samples:          2
  evals/sample:     1

From the documentation:

So I just create the sparse matrix D immediately with existing (and equal to 0) elements that will be used in the loop. It is good that the indexes of these elements are known in advance. I hope that in the full version this is also true.

waralex · May 7, 2020, 6:07pm

Another option, perhaps more universal , is not to create a matrix immediately, but to fill the same arrays I, J, V in the main calculation loop and create a sparse matrix from them after the loop. Just in case I, J, V are arrays such that D[I[k], J[k]] = V[k]

O_Julia · May 7, 2020, 10:01pm

How can I obtain exactly same D as in my main() by using your main() function? Because their size are different.

waralex · May 7, 2020, 11:16pm

Did you mean the sizes that are n and m? If so, just add N variable from main function as argument to pre_make_sparse and replace return sparse(I,J,V) with return sparse(I,J,V, N, N)

eliassno · May 7, 2020, 11:33pm

@waralex’s suggestion to precompute I and J will significantly reduce the number of allocations in main().

r and B appear to by symmetric in your example. You could try changing R(n,X,Y,p) to something like:

"""

    symmetrize!(a)

Making a matrix symmetric, in-place.

Code by Petr Krysl.
https://discourse.julialang.org/t/in-place-operations-on-matrices/30908/4

"""

symmetrize!(a) = begin
    @inbounds for c in 1:size(a, 2)
    	for r in c:size(a, 1)
    		a[r, c] += a[c, r]
    		a[c, r] =  a[r, c]
    	end
    end
    a
end

function R(n::Int64, X::Vector, Y::Vector, p::Matrix)::Matrix{Float64}
    r = zeros(n,n)
    @inbounds for j in 1:(n-1), k in (j+1):n
            r[k,j] = sqrt((X[p[k]]-X[p[j]])^2 + (Y[p[k]]-Y[p[j]])^2)
    end
    symmetrize!(r)
end

You could try using the LAPACK functions, like sysv!.

eliassno · May 7, 2020, 11:38pm

Adding the size arguments in sparse should be enough in pre_make_sparse.

If there’s a risk of having duplicate indices, choosing the combine function may be important when I,J,V are built incrementally in the main calculation.
So maybe return sparse(I,J,V,N,N,combine)?

waralex · May 7, 2020, 11:45pm

In the case when it is simply pre-created with elements equal to 0, combine is not needed, I think

I haven’t worked with SparseMatrixCSC before. But I have a feeling that the problem is not so much in the allocations, but in the numerous copying of internal vectors that occurs with random-access inserts.

If I understand the documentation correctly, then the algorithmic complexity of inserting an element that has not yet been in the matrix should be O(N) where N is the number of elements that actually exist in the matrix

O_Julia · May 8, 2020, 12:10am

@waralex Thank you very much for your effort. Your approach greatly increase speed of minimal working example. I hope, I can apply this approach properly to my full program. Thank you again.

O_Julia · May 8, 2020, 12:10am

Thank you, I will try this.

Topic		Replies	Views
Speeding up some non-optimized Julia functions Performance	47	1200	July 19, 2022
Sparse matrix-vector product: much more slow than Matlab Performance matlab , optimization	24	4545	December 20, 2017
How to efficiently construct a large SparseArray? Packages for this? Performance package , performance , parallel , sparse	20	1761	May 15, 2022
Matrix computation Performance	6	766	July 27, 2020
Help with this code (deallocation/pointers on sparse matrices) Performance array , sparse , matrices , sparsearrays	31	695	February 3, 2024

Making efficient some small subroutines

Related topics