Speeding up some non-optimized Julia functions

Hi, I am converting some of my Python codes to Julia language. For minimal working example, I have simplified most time consuming part of my codes and give them below. In the current version. The Python codes are approximately 8x faster than Julia version. Here are the codes

using MKL
using Printf
using SparseArrays
using BenchmarkTools
using NearestNeighbors
using LoopVectorization
import VectorizedRoutines.Matlab: meshgrid


function fun(data,c)
   x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x);
   A,A1,B1,C1,A2,B2,C2 = zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n)

   for i in 1:n
         @inbounds for k in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           A1[i,k] = xx/A[i,k]
           A2[i,k] = 1/A[i,k]-xx^2/A[i,k]^3
           B1[i,k] = yy/A[i,k]
           B2[i,k] = 1/A[i,k]-yy^2/A[i,k]^3
           C1[i,k] = zz/A[i,k]
           C2[i,k] = 1/A[i,k]-zz^2/A[i,k]^3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end


function fun2(data)
    x = data[:,1]; y = data[:,2]; z = data[:,3];o = zero(x); II = ones(size(x));
    a =   [II x y z x .^2 x.*y x.*z y .^2 y.*z z .^2 x .^3 x .^2 .*y x .^2 .*z x.*y .^2 x.*y.*z x.*z .^2 y .^3 y .^2 .*z y.*z .^2 z .^3]
    a1 =  [o II o o x*2 y z o o o 3*x .^2 2*x.*y 2*x.*z y .^2 y.*z z .^2 o o o o]
    b1 =  [o o II o o x o y*2 z o o x .^2 o x.*y*2 x.*z o 3*y .^2 2*y.*z z .^2 o]
    c1 =  [o o o II o o x o y z*2 o o x .^2 o x.*y x.*z*2 o y .^2 y.*z*2 3*z .^2]
    a2 = [o o o o II*2 o o o o o x*6 2*y 2*z o o o o o o o]
    b2 = [o o o o o o o II*2 o o o o o x*2 o o y*6 2*z o o]
    c2 = [o o o o o o o o o II*2 o o o o o x*2 o o y*2 z*6]

    return a,a1,b1,c1,a2,b2,c2
end

function main(st,data::Array{Float64,2},p::Array{Int64,1},m)
    N = length(data[:,1])
    W,W11,W12,W13,W21,W22,W23 = spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N)
    
    dd = zeros(20,20);
    @inbounds for i in p
       id=st[i]
       t_data = data[id,:];
       A,A1,B1,C1,A2,B2,C2 = fun(t_data,m)
       a,a1,b1,c1,a2,b2,c2  = fun2(t_data)
       R = [A a; a' dd];
       ind = filter(j -> id[j] == i,axes(id, 1)); 
       L = [[A;a'][:,ind] [A1;a1'][:,ind] [B1;b1'][:,ind] [C1;c1'][:,ind] [A2;a2'][:,ind] [B2;b2'][:,ind] [C2;c2'][:,ind]];

       soln = (R\L)[1:length(t_data[:,1]),:]; 
       W[i,id] = soln[:,1]
       W11[i,id] = soln[:,2]
       W12[i,id] = soln[:,3]
       W13[i,id] = soln[:,4]
       W21[i,id] = soln[:,5]
       W22[i,id] = soln[:,6]
       W23[i,id] = soln[:,7]
    end
    return W,W11,W12,W13,W21,W22,W23
end

N = 15
PT = rand(N^3,3);
ip =collect(1:N^3);
xx,yy,zz =  PT[:,1], PT[:,2], PT[:,3]; data = [xx yy zz];

r = 0.3

balltree = BallTree(data')
st = inrange(balltree, data', r, true)  # https://github.com/KristofferC/NearestNeighbors.jl
m = 0.5 

when I benchmark the codes the results are as follows:

@benchmark main(st,PT,ip,m)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 53.736 s (3.01% GC) to evaluate,
 with a memory estimate of 193.54 GiB, over 864968 allocations.

If I increase the value of N, Python performance stands out more. So, my question is: Is it possible to speed up these Julia codes? Is it possible to beat Python. I would be grateful for any help.
Thanks.

yeah this code is very far from optimal. Can you post the ptyhon code for comparison?

Julia is column major, so it is better to iterate over the most inner indexes first. In this case, invert the loop order.

The above Julia codes are literal(one-to-one) translation of the Python codes. In Python I just used numba.

Dear Imiq, I have tried column major- row major. There is no significant changes in my case.

1 Like

Slices in Julia copy the array by default. Add @views to the whole function (use @views function main...).

(Unless you actually want the copies, of course)

For the inrange function, you may want to try this alternative: https://m3g.github.io/CellListMap.jl/stable/examples/#Neighbor-lists

It can be faster depending on the data, and runs in parallel.

Yeah, your main problem seems to be enormous amounts of allocations. Use views, and avoid allocating arrays as much as possible.

I’m also a bit skeptical about the way you are using sparse arrays. I’m not very experienced with sparse arrays, but there might be better ways than to iteratively assign elements to them.

3 Likes

This line is the most egregious, not because it allocates a lot, but because the allocation is so needless. Just use size(data, 1) instead.

3 Likes

Ok. I will do it. By the way using Threads.@threads in outer most for loop in function fun does not make significant contribution.

threading is the last thing and won’t be an improvement until the rest of the code is decent.

3 Likes

I have applied the suggestions such as adding @views to appropriate places. But, no improvement.

 @benchmark main(st,PT,ip,m)
 BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 45.548 s (3.36% GC) to evaluate,
 with a memory estimate of 191.89 GiB, over 1438599 allocations.

Views or not, lines like the above allocate a lot. You need to thoroughly go through your code and remove needless allocations. You still have 193Gigs(!) of memory use!

Have you also looked into whether you can improve sparse arrays?

I could not find a way for sparse matrices.

Indeed, Python is also faster in function fun, where there are not much allocations, than Julia 3x times.

Did you try changing the order of iteration to reflect column major storage?

Then change zero(n, n) to Matrix{Float64}(undef, n, n).

You’re also doing a lot of repeated array access and calculations (e.g. A[i,k]^3 multiple times).

I have used both row and column major, changed zero(n, n) to Matrix{Float64}(undef, n, n) and also avoided repeated calculations ( A[i,k]^3 ). There is no significance change.

For fun itself? Can you post the updated code and the benchmarking code for fun?

The updated fun

function fun(data,c)
   @views x = data[:,1]; 
   @views y = data[:,2]; 
   @views z = data[:,3]; n = size(data, 1);
   A,A1,B1,C1,A2,B2,C2 = Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n)

   for k in 1:n
         @inbounds for i in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           A1[i,k] = xx/A[i,k]
           a3 = A[i,k]^3
           A2[i,k] = 1/A[i,k]-xx^2/a3
           B1[i,k] = yy/A[i,k]
           B2[i,k] = 1/A[i,k]-yy^2/a3
           C1[i,k] = zz/A[i,k]
           C2[i,k] = 1/A[i,k]-zz^2/a3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end

THe benhcmark results:

@benchmark fun(PT,m)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  314.097 ms … 473.421 ms  β”Š GC (min … max):  0.00% … 32.27%
 Time  (median):     353.295 ms               β”Š GC (median):     9.08%
 Time  (mean Β± Οƒ):   368.309 ms Β±  38.596 ms  β”Š GC (mean Β± Οƒ):  13.32% Β±  8.09%

               β–ˆβ–                                                
  β–†β–β–β–β–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–†β–β–†β–β–†β–β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–† ▁
  314 ms           Histogram: frequency by time          473 ms <

 Memory estimate: 608.33 MiB, allocs estimate: 15.

As I said, Python is 3x faster in fun function.

The fun can be make ~3x faster by fixing the multiple calculations of A[i,j]^3, and the order of the loops. You can get it to be ~5x faster is using @turbo from LoopVectorization:

using LoopVectorization
function fun2(data,c)
   x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x);
   A = Matrix{eltype(data)}(undef, n, n)
   A1, B1, C1, A2, B2, C2 = (similar(A) for _ in 1:6)

   for k in 1:n
         @turbo for i in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           iA = 1 / A[i,k]
           iA3 = iA^3
           A1[i,k] = xx*iA
           A2[i,k] = iA-xx^2*iA3
           B1[i,k] = yy*iA
           B2[i,k] = iA-yy^2*iA3
           C1[i,k] = zz*iA
           C2[i,k] = iA-zz^2*iA3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end

Nevertheless, since you are using 3D coordinates, and Julia is column-major, all code will benefit from defining the data as a (3,N) matrix, not (N,3), because then x, y and z coordinates of each point are in the same column. (That won’t change much the peformance of this specific function because you are copying the x, y, z coordinates in the first line here). But if you change the layout, you can avoid that and still be efficient. Actually, for that kind of data, a vector of static arrays is the most convenient structure (and has the same memory layout of the (3,N) matrx).

ps: just for the note: without rearranging the code more generally to take advantage of the structure of the data, preallocate arrays, etc, I wouldn’t expect that this function becomes much faster than what you get with @numba. Maybe you can write it in Julia slightly faster, but I would expect something of the same order. If the complete workflow is too slow in Python, than you can think of a more careful rearrangement of the code in Julia to avoid those allocations and get a probably much faster overall code.

3 Likes

Perhaps minor, but here you needlessly take a square root which you then square again:

I would also try not creating the views, but just writing data[k, 1] instead of x[k], etc.

Edit: I meant to quote @O_Julia.

2 Likes