Speeding up some non-optimized Julia functions

O_Julia · July 17, 2022, 3:38pm

Hi, I am converting some of my Python codes to Julia language. For minimal working example, I have simplified most time consuming part of my codes and give them below. In the current version. The Python codes are approximately 8x faster than Julia version. Here are the codes

using MKL
using Printf
using SparseArrays
using BenchmarkTools
using NearestNeighbors
using LoopVectorization
import VectorizedRoutines.Matlab: meshgrid


function fun(data,c)
   x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x);
   A,A1,B1,C1,A2,B2,C2 = zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n),zeros(n,n)

   for i in 1:n
         @inbounds for k in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           A1[i,k] = xx/A[i,k]
           A2[i,k] = 1/A[i,k]-xx^2/A[i,k]^3
           B1[i,k] = yy/A[i,k]
           B2[i,k] = 1/A[i,k]-yy^2/A[i,k]^3
           C1[i,k] = zz/A[i,k]
           C2[i,k] = 1/A[i,k]-zz^2/A[i,k]^3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end


function fun2(data)
    x = data[:,1]; y = data[:,2]; z = data[:,3];o = zero(x); II = ones(size(x));
    a =   [II x y z x .^2 x.*y x.*z y .^2 y.*z z .^2 x .^3 x .^2 .*y x .^2 .*z x.*y .^2 x.*y.*z x.*z .^2 y .^3 y .^2 .*z y.*z .^2 z .^3]
    a1 =  [o II o o x*2 y z o o o 3*x .^2 2*x.*y 2*x.*z y .^2 y.*z z .^2 o o o o]
    b1 =  [o o II o o x o y*2 z o o x .^2 o x.*y*2 x.*z o 3*y .^2 2*y.*z z .^2 o]
    c1 =  [o o o II o o x o y z*2 o o x .^2 o x.*y x.*z*2 o y .^2 y.*z*2 3*z .^2]
    a2 = [o o o o II*2 o o o o o x*6 2*y 2*z o o o o o o o]
    b2 = [o o o o o o o II*2 o o o o o x*2 o o y*6 2*z o o]
    c2 = [o o o o o o o o o II*2 o o o o o x*2 o o y*2 z*6]

    return a,a1,b1,c1,a2,b2,c2
end

function main(st,data::Array{Float64,2},p::Array{Int64,1},m)
    N = length(data[:,1])
    W,W11,W12,W13,W21,W22,W23 = spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N),spzeros(N,N)
    
    dd = zeros(20,20);
    @inbounds for i in p
       id=st[i]
       t_data = data[id,:];
       A,A1,B1,C1,A2,B2,C2 = fun(t_data,m)
       a,a1,b1,c1,a2,b2,c2  = fun2(t_data)
       R = [A a; a' dd];
       ind = filter(j -> id[j] == i,axes(id, 1)); 
       L = [[A;a'][:,ind] [A1;a1'][:,ind] [B1;b1'][:,ind] [C1;c1'][:,ind] [A2;a2'][:,ind] [B2;b2'][:,ind] [C2;c2'][:,ind]];

       soln = (R\L)[1:length(t_data[:,1]),:]; 
       W[i,id] = soln[:,1]
       W11[i,id] = soln[:,2]
       W12[i,id] = soln[:,3]
       W13[i,id] = soln[:,4]
       W21[i,id] = soln[:,5]
       W22[i,id] = soln[:,6]
       W23[i,id] = soln[:,7]
    end
    return W,W11,W12,W13,W21,W22,W23
end

N = 15
PT = rand(N^3,3);
ip =collect(1:N^3);
xx,yy,zz =  PT[:,1], PT[:,2], PT[:,3]; data = [xx yy zz];

r = 0.3

balltree = BallTree(data')
st = inrange(balltree, data', r, true)  # https://github.com/KristofferC/NearestNeighbors.jl
m = 0.5

when I benchmark the codes the results are as follows:

@benchmark main(st,PT,ip,m)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 53.736 s (3.01% GC) to evaluate,
 with a memory estimate of 193.54 GiB, over 864968 allocations.

If I increase the value of N, Python performance stands out more. So, my question is: Is it possible to speed up these Julia codes? Is it possible to beat Python. I would be grateful for any help.
Thanks.

Oscar_Smith · July 17, 2022, 4:20pm

yeah this code is very far from optimal. Can you post the ptyhon code for comparison?

lmiq · July 17, 2022, 4:26pm

Julia is column major, so it is better to iterate over the most inner indexes first. In this case, invert the loop order.

O_Julia · July 17, 2022, 4:26pm

The above Julia codes are literal(one-to-one) translation of the Python codes. In Python I just used numba.

O_Julia · July 17, 2022, 4:28pm

Dear Imiq, I have tried column major- row major. There is no significant changes in my case.

lmiq · July 17, 2022, 4:28pm

Slices in Julia copy the array by default. Add @views to the whole function (use @views function main...).

(Unless you actually want the copies, of course)

For the inrange function, you may want to try this alternative: https://m3g.github.io/CellListMap.jl/stable/examples/#Neighbor-lists

It can be faster depending on the data, and runs in parallel.

DNF · July 17, 2022, 4:35pm

Yeah, your main problem seems to be enormous amounts of allocations. Use views, and avoid allocating arrays as much as possible.

I’m also a bit skeptical about the way you are using sparse arrays. I’m not very experienced with sparse arrays, but there might be better ways than to iteratively assign elements to them.

DNF · July 17, 2022, 4:37pm

This line is the most egregious, not because it allocates a lot, but because the allocation is so needless. Just use size(data, 1) instead.

O_Julia · July 17, 2022, 4:39pm

Ok. I will do it. By the way using Threads.@threads in outer most for loop in function fun does not make significant contribution.

Oscar_Smith · July 17, 2022, 4:40pm

threading is the last thing and won’t be an improvement until the rest of the code is decent.

O_Julia · July 17, 2022, 5:07pm

I have applied the suggestions such as adding @views to appropriate places. But, no improvement.

 @benchmark main(st,PT,ip,m)
 BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 45.548 s (3.36% GC) to evaluate,
 with a memory estimate of 191.89 GiB, over 1438599 allocations.

DNF · July 17, 2022, 6:07pm

Views or not, lines like the above allocate a lot. You need to thoroughly go through your code and remove needless allocations. You still have 193Gigs(!) of memory use!

Have you also looked into whether you can improve sparse arrays?

O_Julia · July 17, 2022, 6:12pm

I could not find a way for sparse matrices.

O_Julia · July 17, 2022, 6:27pm

Indeed, Python is also faster in function fun, where there are not much allocations, than Julia 3x times.

DNF · July 17, 2022, 6:53pm

Did you try changing the order of iteration to reflect column major storage?

Then change zero(n, n) to Matrix{Float64}(undef, n, n).

You’re also doing a lot of repeated array access and calculations (e.g. A[i,k]^3 multiple times).

O_Julia · July 17, 2022, 7:03pm

I have used both row and column major, changed zero(n, n) to Matrix{Float64}(undef, n, n) and also avoided repeated calculations ( A[i,k]^3 ). There is no significance change.

DNF · July 17, 2022, 7:11pm

For fun itself? Can you post the updated code and the benchmarking code for fun?

O_Julia · July 17, 2022, 7:13pm

The updated fun

function fun(data,c)
   @views x = data[:,1]; 
   @views y = data[:,2]; 
   @views z = data[:,3]; n = size(data, 1);
   A,A1,B1,C1,A2,B2,C2 = Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n),Matrix{Float64}(undef, n, n)

   for k in 1:n
         @inbounds for i in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           A1[i,k] = xx/A[i,k]
           a3 = A[i,k]^3
           A2[i,k] = 1/A[i,k]-xx^2/a3
           B1[i,k] = yy/A[i,k]
           B2[i,k] = 1/A[i,k]-yy^2/a3
           C1[i,k] = zz/A[i,k]
           C2[i,k] = 1/A[i,k]-zz^2/a3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end

THe benhcmark results:

@benchmark fun(PT,m)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  314.097 ms … 473.421 ms  ┊ GC (min … max):  0.00% … 32.27%
 Time  (median):     353.295 ms               ┊ GC (median):     9.08%
 Time  (mean ± σ):   368.309 ms ±  38.596 ms  ┊ GC (mean ± σ):  13.32% ±  8.09%

               █▁                                                
  ▆▁▁▁▁▁▁▁▁▁▁▁▁██▆▁▆▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  314 ms           Histogram: frequency by time          473 ms <

 Memory estimate: 608.33 MiB, allocs estimate: 15.

As I said, Python is 3x faster in fun function.

lmiq · July 17, 2022, 7:19pm

The fun can be make ~3x faster by fixing the multiple calculations of A[i,j]^3, and the order of the loops. You can get it to be ~5x faster is using @turbo from LoopVectorization:

using LoopVectorization
function fun2(data,c)
   x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x);
   A = Matrix{eltype(data)}(undef, n, n)
   A1, B1, C1, A2, B2, C2 = (similar(A) for _ in 1:6)

   for k in 1:n
         @turbo for i in 1:n
           xx = x[k]-x[i]
           yy = y[k]-y[i]
           zz = z[k]-z[i]
           xyz = sqrt(xx^2+yy^2+zz^2)

           A[i,k] = sqrt(xyz^2+c^2)
           iA = 1 / A[i,k]
           iA3 = iA^3
           A1[i,k] = xx*iA
           A2[i,k] = iA-xx^2*iA3
           B1[i,k] = yy*iA
           B2[i,k] = iA-yy^2*iA3
           C1[i,k] = zz*iA
           C2[i,k] = iA-zz^2*iA3

       end
   end
   return A,A1,B1,C1,A2,B2,C2
end

Nevertheless, since you are using 3D coordinates, and Julia is column-major, all code will benefit from defining the data as a (3,N) matrix, not (N,3), because then x, y and z coordinates of each point are in the same column. (That won’t change much the peformance of this specific function because you are copying the x, y, z coordinates in the first line here). But if you change the layout, you can avoid that and still be efficient. Actually, for that kind of data, a vector of static arrays is the most convenient structure (and has the same memory layout of the (3,N) matrx).

ps: just for the note: without rearranging the code more generally to take advantage of the structure of the data, preallocate arrays, etc, I wouldn’t expect that this function becomes much faster than what you get with @numba. Maybe you can write it in Julia slightly faster, but I would expect something of the same order. If the complete workflow is too slow in Python, than you can think of a more careful rearrangement of the code in Julia to avoid those allocations and get a probably much faster overall code.

DNF · July 17, 2022, 7:23pm

Perhaps minor, but here you needlessly take a square root which you then square again:

I would also try not creating the views, but just writing data[k, 1] instead of x[k], etc.

Edit: I meant to quote @O_Julia.

Topic		Replies	Views
Making efficient some small subroutines Performance	22	1133	May 8, 2020
Sparse matrix-vector product: much more slow than Matlab Performance matlab , optimization	24	4545	December 20, 2017
How to efficiently construct a large SparseArray? Packages for this? Performance package , performance , parallel , sparse	20	1763	May 15, 2022
Why is this simple function twice as slow as its Python version Performance question	97	4433	April 12, 2021
Julia vs Matlab - building a Jacobian matrix New to Julia	30	4224	March 28, 2019

Speeding up some non-optimized Julia functions

Related topics