Speeding up some non-optimized Julia functions

Thank you, this version of fun is faster. I wonder what else can be done for the ‘main’ function?

I can't test much, as I'm far away from my computer, but maybe:

x2 = (x[k]-x[i])^2
y2 = (y[k]-y[i])^2
z2 = (z[k]-z[i])^2
xyz2 = x2+y2+z2 

etc.

Thank you, I will try this.

Is there any multi-threading when using numpy, BTW? I mean, is numpy code implicitly multi-threaded?

I just use numba with @jit(nopython=True, parallel=True)

Can you try disabling parallel to see if that is a reason for the difference? But perhaps numpy still multithreads anyway?

If you can run the code in a way that guarantees single thread, we may be able to narrow down the cause for the difference.

Perhaps some operations are saved by putting the formulas in this form:

xx = x[k]-x[i]
yy = y[k]-y[i]
zz = z[k]-z[i]
A[i,k] = sqrt(xx^2+yy^2+zz^2+c^2)
# iA = 1 / A[i,k]
# iA3 = iA^3
A1[i,k] = xx / A[i,k]
A2[i,k] = (1-A1[i,k]^2) / A[i,k]
B1[i,k] = yy / A[i,k]
B2[i,k] = (1-B1[i,k]^2) / A[i,k]
C1[i,k] = zz / A[i,k]
C2[i,k] = (1-C1[i,k]^2) / A[i,k]

I cannot go into details on your whole code but, for example, if you preallocate a buffer to contain those A1 … C2 arrays, you can make fun non-allocating. Since you are calling fun many times, this can make a huge difference in the final timing, but you need to reorganize the outer code (and maybe estimate the size of the buffers). For fun you can do:

@views function fun4(data, c, buff)
    x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x)
    A, A1, B1, C1, A2, B2, C2 = (buff[i] for i in 1:7)

    for k in 1:n
        @inbounds for i in 1:n
            xx = x[k]-x[i]
            yy = y[k]-y[i]
            zz = z[k]-z[i]
            xyz = sqrt(xx^2+yy^2+zz^2)

            A[i,k] = sqrt(xyz^2+c^2)
            iA = 1 / A[i,k]
            iA3 = iA^3
            A1[i,k] = xx*iA
            A2[i,k] = iA-xx^2*iA3
            B1[i,k] = yy*iA
            B2[i,k] = iA-yy^2*iA3
            C1[i,k] = zz*iA
            C2[i,k] = iA-zz^2*iA3
        end
    end
    return A, A1, B1, C1, A2, B2, C2
end

To get:

julia> data = rand(1000,3);

julia> buff = [Matrix{Float64}(undef, size(data,1), size(data,1)) for _ in 1:7];

julia> @btime fun4($data, 1.5, $buff);
  7.638 ms (0 allocations: 0 bytes)

If you find a way to eliminate the allocations that are happening in all the hot parts of the code, the result will certainly be much, much faster.

When I disable parallel (i.e. @jit(nopython=True)) in Python, I get the following timings:

410 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If I activate parallel, I get:

118 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

lmiq, thank you very much for your suggestions. I will go through them, and I hope I can obtain satisfactory improvements.

I haven't read the whole thread, but I'm surprised to find no mention of profiling, which is the first thing you should do when trying to optimize your code. Many people here have proposed clever solutions, but profiling is the step where you figure out what the problem actually is. There are many tools for the job, but the @profview macro from the Julia VSCode extension is my favorite.
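
For example, a minimal sketch (assuming the thread's main function and its arguments st, data, p, m are already defined, and that you are in the VSCode Julia REPL where @profview is available):

# Run the workload once first so compilation is not what gets profiled.
main(st, data, p, m)

# In the VSCode Julia REPL, @profview opens a flame graph of the call:
@profview main(st, data, p, m)

# Without VSCode, the built-in profiler works too:
using Profile
@profile main(st, data, p, m)
Profile.print(format = :flat, sortedby = :count)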

I had profiled the code before posting here, but the results of the profiling are very confusing to me. Using @profview, I can see that for large values of N the following lines are the most time-consuming.

1. In fun4 of lmiq:
   iA3 = iA^3
2. In fun2 of the original post:
   a = [II x y z x ...]
3. In the main function of the original post:
   @views ind = filter(j -> id[j] == i, axes(id, 1));

One crude form of profiling is just to separate the work from the allocation of all these arrays. Doing this for fun, it looks like >90% of the time is spent allocating all those arrays. (Edit: close to what @lmiq said, I see.)

julia> @btime fun($data, $m)  # data == PT, m == c
  min 742.817 ms, mean 1.029 s (14 allocations, 608.33 MiB)
  
julia> @btime fun2!($A,$A1,$B1,$C1,$A2,$B2,$C2, $data, $m);
  min 70.387 ms, mean 70.631 ms (0 allocations)
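
For reference, the separation behind these two measurements could look roughly like this (a sketch only: fun2! is not shown in the thread, so just its benchmarked signature is assumed, with a body like the loop in fun4 above):

# Hypothetical wrapper: allocate the outputs once, then hand them to the
# non-allocating kernel, which is where the actual work happens.
function fun_alloc(data, c)
    n = size(data, 1)
    A, A1, B1, C1, A2, B2, C2 = (Matrix{Float64}(undef, n, n) for _ in 1:7)
    fun2!(A, A1, B1, C1, A2, B2, C2, data, c)
    return A, A1, B1, C1, A2, B2, C2
end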

None of them are used after you make L. (In that step, the vcat [A1;a1'] makes an array, the indexing [A1;a1'][:,ind] makes another, and the overall hcat for L makes yet another.) Maybe you can avoid making these at all and directly build L?

Aren't the matrices in function fun symmetric? In that case, instead of looping over a square you could just loop over a triangle.
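
A sketch of what that could look like, reusing the names from fun4 above (fun_sym! is just an illustrative name; note that A, A2, B2, C2 are symmetric, while A1, B1, C1 only change sign when i and k swap):

# Fill only the lower triangle and mirror the result into the upper one.
function fun_sym!(A, A1, B1, C1, A2, B2, C2, x, y, z, c)
    n = length(x)
    for k in 1:n, i in k:n
        xx = x[k] - x[i]; yy = y[k] - y[i]; zz = z[k] - z[i]
        a   = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        iA  = 1 / a
        iA3 = iA^3
        A[i, k]  = a;               A[k, i]  = a
        A1[i, k] = xx * iA;         A1[k, i] = -A1[i, k]
        A2[i, k] = iA - xx^2 * iA3; A2[k, i] = A2[i, k]
        B1[i, k] = yy * iA;         B1[k, i] = -B1[i, k]
        B2[i, k] = iA - yy^2 * iA3; B2[k, i] = B2[i, k]
        C1[i, k] = zz * iA;         C1[k, i] = -C1[i, k]
        C2[i, k] = iA - zz^2 * iA3; C2[k, i] = C2[i, k]
    end
    return A, A1, B1, C1, A2, B2, C2
end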

I think that fun and fun2 need to be called only once. In fact, you can compute L and R before the loop as follows:

function compute_L_R(data::AbstractMatrix{<:Real}, c::Real)
    x = @view data[:, 1]
    y = @view data[:, 2]
    z = @view data[:, 3]

    T = eltype(x)
    n = length(x)
    L = zeros(T, n + 20, 7 * n)
    R = zeros(T, n + 20, n + 20)

    @views for i in 1:n, k in 1:n
        xx = x[k] - x[i]
        yy = y[k] - y[i]
        zz = z[k] - z[i]

        a = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        ai = 1 / a
        a3 = 1 / a^3

        L[i, k] = a
        L[i, k+1n] = xx * ai
        L[i, k+2n] = yy * ai
        L[i, k+3n] = zz * ai
        L[i, k+4n] = ai - xx^2 * a3
        L[i, k+5n] = ai - yy^2 * a3
        L[i, k+6n] = ai - zz^2 * a3
    end

    # a, a1, ...
    rows = n .+ (1:20)
    o = zero(x)
    II = ones(size(x))
    L[n.+(1:20), 1:n] = [II x y z x .^ 2 x .* y x .* z y .^ 2 y .* z z .^ 2 x .^ 3 x .^ 2 .* y x .^ 2 .* z x .* y .^ 2 x .* y .* z x .* z .^ 2 y .^ 3 y .^ 2 .* z y .* z .^ 2 z .^ 3]'
    L[rows, 1n.+(1:n)] = [o II o o x * 2 y z o o o 3 * x .^ 2 2 * x .* y 2 * x .* z y .^ 2 y .* z z .^ 2 o o o o]'
    L[rows, 2n.+(1:n)] = [o o II o o x o y * 2 z o o x .^ 2 o x .* y * 2 x .* z o 3 * y .^ 2 2 * y .* z z .^ 2 o]'
    L[rows, 3n.+(1:n)] = [o o o II o o x o y z * 2 o o x .^ 2 o x .* y x .* z * 2 o y .^ 2 y .* z * 2 3 * z .^ 2]'
    L[rows, 4n.+(1:n)] = [o o o o II * 2 o o o o o x * 6 2 * y 2 * z o o o o o o o]'
    L[rows, 5n.+(1:n)] = [o o o o o o o II * 2 o o o o o x * 2 o o y * 6 2 * z o o]'
    L[rows, 6n.+(1:n)] = [o o o o o o o o o II * 2 o o o o o x * 2 o o y * 2 z * 6]'

    # R
    @views R[:, 1:n] .= L[:, 1:n]
    @views R[1:n, (n+1):end] .= L[(n+1):end, 1:n]'
    return L, R
end

and then the main function can be “simplified” in the following way

function main2(st, data::AbstractMatrix{<:Real}, perm, m::Real)
    T = eltype(data)
    n = length(data[:, 1])
    LL, RR = compute_L_R(data, m)
    W, W11, W12, W13, W21, W22, W23 = (zeros(T, n, n) for _ in 1:7)

    @views for i in perm
        id = st[i]
        ind = filter(j -> id[j] == i, axes(id, 1))
        rows = vcat(id, n .+ (1:20))
        cols = id[ind] .+ (0:n:(6*n))
        R = RR[rows, rows]
        L = LL[rows, cols]

        soln = (R\L)[1:length(id), :]
        W[i, id] = soln[:, 1]
        W11[i, id] = soln[:, 2]
        W12[i, id] = soln[:, 3]
        W13[i, id] = soln[:, 4]
        W21[i, id] = soln[:, 5]
        W22[i, id] = soln[:, 6]
        W23[i, id] = soln[:, 7]
    end
    return W, W11, W12, W13, W21, W22, W23
end

This function returns different results than the original one. I tested that L and R in both implementations are almost the same (in the sense that isapprox is true), but the solve (R\L) differs quite a lot. Maybe I made some mistakes in the code, but the idea should be correct.

julia> @time main(st, data, p, m);
 99.148773 seconds (884.86 k allocations: 189.158 GiB, 2.14% gc time)

julia> @time main2(st, data, p, m);
  7.372209 seconds (89.43 k allocations: 3.495 GiB, 2.27% gc time)

I got the biggest performance improvement when I replaced spzeros with zeros, but I don’t know why.

Thank you VaclavMacha,
indeed, without using your implementation but just changing spzeros to zeros in the most up-to-date code here, I have obtained the same timings as yours. But I had thought that using spzeros might decrease memory consumption. Strange.
I will try your code as well.

For performance reasons, you want to completely avoid indexing into and incrementally growing a sparse array. If you are lucky and your sparse array has ordered entries (typically not true), you still have to search for the relevant entries whenever you index into it. This is especially true when accessing by row, as sparse arrays store consecutive entries by column. That lookup is at best O(log(n)). Next, any time you add new entries to a sparse array, it must allocate the necessary additional space and copy/shift over the displaced indices and values.
So the best way to make use of sparse arrays in this circumstance seems to be to collect portions of the solution into buffers (for example with push!()) and then call sparse once on those buffers (Sparse Arrays · The Julia Language).
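
A minimal sketch of that pattern (the entries and the 4×4 size are made up for illustration):

using SparseArrays

# Collect (row, column, value) triples in growable buffers...
Is, Js, Vs = Int[], Int[], Float64[]
for (i, j, v) in ((1, 1, 2.0), (3, 2, -1.5), (2, 4, 0.7))   # hypothetical entries
    push!(Is, i); push!(Js, j); push!(Vs, v)
end

# ...then build the sparse matrix in a single call instead of growing it entry by entry.
S = sparse(Is, Js, Vs, 4, 4)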

For a few days now, I have been trying to optimize this code with the help of many people here (thanks to all). In Python, I did not do any optimization; I just used numba and got very good timings compared to Julia. My point is: to get fast Julia code you have to spend a lot of time on optimization, but this is not true in Python, where you just write your code and it is generally almost optimized already. So, is it wise to use Julia if you have to spend days to get optimized Julia code? Isn't that a disadvantage?

No, you are just new to Julia and need to learn different approaches to programming. In particular, directly translating between languages is often problematic, because techniques that are advantageous in one language are disadvantageous in another (such as row- vs. column-major order, or slices vs. looping).

And furthermore, did we not just establish that the whole performance difference is due to using parallelism in Python but not in Julia?

Using @inbounds with 1:n is very, very dangerous. What if data has offset indices?

Never use @inbounds unless you can provably guarantee that the indices are valid.

For example, using 1:n is OK with a zeros(n, n) array (though it is still, imho, poor style), but it is not OK for a general AbstractMatrix.
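
A minimal sketch of the safer pattern (pairwise_dist! is just an illustrative name): iterate the arrays' own axes, so the indices are valid for any AbstractArray, offset or not.

# Every index comes straight from the arrays themselves, which is what
# justifies @inbounds here.
function pairwise_dist!(A::AbstractMatrix, x::AbstractVector)
    @assert axes(A, 1) == axes(A, 2) == axes(x, 1)
    for k in axes(A, 2), i in axes(A, 1)
        @inbounds A[i, k] = abs(x[k] - x[i])
    end
    return A
end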
