Speeding up some non-optimized Julia functions

Thank you, this version of fun is faster. I wonder what else can be done for the ‘main’ function?

I can't test much, as I'm far away from my computer, but maybe:

x2 = (x[k]-x[i])^2
y2 = (y[k]-y[i])^2
z2 = (z[k]-z[i])^2
xyz2 = x2+y2+z2 

etc.

Thank you, I will try this.

Is there any multi-threading when using numpy, BTW? I mean, is numpy code implicitly multi-threaded?

I just use numba with @jit(nopython=True, parallel=True)

Can you try disabling parallel to see if that is a reason for the difference? But perhaps numpy still multithreads anyway?

If you can run the code in a way that guarantees single thread, we may be able to narrow down the cause for the difference.

Perhaps some operations are saved by putting the formulas in this form:

xx = x[k]-x[i]
yy = y[k]-y[i]
zz = z[k]-z[i]
A[i,k] = sqrt(xx^2+yy^2+zz^2+c^2)
# iA = 1 / A[i,k]
# iA3 = iA^3
A1[i,k] = xx / A[i,k]
A2[i,k] = (1-A1[i,k]^2) / A[i,k]
B1[i,k] = yy / A[i,k]
B2[i,k] = (1-B1[i,k]^2) / A[i,k]
C1[i,k] = zz / A[i,k]
C2[i,k] = (1-C1[i,k]^2) / A[i,k]

I cannot go into details on your whole code but, for example, if you preallocate a buffer to contain those A1 … C2 arrays, you can make fun non-allocating. Since you are calling fun many times, this can make a huge difference in the final timing, but you need to reorganize the outer code (and maybe estimate the size of the buffers). For fun you can do:

@views function fun4(data, c, buff)
    x = data[:,1]; y = data[:,2]; z = data[:,3]; n = length(x)
    A, A1, B1, C1, A2, B2, C2 = (buff[i] for i in 1:7)

    for k in 1:n
        @inbounds for i in 1:n
            xx = x[k]-x[i]
            yy = y[k]-y[i]
            zz = z[k]-z[i]
            xyz = sqrt(xx^2+yy^2+zz^2)

            A[i,k] = sqrt(xyz^2+c^2)
            iA = 1 / A[i,k]
            iA3 = iA^3
            A1[i,k] = xx*iA
            A2[i,k] = iA-xx^2*iA3
            B1[i,k] = yy*iA
            B2[i,k] = iA-yy^2*iA3
            C1[i,k] = zz*iA
            C2[i,k] = iA-zz^2*iA3
        end
    end
    return A, A1, B1, C1, A2, B2, C2
end

To get:

julia> data = rand(1000,3);

julia> buff = [Matrix{Float64}(undef, size(data,1), size(data,1)) for _ in 1:7];

julia> @btime fun4($data, 1.5, $buff);
  7.638 ms (0 allocations: 0 bytes)

If you find a way to eliminate the allocations that are happening in all the hot parts of the code, the result will certainly be much, much faster.

When I disable parallel (i.e. @jit(nopython=True)) in Python, I get the following timings:

410 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If I activate parallel, I get:

118 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

lmiq, thank you very much for your suggestions. I will go through them, and I hope I can obtain satisfactory improvements.

I haven't read the whole thread, but I'm surprised to find no mention of profiling, which is the first thing you should do when trying to optimize your code. Many people here have proposed clever solutions, but profiling is the step where you figure out what the problem actually is. There are many tools for the job, but the @profview macro from the Julia VSCode extension is my favorite.
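
For example, a minimal sketch (assuming the thread's main function and its arguments st, data, p, m are already defined, and that you are in the VSCode Julia REPL where @profview is available):

# Run the workload once first so compilation is not what gets profiled.
main(st, data, p, m)

# In the VSCode Julia REPL, @profview opens a flame graph of the call:
@profview main(st, data, p, m)

# Without VSCode, the built-in profiler works too:
using Profile
@profile main(st, data, p, m)
Profile.print(format = :flat, sortedby = :count)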

I had profiled the code before posting here, but the results of the profiling are very confusing to me. Using @profview, I can see that for large values of N the following lines are the most time-consuming.

1. In fun4 of lmiq:
   iA3 = iA^3
2. In fun2 of the original post:
   a = [II x y z x ...]
3. In the main function of the original post:
   @views ind = filter(j -> id[j] == i, axes(id, 1));

One crude form of profiling is just to separate the work from the allocation of all these arrays. Doing this for fun, it looks like >90% of the time is spent allocating all those arrays. (Edit: close to what @lmiq said, I see.)

julia> @btime fun($data, $m)  # data == PT, m == c
  min 742.817 ms, mean 1.029 s (14 allocations, 608.33 MiB)
  
julia> @btime fun2!($A,$A1,$B1,$C1,$A2,$B2,$C2, $data, $m);
  min 70.387 ms, mean 70.631 ms (0 allocations)
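
For reference, the separation behind these two measurements could look roughly like this (a sketch only: fun2! is not shown in the thread, so just its benchmarked signature is assumed, with a body like the loop in fun4 above):

# Hypothetical wrapper: allocate the outputs once, then hand them to the
# non-allocating kernel, which is where the actual work happens.
function fun_alloc(data, c)
    n = size(data, 1)
    A, A1, B1, C1, A2, B2, C2 = (Matrix{Float64}(undef, n, n) for _ in 1:7)
    fun2!(A, A1, B1, C1, A2, B2, C2, data, c)
    return A, A1, B1, C1, A2, B2, C2
end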

None of them are used after you make L. (In that step, the vcat [A1;a1'] makes an array, the indexing [A1;a1'][:,ind] makes another, and the overall hcat for L makes yet another.) Maybe you can avoid making these at all and directly build L?

Aren't the matrices in function fun symmetric? In that case, instead of looping over a square you could just loop over a triangle.
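
A sketch of what that could look like, reusing the names from fun4 above (fun_sym! is just an illustrative name; note that A, A2, B2, C2 are symmetric, while A1, B1, C1 only change sign when i and k swap):

# Fill only the lower triangle and mirror the result into the upper one.
function fun_sym!(A, A1, B1, C1, A2, B2, C2, x, y, z, c)
    n = length(x)
    for k in 1:n, i in k:n
        xx = x[k] - x[i]; yy = y[k] - y[i]; zz = z[k] - z[i]
        a   = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        iA  = 1 / a
        iA3 = iA^3
        A[i, k]  = a;               A[k, i]  = a
        A1[i, k] = xx * iA;         A1[k, i] = -A1[i, k]
        A2[i, k] = iA - xx^2 * iA3; A2[k, i] = A2[i, k]
        B1[i, k] = yy * iA;         B1[k, i] = -B1[i, k]
        B2[i, k] = iA - yy^2 * iA3; B2[k, i] = B2[i, k]
        C1[i, k] = zz * iA;         C1[k, i] = -C1[i, k]
        C2[i, k] = iA - zz^2 * iA3; C2[k, i] = C2[i, k]
    end
    return A, A1, B1, C1, A2, B2, C2
end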

I think that fun and fun2 need to be called only once. In fact, you can compute L and R before the loop as follows:

function compute_L_R(data::AbstractMatrix{<:Real}, c::Real)
    x = @view data[:, 1]
    y = @view data[:, 2]
    z = @view data[:, 3]

    T = eltype(x)
    n = length(x)
    L = zeros(T, n + 20, 7 * n)
    R = zeros(T, n + 20, n + 20)

    @views for i in 1:n, k in 1:n
        xx = x[k] - x[i]
        yy = y[k] - y[i]
        zz = z[k] - z[i]

        a = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        ai = 1 / a
        a3 = 1 / a^3

        L[i, k] = a
        L[i, k+1n] = xx * ai
        L[i, k+2n] = yy * ai
        L[i, k+3n] = zz * ai
        L[i, k+4n] = ai - xx^2 * a3
        L[i, k+5n] = ai - yy^2 * a3
        L[i, k+6n] = ai - zz^2 * a3
    end

    # a, a1, ...
    rows = n .+ (1:20)
    o = zero(x)
    II = ones(size(x))
    L[n.+(1:20), 1:n] = [II x y z x .^ 2 x .* y x .* z y .^ 2 y .* z z .^ 2 x .^ 3 x .^ 2 .* y x .^ 2 .* z x .* y .^ 2 x .* y .* z x .* z .^ 2 y .^ 3 y .^ 2 .* z y .* z .^ 2 z .^ 3]'
    L[rows, 1n.+(1:n)] = [o II o o x * 2 y z o o o 3 * x .^ 2 2 * x .* y 2 * x .* z y .^ 2 y .* z z .^ 2 o o o o]'
    L[rows, 2n.+(1:n)] = [o o II o o x o y * 2 z o o x .^ 2 o x .* y * 2 x .* z o 3 * y .^ 2 2 * y .* z z .^ 2 o]'
    L[rows, 3n.+(1:n)] = [o o o II o o x o y z * 2 o o x .^ 2 o x .* y x .* z * 2 o y .^ 2 y .* z * 2 3 * z .^ 2]'
    L[rows, 4n.+(1:n)] = [o o o o II * 2 o o o o o x * 6 2 * y 2 * z o o o o o o o]'
    L[rows, 5n.+(1:n)] = [o o o o o o o II * 2 o o o o o x * 2 o o y * 6 2 * z o o]'
    L[rows, 6n.+(1:n)] = [o o o o o o o o o II * 2 o o o o o x * 2 o o y * 2 z * 6]'

    # R
    @views R[:, 1:n] .= L[:, 1:n]
    @views R[1:n, (n+1):end] .= L[(n+1):end, 1:n]'
    return L, R
end

and then the main function can be “simplified” in the following way

function main2(st, data::AbstractMatrix{<:Real}, perm, m::Real)
    T = eltype(data)
    n = length(data[:, 1])
    LL, RR = compute_L_R(data, m)
    W, W11, W12, W13, W21, W22, W23 = (zeros(T, n, n) for _ in 1:7)

    @views for i in perm
        id = st[i]
        ind = filter(j -> id[j] == i, axes(id, 1))
        rows = vcat(id, n .+ (1:20))
        cols = id[ind] .+ (0:n:(6*n))
        R = RR[rows, rows]
        L = LL[rows, cols]

        soln = (R\L)[1:length(id), :]
        W[i, id] = soln[:, 1]
        W11[i, id] = soln[:, 2]
        W12[i, id] = soln[:, 3]
        W13[i, id] = soln[:, 4]
        W21[i, id] = soln[:, 5]
        W22[i, id] = soln[:, 6]
        W23[i, id] = soln[:, 7]
    end
    return W, W11, W12, W13, W21, W22, W23
end

This function returns different results than the original one. I tested that L and R in both implementations are almost the same (in the sense that isapprox is true), but the solve (R\L) differs quite a lot. Maybe I made some mistakes in the code, but the idea should be correct.

julia> @time main(st, data, p, m);
 99.148773 seconds (884.86 k allocations: 189.158 GiB, 2.14% gc time)

julia> @time main2(st, data, p, m);
  7.372209 seconds (89.43 k allocations: 3.495 GiB, 2.27% gc time)

I got the biggest performance improvement when I replaced spzeros with zeros, but I don’t know why.

Thank you VaclavMacha,
indeed, without using your implementation but just changing spzeros to zeros in the most up-to-date code here, I have obtained the same timings as yours. But I had thought that using spzeros might decrease memory consumption. Strange.
I will try your code as well.

For performance reasons, you want to completely avoid indexing into and incrementally growing a sparse array. If you are lucky and your sparse array has ordered entries (typically not true), you still have to search for the relevant entries whenever you index into it. This is especially true when accessing by row, as sparse arrays store consecutive entries by column. That lookup is at best O(log(n)). Next, any time you add new entries to a sparse array, it must allocate the necessary additional space and copy/shift over the displaced indices and values.
So the best way to make use of sparse arrays in this circumstance seems to be to collect portions of the solution into buffers (for example with push!()) and then call sparse once on those buffers (Sparse Arrays · The Julia Language).
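
A minimal sketch of that pattern (the entries and the 4×4 size are made up for illustration):

using SparseArrays

# Collect (row, column, value) triples in growable buffers...
Is, Js, Vs = Int[], Int[], Float64[]
for (i, j, v) in ((1, 1, 2.0), (3, 2, -1.5), (2, 4, 0.7))   # hypothetical entries
    push!(Is, i); push!(Js, j); push!(Vs, v)
end

# ...then build the sparse matrix in a single call instead of growing it entry by entry.
S = sparse(Is, Js, Vs, 4, 4)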

For a few days now, I have been trying to optimize this code with the help of many people here (thanks to all). In Python, I did not do any optimization; I just used numba and got very good timings compared to Julia. My point is: to get fast Julia code you have to spend a lot of time on optimization, but this is not true in Python, where you just write your code and it is generally almost optimized already. So, is it wise to use Julia if you have to spend days to get optimized Julia code? Isn't that a disadvantage?

No, you are just new to Julia and need to learn different approaches to programming. In particular, directly translating between languages is often problematic, because techniques that are advantageous in one language are disadvantageous in another (such as row- vs. column-major order, or slices vs. looping).

And furthermore, did we not just establish that the whole performance difference is due to using parallelism in Python but not in Julia?

Using @inbounds with 1:n is very, very dangerous. What if data has offset indices?

Never use @inbounds unless you can provably guarantee that the indices are valid.

For example, using 1:n is OK with a zeros(n, n) array (though it is still, imho, poor style), but it is not OK for a general AbstractMatrix.
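
A minimal sketch of the safer pattern (pairwise_dist! is just an illustrative name): iterate the arrays' own axes, so the indices are valid for any AbstractArray, offset or not.

# Every index comes straight from the arrays themselves, which is what
# justifies @inbounds here.
function pairwise_dist!(A::AbstractMatrix, x::AbstractVector)
    @assert axes(A, 1) == axes(A, 2) == axes(x, 1)
    for k in axes(A, 2), i in axes(A, 1)
        @inbounds A[i, k] = abs(x[k] - x[i])
    end
    return A
end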
