Thank you, this version of `fun` is faster. I wonder what else can be done for the `main` function?
I can’t test much since I’m far away from my computer, but maybe:
x2 = (x[k]-x[i])^2
y2 = (y[k]-y[i])^2
z2 = (z[k]-z[i])^2
xyz2 = x2+y2+z2
etc.
Thank you, I’ll try this.
Is there any multi-threading while using numpy, BTW? I mean, is numpy code implicitly multi-threaded?
I just use numba with @jit(nopython=True, parallel=True)
Can you try disabling parallel to see if that is a reason for the difference? But perhaps numpy still multithreads anyway?
If you can run the code in a way that guarantees single thread, we may be able to narrow down the cause for the difference.
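For what it’s worth, on the Julia side you can pin BLAS to one thread as well, so both sides of the comparison are guaranteed single-threaded. A minimal sketch using the standard library:

using LinearAlgebra

BLAS.set_num_threads(1)       # force single-threaded BLAS operations
@show Threads.nthreads()      # Julia-level threads (fixed at startup, e.g. julia -t 1)
@show BLAS.get_num_threads()  # confirm BLAS is now single-threaded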
Perhaps some operation is saved by putting the formulas in this form:

xx = x[k] - x[i]
yy = y[k] - y[i]
zz = z[k] - z[i]
A[i,k] = sqrt(xx^2 + yy^2 + zz^2 + c^2)
# iA = 1 / A[i,k]
# iA3 = iA^3
A1[i,k] = xx / A[i,k]
A2[i,k] = (1 - A1[i,k]^2) / A[i,k]
B1[i,k] = yy / A[i,k]
B2[i,k] = (1 - B1[i,k]^2) / A[i,k]
C1[i,k] = zz / A[i,k]
C2[i,k] = (1 - C1[i,k]^2) / A[i,k]
I cannot go into details on your whole code, but, for example, if you preallocate a buffer to contain those `A1` … `C2` arrays, you can make `fun` non-allocating. Since you are calling `fun` many times, this can make a huge difference in the final timing, but you need to reorganize the outer code (and maybe estimate the size of the buffers). For `fun` you can do:
@views function fun4(data, c, buff)
    x = data[:, 1]; y = data[:, 2]; z = data[:, 3]; n = length(x)
    A, A1, B1, C1, A2, B2, C2 = (buff[i] for i in 1:7)
    for k in 1:n
        @inbounds for i in 1:n
            xx = x[k] - x[i]
            yy = y[k] - y[i]
            zz = z[k] - z[i]
            A[i,k] = sqrt(xx^2 + yy^2 + zz^2 + c^2)
            iA = 1 / A[i,k]
            iA3 = iA^3
            A1[i,k] = xx * iA
            A2[i,k] = iA - xx^2 * iA3
            B1[i,k] = yy * iA
            B2[i,k] = iA - yy^2 * iA3
            C1[i,k] = zz * iA
            C2[i,k] = iA - zz^2 * iA3
        end
    end
    return A, A1, B1, C1, A2, B2, C2
end
To get:
julia> data = rand(1000,3);
julia> buff = [Matrix{Float64}(undef, size(data,1), size(data,1)) for _ in 1:7];
julia> @btime fun4($data, 1.5, $buff);
7.638 ms (0 allocations: 0 bytes)
If you find a way to eliminate the allocations that are happening in all hot parts of the code, the result will certainly be much, much faster.
When I disable parallel (I mean `@jit(nopython=True)`) in Python I get the following timings:
410 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I activate parallel I get:
118 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
lmiq, thank you very much for your suggestions. I will go through them. I hope I can obtain satisfactory improvements.
I haven’t read the whole thread, but I’m surprised to find no mention of profiling, which is the first thing you should do when trying to optimize your code. Many people here have proposed clever solutions, but profiling is the step where you figure out what the problem is. There are many tools for the job, but the `@profview` macro from the Julia VSCode extension is my favorite.
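For instance, outside VSCode the stdlib `Profile` gives the same data as text; a minimal sketch, assuming the `main(st, data, p, m)` call from this thread:

using Profile

@profile main(st, data, p, m)                     # collect stack samples
Profile.print(format = :flat, sortedby = :count)  # hottest lines first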
I had profiled the code before posting here, but the results of the profiling are very confusing to me. Using `@profview`, I can see that for large values of `N` the following lines are the most time-consuming:

- in `fun4` of lmiq: `iA3 = iA^3`
- in `fun2` of the original post: `a = [II x y z x ...]`
- in the `main` function of the original post: `@views ind = filter(j -> id[j] == i, axes(id, 1));`
One crude form of profiling is just to separate the work from the allocation of all these arrays. Doing this for `fun`, it looks like >90% of the time is spent allocating all those arrays. (Edit: close to what @lmiq said, I see.)
julia> @btime fun($data, $m) # data == PT, m == c
min 742.817 ms, mean 1.029 s (14 allocations, 608.33 MiB)
julia> @btime fun2!($A,$A1,$B1,$C1,$A2,$B2,$C2, $data, $m);
min 70.387 ms, mean 70.631 ms (0 allocations)
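The split itself is mechanical; a sketch of the pattern, where `alloc_bufs` is a hypothetical helper and `fun2!` is the in-place variant with the same loop body as `fun`:

# Hypothetical helper: allocate the seven n×n output matrices once, up front.
alloc_bufs(n) = ntuple(_ -> Matrix{Float64}(undef, n, n), 7)

A, A1, B1, C1, A2, B2, C2 = alloc_bufs(size(data, 1))
fun2!(A, A1, B1, C1, A2, B2, C2, data, m)  # fills the buffers in place, no allocations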
None of them are used after you make `L`. (In which the vcat `[A1; a1']` makes an array, indexing `[A1; a1'][:, ind]` makes another, and the overall `L` hcat makes another.) Maybe you can just avoid making these at all, and directly make `L`?
Aren’t the matrices in function `fun` symmetric? In that case, instead of looping over a square, you could just loop over a triangle.
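Something like the following sketch, assuming `fun4`’s definitions: `A`, `A2`, `B2`, `C2` are even in the coordinate differences and hence symmetric, while `A1`, `B1`, `C1` are odd (`xx = x[k] - x[i]` flips sign when `i` and `k` swap), so the mirrored entry needs a sign flip:

for k in 1:n
    for i in k:n  # lower triangle, diagonal included
        xx = x[k] - x[i]; yy = y[k] - y[i]; zz = z[k] - z[i]
        a = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        iA = 1 / a; iA3 = iA^3
        A[i,k]  = a;                 A[k,i]  = a          # symmetric
        A1[i,k] = xx * iA;           A1[k,i] = -A1[i,k]   # antisymmetric
        A2[i,k] = iA - xx^2 * iA3;   A2[k,i] = A2[i,k]
        B1[i,k] = yy * iA;           B1[k,i] = -B1[i,k]
        B2[i,k] = iA - yy^2 * iA3;   B2[k,i] = B2[i,k]
        C1[i,k] = zz * iA;           C1[k,i] = -C1[i,k]
        C2[i,k] = iA - zz^2 * iA3;   C2[k,i] = C2[i,k]
    end
end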
I think that `fun` and `fun2` need to be called only once. In fact, you can compute `L` and `R` before the loop as follows:
function compute_L_R(data::AbstractMatrix{<:Real}, c::Real)
    x = @view data[:, 1]
    y = @view data[:, 2]
    z = @view data[:, 3]
    T = eltype(x)
    n = length(x)
    L = zeros(T, n + 20, 7 * n)
    R = zeros(T, n + 20, n + 20)
    @views for i in 1:n, k in 1:n
        xx = x[k] - x[i]
        yy = y[k] - y[i]
        zz = z[k] - z[i]
        a = sqrt(xx^2 + yy^2 + zz^2 + c^2)
        ai = 1 / a
        a3 = 1 / a^3
        L[i, k] = a
        L[i, k+1n] = xx * ai
        L[i, k+2n] = yy * ai
        L[i, k+3n] = zz * ai
        L[i, k+4n] = ai - xx^2 * a3
        L[i, k+5n] = ai - yy^2 * a3
        L[i, k+6n] = ai - zz^2 * a3
    end
    # a, a1, ...
    rows = n .+ (1:20)
    o = zero(x)
    II = ones(size(x))
    L[rows, 1:n] = [II x y z x .^ 2 x .* y x .* z y .^ 2 y .* z z .^ 2 x .^ 3 x .^ 2 .* y x .^ 2 .* z x .* y .^ 2 x .* y .* z x .* z .^ 2 y .^ 3 y .^ 2 .* z y .* z .^ 2 z .^ 3]'
    L[rows, 1n.+(1:n)] = [o II o o x * 2 y z o o o 3 * x .^ 2 2 * x .* y 2 * x .* z y .^ 2 y .* z z .^ 2 o o o o]'
    L[rows, 2n.+(1:n)] = [o o II o o x o y * 2 z o o x .^ 2 o x .* y * 2 x .* z o 3 * y .^ 2 2 * y .* z z .^ 2 o]'
    L[rows, 3n.+(1:n)] = [o o o II o o x o y z * 2 o o x .^ 2 o x .* y x .* z * 2 o y .^ 2 y .* z * 2 3 * z .^ 2]'
    L[rows, 4n.+(1:n)] = [o o o o II * 2 o o o o o x * 6 2 * y 2 * z o o o o o o o]'
    L[rows, 5n.+(1:n)] = [o o o o o o o II * 2 o o o o o x * 2 o o y * 6 2 * z o o]'
    L[rows, 6n.+(1:n)] = [o o o o o o o o o II * 2 o o o o o x * 2 o o y * 2 z * 6]'
    # R
    @views R[:, 1:n] .= L[:, 1:n]
    @views R[1:n, (n+1):end] .= L[(n+1):end, 1:n]'
    return L, R
end
and then the main function can be “simplified” in the following way
function main2(st, data::AbstractMatrix{<:Real}, perm, m::Real)
    T = eltype(data)
    n = size(data, 1)
    LL, RR = compute_L_R(data, m)
    W, W11, W12, W13, W21, W22, W23 = (zeros(T, n, n) for _ in 1:7)
    @views for i in perm
        id = st[i]
        ind = filter(j -> id[j] == i, axes(id, 1))
        rows = vcat(id, n .+ (1:20))
        cols = id[ind] .+ (0:n:(6 * n))
        R = RR[rows, rows]
        L = LL[rows, cols]
        soln = (R \ L)[1:length(id), :]
        W[i, id] = soln[:, 1]
        W11[i, id] = soln[:, 2]
        W12[i, id] = soln[:, 3]
        W13[i, id] = soln[:, 4]
        W21[i, id] = soln[:, 5]
        W22[i, id] = soln[:, 6]
        W23[i, id] = soln[:, 7]
    end
    return W, W11, W12, W13, W21, W22, W23
end
This function returns different results than the original one. I tested that `L` and `R` in both implementations are almost the same (in the sense that `isapprox` is true), but the inversion (`R\L`) differs quite a lot. Maybe I made some mistakes in the code, but the idea should be correct.
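A hypothetical way to narrow this down is to check the conditioning of the local systems: an ill-conditioned `R` amplifies tiny differences in the inputs into large differences in `R\L`. Here `R_old` and `L_old` are stand-ins for the matrices from the original implementation:

using LinearAlgebra

@show cond(R)                           # a large condition number makes R \ L fragile
@show norm(R - R_old), norm(L - L_old)  # to what tolerance do the inputs agree?
@show norm(R \ L - R_old \ L_old)       # how far apart the two solutions really are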
julia> @time main(st, data, p, m);
99.148773 seconds (884.86 k allocations: 189.158 GiB, 2.14% gc time)
julia> @time main2(st, data, p, m);
7.372209 seconds (89.43 k allocations: 3.495 GiB, 2.27% gc time)
I got the biggest performance improvement when I replaced `spzeros` with `zeros`, but I don’t know why.
Thank you VaclavMacha,
indeed, without using your implementation but just changing `spzeros` to `zeros` in the most updated code here, I obtained the same timings as yours. But I had thought using `spzeros` would decrease memory consumption. Strange.
I will try your code as well.
For performance reasons, you want to completely avoid indexing into and incrementally growing a sparse array. Even if you are lucky and your sparse array has ordered entries (typically not true), you still have to search for the relevant entries when indexing into it. This is especially true when accessing by row, as sparse arrays store consecutive entries by column. This operation is at best O(log(n)). Next, any time you add new entries to a sparse array, it must allocate the necessary additional space and copy/shift over the displaced indices and values.
So the best way to make use of sparse arrays in this circumstance seems to be to collect portions of the solution into a buffer (for example with `push!()`) and then call `sparse` once on this buffer (see Sparse Arrays · The Julia Language).
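A minimal sketch of that pattern, where `entries` is a stand-in for however you produce the (row, column, value) pieces of the solution:

using SparseArrays

Is, Js, Vs = Int[], Int[], Float64[]
for (i, j, v) in entries          # stand-in iterable of (row, col, value) triplets
    push!(Is, i); push!(Js, j); push!(Vs, v)
end
W = sparse(Is, Js, Vs, n, n)      # build the n×n sparse matrix in one call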
For a few days, I have been trying to optimize this code with the help of many people here (thanks to all). In Python, I did not do any optimization but just used numba, and I got very good timings compared to Julia. My point is: to have fast Julia code you have to spend a lot of time on optimization, but this is not true in Python; you just write your code and it is generally almost optimized. So, is it clever to use Julia if you have to spend days to get optimized Julia code? Isn’t that a disadvantage?
No, you are just new to Julia and need to learn different approaches to programming. In particular, directly translating between languages is often problematic, because techniques that are advantageous in one language are disadvantageous in another (such as row- vs column-major storage, or slices vs looping).
And furthermore, did we not just establish that the whole performance difference is due to using parallelism in Python but not in Julia?
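For completeness, here is a hypothetical threaded variant of `fun4` (the name `fun4_threads` is mine). Threading the outer loop is safe because iteration `k` writes only to column `k` of each buffer; start Julia with `julia -t auto` for it to use multiple threads:

@views function fun4_threads(data, c, buff)
    x = data[:, 1]; y = data[:, 2]; z = data[:, 3]; n = length(x)
    A, A1, B1, C1, A2, B2, C2 = (buff[i] for i in 1:7)
    Threads.@threads for k in 1:n
        for i in 1:n
            xx = x[k] - x[i]; yy = y[k] - y[i]; zz = z[k] - z[i]
            a = sqrt(xx^2 + yy^2 + zz^2 + c^2)
            iA = 1 / a; iA3 = iA^3
            A[i, k]  = a
            A1[i, k] = xx * iA;  A2[i, k] = iA - xx^2 * iA3
            B1[i, k] = yy * iA;  B2[i, k] = iA - yy^2 * iA3
            C1[i, k] = zz * iA;  C2[i, k] = iA - zz^2 * iA3
        end
    end
    return A, A1, B1, C1, A2, B2, C2
end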
Using `@inbounds` with `1:n` is very, very dangerous. What if `data` has offset indices?
Never use `@inbounds` unless you can provably guarantee that the indices are valid. For example, using `1:n` is OK with a `zeros(n, n)` array (though it is still, imho, poor style), but it is not OK for a general `AbstractMatrix`.
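A sketch of a safer pattern: derive the loop ranges from the array itself, so the bounds are provably valid even for offset-indexed inputs (`fill_safely!` is a hypothetical example function):

# axes(A, d) returns the valid index range along dimension d for any
# AbstractMatrix, including offset-indexed ones, so @inbounds is safe here.
function fill_safely!(A::AbstractMatrix, v)
    for k in axes(A, 2)
        @inbounds for i in axes(A, 1)
            A[i, k] = v  # in bounds by construction: i and k come from axes(A)
        end
    end
    return A
end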