I would be a bit careful about using SMatrix here. SMatrix is for cases where the compiler can be sure of the sizes, and the sizes should be small. In this particular case POINTS_PER_BATCH is small and fixed, but at least be aware of this.
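A minimal illustration of that point (the names are only for this sketch): as long as the size is a compile-time constant, the SMatrix type is fully inferred; a size that is only known at runtime ends up in the type and defeats inference.

using StaticArrays

const POINTS_PER_BATCH = 8                        # compile-time constant: fine
batch = rand(SMatrix{3,POINTS_PER_BATCH,Float64}) # the size is part of the type

n = rand(4:16)                                    # only known at runtime
# rand(SMatrix{3,n,Float64}) would still run, but its type depends on the value
# of n, so callers cannot infer it; a plain Matrix is the better choice there.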
For this example everything is known, since POINTS_PER_BATCH is known at compile time, but in general this can lead to problems. I am aware of this.
How fast does this run if you use the above-mentioned changes in your code?
It's even better than that. Since MM' is symmetric and positive definite, the SVD is equivalent to the eigen factorization, and StaticArrays has a specialized eigen for 3x3 matrices. The only problem is that svd and eigen don't use the same ordering of singular/eigenvalues and vectors, so you need to take that into account.
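As a small sketch of that equivalence (variable names are only illustrative): the plane normal is the direction of least variance, i.e. the vector belonging to the smallest singular value of the centered points, or equivalently the smallest eigenvalue of the 3x3 Gram matrix, which is where the specialized eigen kicks in.

using LinearAlgebra, StaticArrays

P = rand(SMatrix{3,8,Float64})
P = P .- sum(P, dims=2) ./ 8        # center the points

n_svd = svd(P).U[:, end]            # svd sorts singular values in descending order

G = Symmetric(P * transpose(P))     # 3x3, symmetric and positive (semi)definite
n_eig = eigen(G).vectors[:, 1]      # eigen sorts eigenvalues in ascending order

abs(dot(n_svd, n_eig)) ≈ 1.0        # same direction, possibly up to sign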
After playing around with this a bit, I think it should be possible to get the runtime down to approximately 1 second + printing (on my old laptop) for a batch size of 10^6. Then multithreading comes on top of that.
That's compared to an original runtime of 18 minutes, including printing and compilation.
That does indeed work. It seems the eigenvalues and corresponding vectors are ordered in reverse of what the SVD does. So here is my updated code. The speedup does not depend much on POINTS_PER_BATCH: for POINTS_PER_BATCH of 8 and 256, the "fast SVD trick" provides a speedup of about 2x.
using LinearAlgebra, Statistics, Random, StaticArrays, BenchmarkTools

struct Plane
    point::SVector{3,Float64}
    normal::SVector{3,Float64}
end

function mean_center(points::SMatrix)
    centroid = mean(points, dims=2)
    points_centered = points .- centroid
    centroid, points_centered
end

function best_fit_plane(points::SMatrix)
    centroid, centered_points = mean_center(points)
    # slowest:
    #u = svd(centered_points).U[:,end]
    # better:
    #m = centered_points*transpose(centered_points)
    #u = svd(m).U[:,end]
    # best:
    m = Symmetric(centered_points*transpose(centered_points))
    u = eigen(m).vectors[:,1]
    Plane(centroid, u)
end

function main()
    MIN = -2000
    MAX = 2000
    POINTS_PER_BATCH = 2^3 # or 2^8
    BATCHES_COUNT = div(2^16, POINTS_PER_BATCH)
    generateBatch() = MIN .+ (MAX - MIN)*rand(SMatrix{3,POINTS_PER_BATCH,Float64})
    out = Vector{Plane}(undef, BATCHES_COUNT) # initialize output
    for i in eachindex(out)
        b = generateBatch()
        out[i] = best_fit_plane(b)
    end
    return out
end
@benchmark main()
We start off with POINTS_PER_BATCH = 8, using the slowest approach:
BenchmarkTools.Trial:
memory estimate: 62.38 MiB
allocs estimate: 434178
--------------
minimum time: 129.477 ms (2.97% GC)
median time: 139.976 ms (2.76% GC)
mean time: 146.015 ms (3.71% GC)
maximum time: 214.297 ms (1.98% GC)
--------------
samples: 35
evals/sample: 1
The "better" approach is slightly faster:
BenchmarkTools.Trial:
memory estimate: 58.63 MiB
allocs estimate: 450562
--------------
minimum time: 111.597 ms (3.00% GC)
median time: 116.901 ms (3.71% GC)
mean time: 122.655 ms (4.18% GC)
maximum time: 201.346 ms (4.26% GC)
--------------
samples: 41
evals/sample: 1
The "best" approach yields the best results:
BenchmarkTools.Trial:
memory estimate: 36.88 MiB
allocs estimate: 393218
--------------
minimum time: 57.961 ms (0.00% GC)
median time: 66.490 ms (6.86% GC)
mean time: 67.735 ms (6.30% GC)
maximum time: 87.808 ms (7.00% GC)
--------------
samples: 74
evals/sample: 1
We get better results with the larger POINTS_PER_BATCH = 256, again starting with the slowest approach:
BenchmarkTools.Trial:
memory estimate: 25.48 MiB
allocs estimate: 13825
--------------
minimum time: 14.025 ms (0.00% GC)
median time: 17.141 ms (11.42% GC)
mean time: 16.768 ms (7.38% GC)
maximum time: 25.514 ms (9.19% GC)
--------------
samples: 299
evals/sample: 1
The "better" approach is slightly faster:
BenchmarkTools.Trial:
memory estimate: 19.49 MiB
allocs estimate: 14337
--------------
minimum time: 8.721 ms (0.00% GC)
median time: 11.221 ms (0.00% GC)
mean time: 11.340 ms (8.76% GC)
maximum time: 17.238 ms (14.03% GC)
--------------
samples: 441
evals/sample: 1
The "best" approach yields the best results:
BenchmarkTools.Trial:
memory estimate: 18.81 MiB
allocs estimate: 12545
--------------
minimum time: 6.666 ms (0.00% GC)
median time: 8.965 ms (0.00% GC)
mean time: 8.979 ms (10.28% GC)
maximum time: 14.986 ms (26.97% GC)
--------------
samples: 557
evals/sample: 1
Around 1 second on my machine for a batch size of 2^20 and POINTS_PER_BATCH = 8. No parallelization either.
It looks to me like mean(points; dims=Val(2)) improves performance slightly (actually, it dramatically speeds up mean_center, but the overall performance isn't affected so much).
I think it should be the last, not the first eigenvector. Also, I think it's better not to access the fields like that. You can either use
_, V = eigen(m)
u = V[:, end]
or
u = eigvecs(m)[:, end]
Cunningham's Law states "the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer." I think this thread suggests a new corollary.
This is far from the first time unoptimized code posted here has been optimized to death. The main surprising part is that so far this thread hasn't gotten to the typical 1000x speedup.
I have tried this out and it seems eigen and svd use a different ordering: svd puts the largest singular value first, whereas eigen of a symmetric 3×3 matrix seems to put the largest eigenvalue last.
using Test, LinearAlgebra, StaticArrays
a = rand(SMatrix{3,8,Float64})
U, = svd(a)
λs, Ue = eigen(Symmetric(a*transpose(a)))
@test U[:,end] ≈ -Ue[:,1] # the overall sign of an eigenvector is arbitrary, so it may come out flipped
Sharing a version "slower than Python" on the Julia Discourse is the best way to get the fastest version of your algorithm ever written?
I know, that's why I said it. But I managed to swap the two cases in my mind.
Yeah, sounds like "an idiomatic way to get a lot of support". Really unexpected attention to our small issue…
Thanks to all of you, guys! Now we are carefully checking all your suggestions. Of course, the most time-consuming call is println to the terminal.
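That part is outside the code shown in this thread, but one common way to cut the printing cost is to assemble the output in an IOBuffer and write it to the terminal in a single call; print_planes and the formatting below are only a sketch built on the Plane struct from the posts above.

function print_planes(io::IO, planes)
    buf = IOBuffer()
    for p in planes
        println(buf, p.point, '\t', p.normal)   # format into memory, not the terminal
    end
    write(io, take!(buf))                       # a single write to the terminal
    return nothing
end

# usage: print_planes(stdout, out)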
BTW, the code is now 10x faster than the Haskell version; we will try to do the same trick on the Haskell Discourse.
Could you share the specifics on how you did the benchmark in the end?
Sure, will post tomorrow. First of all, we are going to rewrite all three code bases to exclude random generation and writing to the terminal. The next thing we are going to do is replace Haskell's String with Text, because String in Haskell is a linked list of Chars.
Ahhh - recursive discourse sniping.
"Very slow execution time in comparison to Julia"
While splitting the procedure up into random data generation and the core computation, I remembered that each call to rand has some overhead, as stated in the documentation, so I replaced the repeated calls with a single one. Now we are yet again faster.
using LinearAlgebra, Statistics, Random, StaticArrays, BenchmarkTools

struct Plane{M,T}
    point::SVector{M,T}
    normal::SVector{M,T}
end

function mean_center(points::SMatrix{M,N,T}) where {M,N,T}
    centroid = mean(points, dims=2)
    points_centered = points .- centroid
    centroid, points_centered
end

function best_fit_plane(points::SMatrix{M,N,T}) where {M,N,T}
    centroid, centered_points = mean_center(points)
    m = Symmetric(centered_points*transpose(centered_points))
    λs, U = eigen(m)
    Plane{M,T}(centroid, U[:,1])
end

function generateRandomBatchData()
    MIN = -2000
    MAX = 2000
    POINTS_PER_BATCH = 2^3
    BATCHES_COUNT = div(800000, POINTS_PER_BATCH)
    DIM = 3
    ELEMENT_TYPE = Float64
    # a single call to rand for all batches, then rescale each batch to [MIN, MAX)
    randomBatchData = [MIN .+ (MAX - MIN) .* b for b in rand(SMatrix{DIM,POINTS_PER_BATCH,ELEMENT_TYPE}, BATCHES_COUNT)]
    return randomBatchData
end

function coreComputation(batchData)
    return map(best_fit_plane, batchData)
end
The data generation is now significantly faster than the main computation, just by eliminating the repeated calls to rand.
@benchmark generateRandomBatchData()
BenchmarkTools.Trial:
memory estimate: 80.10 MiB
allocs estimate: 399500
--------------
minimum time: 23.918 ms (0.00% GC)
median time: 46.357 ms (0.00% GC)
mean time: 45.681 ms (16.30% GC)
maximum time: 94.225 ms (25.72% GC)
--------------
samples: 110
evals/sample: 1
The main computation is now well below 1 s without multi-threading:
batchData = generateRandomBatchData()
@benchmark coreComputation($batchData)
BenchmarkTools.Trial:
memory estimate: 215.15 MiB
allocs estimate: 1900002
--------------
minimum time: 326.180 ms (7.11% GC)
median time: 355.630 ms (6.67% GC)
mean time: 357.212 ms (8.27% GC)
maximum time: 418.457 ms (5.71% GC)
--------------
samples: 14
evals/sample: 1
So, at best, something around 350 ms on my laptop, which is not the fastest machine.
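Since all of the numbers above are single-threaded, here is a hedged sketch of how multithreading could be layered on top of the same coreComputation (coreComputationThreaded is an illustrative name; start Julia with several threads, e.g. julia -t auto, for this to help):

using Base.Threads

function coreComputationThreaded(batchData)
    out = Vector{Plane{3,Float64}}(undef, length(batchData))
    @threads for i in eachindex(batchData)
        out[i] = best_fit_plane(batchData[i])   # batches are independent, so this is safe
    end
    return out
end

# @benchmark coreComputationThreaded($batchData)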