Very slow execution time in comparison even to Python

I would be a bit careful about using SMatrix here. SMatrix is for cases where the compiler can be sure of the sizes, and the sizes should be small. In this particular case POINTS_PER_BATCH is small and fixed, but at least be aware of this.

1 Like

I would be a bit careful about using SMatrix here. SMatrix is for cases where the compiler can be sure of the sizes, and the sizes should be small. In this particular case POINTS_PER_BATCH is small and fixed, but at least be aware of this.

For this example everything is known, since POINTS_PER_BATCH is known at compile time, but in general this can lead to problems. I am aware of this.

How fast does it run if you use the above-mentioned changes in your code?

It’s even better than that. Since MM' is symmetric and positive definite, the SVD is equivalent to the eigen factorization, and StaticArrays has a specialized eigen for 3x3 matrices. The only problem is that svd and eigen don’t use the same ordering of singular/eigenvalues and vectors, so you need to take that into account.
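
A minimal stdlib-only sketch of that correspondence (plain arrays here; the same relation holds for SMatrix): the eigenvalues of A*A' are the squared singular values of A, and eigen (ascending) orders them in reverse of svd (descending).

```julia
using LinearAlgebra, Random

Random.seed!(1)
A = randn(3, 8)
F = svd(A)                   # singular values: decreasing order
E = eigen(Symmetric(A * A')) # eigenvalues: increasing order

# eigenvalues of A*A' are the squared singular values of A, in reverse
@assert E.values ≈ reverse(F.S .^ 2)

# each eigenvector matches the corresponding left singular vector up to sign
for j in 1:3
    u, v = F.U[:, j], E.vectors[:, 4 - j]
    @assert u ≈ v || u ≈ -v
end
```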

4 Likes

After playing around with this a bit, I think it should be possible to get the runtime down to approximately 1 second + printing (on my old laptop) for a batch size of 10^6. Then multithreading comes on top of that.

That’s compared to an original runtime of 18 minutes, including printing and compilation.

10 Likes

That does indeed work. It seems the eigenvalues and corresponding vectors are ordered in reverse of what the SVD does. So here is my updated code. The speedup does not depend much on POINTS_PER_BATCH: for POINTS_PER_BATCH of 8 and 256, the “fast SVD trick” provides a speedup of about 2x.

using LinearAlgebra, Statistics, Random, StaticArrays, BenchmarkTools

struct Plane
    point::SVector{3,Float64}
    normal::SVector{3,Float64}
end

function mean_center(points::SMatrix)
    centroid = mean(points,dims=2)
    points_centered = points .- centroid
    centroid, points_centered
end

function best_fit_plane(points::SMatrix)
    centroid, centered_points = mean_center(points)

    # slowest
    #u = svd(centered_points).U[:,end]

    # better
    #m = centered_points*transpose(centered_points)
    #u = svd(m).U[:,end]

    # best
    m = Symmetric(centered_points*transpose(centered_points))
    u = eigen(m).vectors[:,1]

    Plane(centroid, u)
end

function main()
    MIN = -2000
    MAX = 2000
    POINTS_PER_BATCH = 2^3 # or 2^8
    BATCHES_COUNT = div(2^16, POINTS_PER_BATCH)

    generateBatch() = MIN .+ (MAX - MIN)*rand(SMatrix{3,POINTS_PER_BATCH,Float64})

    out = Vector{Plane}(undef, BATCHES_COUNT) # initialize output
    for i in eachindex(out)
        b = generateBatch()
        out[i] = best_fit_plane(b)
    end
    return out
end

@benchmark main()

We start off with POINTS_PER_BATCH = 8, using the slowest approach:

BenchmarkTools.Trial: 
  memory estimate:  62.38 MiB
  allocs estimate:  434178
  --------------
  minimum time:     129.477 ms (2.97% GC)
  median time:      139.976 ms (2.76% GC)
  mean time:        146.015 ms (3.71% GC)
  maximum time:     214.297 ms (1.98% GC)
  --------------
  samples:          35
  evals/sample:     1

The “better” approach (SVD of the 3×3 product) is slightly faster:

BenchmarkTools.Trial: 
  memory estimate:  58.63 MiB
  allocs estimate:  450562
  --------------
  minimum time:     111.597 ms (3.00% GC)
  median time:      116.901 ms (3.71% GC)
  mean time:        122.655 ms (4.18% GC)
  maximum time:     201.346 ms (4.26% GC)
  --------------
  samples:          41
  evals/sample:     1

The “best” (eigen-based) approach yields the best results:

BenchmarkTools.Trial: 
  memory estimate:  36.88 MiB
  allocs estimate:  393218
  --------------
  minimum time:     57.961 ms (0.00% GC)
  median time:      66.490 ms (6.86% GC)
  mean time:        67.735 ms (6.30% GC)
  maximum time:     87.808 ms (7.00% GC)
  --------------
  samples:          74
  evals/sample:     1

With the larger POINTS_PER_BATCH = 256, the slowest approach gives:

BenchmarkTools.Trial: 
  memory estimate:  25.48 MiB
  allocs estimate:  13825
  --------------
  minimum time:     14.025 ms (0.00% GC)
  median time:      17.141 ms (11.42% GC)
  mean time:        16.768 ms (7.38% GC)
  maximum time:     25.514 ms (9.19% GC)
  --------------
  samples:          299
  evals/sample:     1

The “better” approach (SVD of the 3×3 product) is again slightly faster:

BenchmarkTools.Trial: 
  memory estimate:  19.49 MiB
  allocs estimate:  14337
  --------------
  minimum time:     8.721 ms (0.00% GC)
  median time:      11.221 ms (0.00% GC)
  mean time:        11.340 ms (8.76% GC)
  maximum time:     17.238 ms (14.03% GC)
  --------------
  samples:          441
  evals/sample:     1

The “best” (eigen-based) approach again yields the best results:

BenchmarkTools.Trial: 
  memory estimate:  18.81 MiB
  allocs estimate:  12545
  --------------
  minimum time:     6.666 ms (0.00% GC)
  median time:      8.965 ms (0.00% GC)
  mean time:        8.979 ms (10.28% GC)
  maximum time:     14.986 ms (26.97% GC)
  --------------
  samples:          557
  evals/sample:     1

Around 1 sec on my machine for a batch size of 2^20 and POINTS_PER_BATCH = 8. No parallelization either.

2 Likes

It looks to me like mean(points; dims=Val(2)) improves performance slightly (actually, it dramatically speeds up mean_center, but the overall performance isn’t much affected).
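
A sketch of that suggestion (assuming StaticArrays is loaded): dims=Val(2) passes the reduction dimension as a compile-time constant, which StaticArrays can specialize on, while the result is the same.

```julia
using Statistics, StaticArrays

pts = rand(SMatrix{3,8,Float64})

c_runtime = mean(pts; dims=2)       # dims as a runtime integer
c_static  = mean(pts; dims=Val(2))  # dims as a compile-time constant
@assert c_runtime ≈ c_static
```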

I think it should be the last, not the first eigenvector. Also, I think it’s better to not access the fields like that. You can either use

_, V = eigen(m)
u = V[:, end]

or

u = eigvecs(m)[:, end]
1 Like

Cunningham’s Law states “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.” I think this thread suggests a new corollary :wink:

13 Likes

This is far from the first time unoptimized code posted here has been optimized to death. The main surprising part is that so far this thread hasn’t gotten to the typical 1000x speedup :wink:

9 Likes

I have tried this out, and it seems eigen and svd use a different order: svd puts the largest singular value first, whereas eigen of a symmetric 3×3 matrix puts the largest eigenvalue last.

using Test, LinearAlgebra, StaticArrays
a = rand(SMatrix{3,8,Float64})
U, = svd(a)
λs, Ue = eigen(Symmetric(a*transpose(a)))
@test U[:,end] ≈ Ue[:,1] || U[:,end] ≈ -Ue[:,1] # eigenvectors are only defined up to sign
1 Like

Sharing a version “slower than Python” on the Julia Discourse is the best way to get the fastest version of your algorithm ever written?

16 Likes

I know, that’s why I said it. But I managed to swap the two cases in my mind :man_facepalming:

1 Like

Yeah, sounds like “an idiomatic way to get a lot of support”. Really unexpected attention to our small issue!

Thanks to all of you, guys! Now we are carefully checking all your suggestions. Of course, the most time-consuming call is println to the terminal.

BTW, the code is now 10x faster than the Haskell version; we will try to pull the same trick on the Haskell Discourse.

15 Likes

Could you share the specifics of how you did the benchmark in the end?

Sure, will post tomorrow. First of all, we are going to rewrite all three code bases to exclude random generation and terminal output. Next, we are going to replace Haskell’s String with Text, because String in Haskell is a linked list of Chars.

Ahhh - recursive discourse sniping.

“Very slow execution time in comparison to Julia”

9 Likes

While splitting the procedure into random data generation and the core computation, I remembered that each call to rand has some overhead, as stated in the documentation, so I replaced the per-batch calls with a single call. Now we are yet again faster.
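
The refactor can be sketched with plain arrays (the post itself draws SMatrix batches via rand(SMatrix{3,N,Float64}, BATCHES_COUNT); the names below are illustrative only): one bulk rand call replaces a call per batch.

```julia
using Random

rng = MersenneTwister(42)

# one rand call per batch: per-call overhead is paid repeatedly
per_call = [rand(rng, 3, 8) for _ in 1:4]

# a single bulk call producing all batches at once
bulk = rand(rng, 3, 8, 4)

@assert length(per_call) == 4 && size(bulk) == (3, 8, 4)
@assert all(x -> all(0 .<= x .< 1), per_call) && all(0 .<= bulk .< 1)
```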

using LinearAlgebra, Statistics, Random, StaticArrays, BenchmarkTools

struct Plane{M,T}
    point::SVector{M,T}
    normal::SVector{M,T}
end

function mean_center(points::SMatrix{M,N,T}) where {M,N,T}
    centroid = mean(points,dims=2)
    points_centered = points .- centroid
    centroid, points_centered
end

function best_fit_plane(points::SMatrix{M,N,T}) where {M,N,T}
    centroid, centered_points = mean_center(points)
    m = Symmetric(centered_points*transpose(centered_points))
    λs, U = eigen(m)
    Plane{M,T}(centroid, U[:,1])
end

function generateRandomBatchData()
    MIN = -2000
    MAX = 2000
    POINTS_PER_BATCH = 2^3
    BATCHES_COUNT = div(800000, POINTS_PER_BATCH)
    DIM = 3
    ELEMENT_TYPE = Float64

    randomBatchData = MIN .+ (MAX - MIN)*rand(SMatrix{DIM,POINTS_PER_BATCH,ELEMENT_TYPE}, BATCHES_COUNT)
    return randomBatchData
end

function coreComputation(batchData)
    return map(best_fit_plane, batchData)
end

The data generation is now significantly faster than the main computation, just by eliminating the repeated calls to rand.

@benchmark generateRandomBatchData()
BenchmarkTools.Trial: 
  memory estimate:  80.10 MiB
  allocs estimate:  399500
  --------------
  minimum time:     23.918 ms (0.00% GC)
  median time:      46.357 ms (0.00% GC)
  mean time:        45.681 ms (16.30% GC)
  maximum time:     94.225 ms (25.72% GC)
  --------------
  samples:          110
  evals/sample:     1

The main computation is now well below 1 s without multi-threading:

batchData = generateRandomBatchData()
@benchmark coreComputation($batchData)
BenchmarkTools.Trial: 
  memory estimate:  215.15 MiB
  allocs estimate:  1900002
  --------------
  minimum time:     326.180 ms (7.11% GC)
  median time:      355.630 ms (6.67% GC)
  mean time:        357.212 ms (8.27% GC)
  maximum time:     418.457 ms (5.71% GC)
  --------------
  samples:          14
  evals/sample:     1

So, at best, something around 350 ms on my laptop, which is not the fastest machine.
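
As a follow-up to the multithreading mentioned upthread, here is a hedged sketch of a threaded drop-in for coreComputation (threaded_map is a hypothetical helper, stdlib only; it assumes f is thread-safe, which best_fit_plane is, since it only reads its argument):

```julia
using Base.Threads

# hypothetical threaded analogue of map(f, xs); start Julia with
# `julia -t auto` (or set JULIA_NUM_THREADS) for actual parallelism
function threaded_map(f, xs::AbstractVector)
    out = Vector{Base.promote_op(f, eltype(xs))}(undef, length(xs))
    @threads for i in eachindex(xs)
        out[i] = f(xs[i])
    end
    return out
end

data = [rand(3, 8) for _ in 1:100]
@assert threaded_map(sum, data) ≈ map(sum, data)
```

With a single thread this reduces to a plain loop, so correctness is unaffected by the thread count.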

6 Likes