Hi,
The following function is used to apply a calibration captured in correction
to a matrix of 3D sample points
.
using BenchmarkTools
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)
Result:
julia> @btime correct(points,correction)
1.239 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
0.773386 0.983891 1.03759
0.237514 1.34664 1.31222
0.080002 0.822563 1.03253
0.11178 0.729039 1.40625
0.14609 1.10849 1.07391
...
I get the feeling that this is far from optimal, so I conjure the Julia hive mind to come up with a faster solution if possible.
Thanks,
Gerhard
What happens if you interpolate the variables in your benchmark?
I.e.
@btime correct($points,$correction)
In general though I don’t think there is too much room for improvement. On first glance it seems that your code will spend a large majority of the time in the matrix multiplication which will call optimized BLAS routines.
1 Like
There’s no change noticeable change if I interpolate the variables:
1.276 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
In the meantime I found a related SO entry:
julia
I might try to do this manually…
laborg:
using BenchmarkTools
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)
If the correction has always the same structure with 0 and 1, you do a lot of unnecessary calculations and the following will provide the same result
function correctb(points, correction)
myv=correction[1:3,4]'
return points .+ myv
end
using BenchmarkTools
using LinearAlgebra
using LoopVectorization
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@inbounds for i ∈ 1:ny
@turbo for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
ref = correct(points,correction)
res = correct_loop(points,correction)
@show norm(ref .- res)
@btime correct_loop(points,correction)
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
220.417 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
0.853402 0.893323 0.662189
0.337794 1.45835 0.739322
0.0468517 0.901759 0.844845
0.851418 1.23569 0.666883
0.559083 1.11187 1.3632
0.721222 0.641998 0.685349
Good point, my example of correction
isn’t representative. All fields can be different from 0 and 1.
Fastest version so far is by adding a @view
:
function correctc(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return @view (correction * epoints')'[:, 1:3]
end
Result:
842.103 μs (6 allocations: 6.87 MiB)
100000×3 view(adjoint(::Matrix{Float64}), :, 1:3) with eltype Float64:
Nice!
709.565 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
using BenchmarkTools
using LinearAlgebra
using LoopVectorization
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@turbo for i ∈ 1:ny
for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
ref = correct(points,correction)
res = correct_loop(points,correction)
@show norm(ref .- res)
@btime correct($points,$correction);
@btime correct_loop($points,$correction);
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
557.417 μs (8 allocations: 9.16 MiB) # original correct
187.333 μs (6 allocations: 6.10 MiB) # correct_loop
100000×3 Matrix{Float64}:
0.57574 0.64486 1.20726
0.439505 1.32538 0.792494
0.24606 1.11793 0.880008
0.665634 0.67445 0.743179
0.072872 1.14395 0.95964
(Macbook pro M1 max) (no inbounds check)
2 Likes
jmair
July 5, 2023, 8:29am
9
Not sure if this is much better in terms of speed, but it has fewer allocations:
using BenchmarkTools
using StaticArrays
function correct_fast(points, correction)
output = similar(points)
@inbounds for m in 1:size(points, 1)
for n in 1:size(points, 2)
c_mn = correction[n, size(correction, 2)]
for k in 1:size(points, 2)
c_mn += points[m, k] * correction[k, n]
end
output[m, n] = c_mn
end
end
return output
end
This could be still be improved, but the results are:
julia> points = rand(Float64,(100000,3));
julia> correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1];
julia> @btime correct($points, $correction)
1.577 ms (8 allocations: 9.16 MiB)
julia> @btime correct_fast($points, $(SMatrix{4,4}(correction)))
590.079 μs (2 allocations: 2.29 MiB)
julia> correct(points, correction) ≈ correct_fast(points, (SMatrix{4,4}(correction)))
true
You can probably use @LaurentPlagne ’s suggestion with LoopVectorization.jl
to make this even faster.
EDIT: Pasted the right code this time
1 Like
laborg
July 5, 2023, 8:41am
10
Your
jmair:
correct_fast
doesn’t work for me:
julia> correct(points, correction) ≈ correct_fast(points, SMatrix{4,4}(correction))
false
Where is the correct_fast_no_t()
?
Dan
July 5, 2023, 9:50am
11
How about:
function correct2(points, correction)
res = points * (@view correction[1:3, 1:3])'
res .+= (@view correction[1:3, 4])'
return res
end
about 4x faster (and 4x less allocation) than original, and perhaps even easier to understand code (??).
julia> @btime correct($points,$correction);
2.867 ms (8 allocations: 9.16 MiB)
julia> @btime correct2($points,$correction);
726.598 μs (2 allocations: 2.29 MiB)
If the original points
matrix is not needed afterwards, there is a bit more optimization to squeeze by in-place calculation of correction.
4 Likes
Elegant indeed. Note that this version involving BLAS gemm is multithreaded and slower than the loop on my machine.
jmair
July 5, 2023, 2:08pm
13
I pasted the wrong function, my bad!