# Help me improve this simple function

Hi,
The following function is used to apply a calibration captured in `correction` to a matrix of 3D sample `points`.

``````using BenchmarkTools
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)
``````

Result:

``````julia> @btime correct(points,correction)
1.239 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
0.773386   0.983891  1.03759
0.237514   1.34664   1.31222
0.080002   0.822563  1.03253
0.11178    0.729039  1.40625
0.14609    1.10849   1.07391
...
``````

I get the feeling that this is far from optimal, so I conjure the Julia hive mind to come up with a faster solution if possible.

Thanks,
Gerhard

What happens if you interpolate the variables in your benchmark?
I.e.
`@btime correct(\$points,\$correction)`
In general though I don’t think there is too much room for improvement. On first glance it seems that your code will spend a large majority of the time in the matrix multiplication which will call optimized BLAS routines.

1 Like

There’s no change noticeable change if I interpolate the variables:

``````  1.276 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
``````

In the meantime I found a related SO entry:

I might try to do this manually…

If the correction has always the same structure with 0 and 1, you do a lot of unnecessary calculations and the following will provide the same result

``````function correctb(points, correction)

myv=correction[1:3,4]'
return   points .+ myv

end
``````
``````using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@inbounds for i ∈ 1:ny
@turbo for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)

@btime correct_loop(points,correction)
``````
``````julia> include("src/toto.jl")
norm(ref .- res) = 0.0
220.417 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
0.853402   0.893323  0.662189
0.337794   1.45835   0.739322
0.0468517  0.901759  0.844845
0.851418   1.23569   0.666883
0.559083   1.11187   1.3632
0.721222   0.641998  0.685349
``````

Good point, my example of `correction` isn’t representative. All fields can be different from 0 and 1.
Fastest version so far is by adding a `@view`:

``````function correctc(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return @view (correction * epoints')'[:, 1:3]
end
``````

Result:

``````  842.103 μs (6 allocations: 6.87 MiB)
100000×3 view(adjoint(::Matrix{Float64}), :, 1:3) with eltype Float64:
``````

Nice!

``````  709.565 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
``````
``````using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@turbo for i ∈ 1:ny
for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)

@btime correct(\$points,\$correction);
@btime correct_loop(\$points,\$correction);
``````
``````julia> include("src/toto.jl")
norm(ref .- res) = 0.0
557.417 μs (8 allocations: 9.16 MiB) # original correct
187.333 μs (6 allocations: 6.10 MiB) # correct_loop
100000×3 Matrix{Float64}:
0.57574    0.64486   1.20726
0.439505   1.32538   0.792494
0.24606    1.11793   0.880008
0.665634   0.67445   0.743179
0.072872   1.14395   0.95964

``````

(Macbook pro M1 max) (no inbounds check)

2 Likes

Not sure if this is much better in terms of speed, but it has fewer allocations:

``````using BenchmarkTools
using StaticArrays
function correct_fast(points, correction)
output = similar(points)
@inbounds for m in 1:size(points, 1)
for n in 1:size(points, 2)
c_mn = correction[n, size(correction, 2)]
for k in 1:size(points, 2)
c_mn += points[m, k] * correction[k, n]
end
output[m, n] = c_mn
end
end
return output
end
``````

This could be still be improved, but the results are:

``````julia> points = rand(Float64,(100000,3));
julia> correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1];
julia> @btime correct(\$points, \$correction)
1.577 ms (8 allocations: 9.16 MiB)
julia> @btime correct_fast(\$points, \$(SMatrix{4,4}(correction)))
590.079 μs (2 allocations: 2.29 MiB)

julia> correct(points, correction) ≈ correct_fast(points, (SMatrix{4,4}(correction)))
true
``````

You can probably use @LaurentPlagne’s suggestion with `LoopVectorization.jl` to make this even faster.

EDIT: Pasted the right code this time 1 Like

Your

doesn’t work for me:

``````julia> correct(points, correction) ≈ correct_fast(points, SMatrix{4,4}(correction))
false
``````

Where is the `correct_fast_no_t()` ?

``````function correct2(points, correction)
res = points * (@view correction[1:3, 1:3])'
res .+= (@view correction[1:3, 4])'
return res
end
``````

about 4x faster (and 4x less allocation) than original, and perhaps even easier to understand code (??).

``````julia> @btime correct(\$points,\$correction);
2.867 ms (8 allocations: 9.16 MiB)

julia> @btime correct2(\$points,\$correction);
726.598 μs (2 allocations: 2.29 MiB)
``````

If the original `points` matrix is not needed afterwards, there is a bit more optimization to squeeze by in-place calculation of correction.

4 Likes

Elegant indeed. Note that this version involving BLAS gemm is multithreaded and slower than the loop on my machine.

I pasted the wrong function, my bad!