Hi,
The following function is used to apply a calibration captured in `correction`

to a matrix of 3D sample `points`

.

```
using BenchmarkTools
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)
```

Result:

```
julia> @btime correct(points,correction)
1.239 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
0.773386 0.983891 1.03759
0.237514 1.34664 1.31222
0.080002 0.822563 1.03253
0.11178 0.729039 1.40625
0.14609 1.10849 1.07391
...
```

I get the feeling that this is far from optimal, so I conjure the Julia hive mind to come up with a faster solution if possible.

Thanks,
Gerhard

What happens if you interpolate the variables in your benchmark?
I.e.
`@btime correct($points,$correction)`

In general though I don’t think there is too much room for improvement. On first glance it seems that your code will spend a large majority of the time in the matrix multiplication which will call optimized BLAS routines.

1 Like

There’s no change noticeable change if I interpolate the variables:

```
1.276 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
```

In the meantime I found a related SO entry:

julia

I might try to do this manually…

laborg:

```
using BenchmarkTools
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)
```

If the correction has always the same structure with 0 and 1, you do a lot of unnecessary calculations and the following will provide the same result

```
function correctb(points, correction)
myv=correction[1:3,4]'
return points .+ myv
end
```

```
using BenchmarkTools
using LinearAlgebra
using LoopVectorization
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@inbounds for i ∈ 1:ny
@turbo for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
ref = correct(points,correction)
res = correct_loop(points,correction)
@show norm(ref .- res)
@btime correct_loop(points,correction)
```

```
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
220.417 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
0.853402 0.893323 0.662189
0.337794 1.45835 0.739322
0.0468517 0.901759 0.844845
0.851418 1.23569 0.666883
0.559083 1.11187 1.3632
0.721222 0.641998 0.685349
```

Good point, my example of `correction`

isn’t representative. All fields can be different from 0 and 1.
Fastest version so far is by adding a `@view`

:

```
function correctc(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return @view (correction * epoints')'[:, 1:3]
end
```

Result:

```
842.103 μs (6 allocations: 6.87 MiB)
100000×3 view(adjoint(::Matrix{Float64}), :, 1:3) with eltype Float64:
```

Nice!

```
709.565 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
```

```
using BenchmarkTools
using LinearAlgebra
using LoopVectorization
function correct(points, correction)
epoints = hcat(points, ones(size(points, 1)))
return (correction * epoints')'[:, 1:3]
end
function correct_loop(points, correction)
result = similar(points)
nx,ny=size(points)
transposed_epoints = hcat(points, ones(size(points, 1)))'
transposed_correction = correction'
@turbo for i ∈ 1:ny
for j ∈ 1:nx
sum=0.0
for k ∈ 1:4
sum+=transposed_correction[k,i]*transposed_epoints[k,j]
end
result[j,i]=sum
end
end
result
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
ref = correct(points,correction)
res = correct_loop(points,correction)
@show norm(ref .- res)
@btime correct($points,$correction);
@btime correct_loop($points,$correction);
```

```
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
557.417 μs (8 allocations: 9.16 MiB) # original correct
187.333 μs (6 allocations: 6.10 MiB) # correct_loop
100000×3 Matrix{Float64}:
0.57574 0.64486 1.20726
0.439505 1.32538 0.792494
0.24606 1.11793 0.880008
0.665634 0.67445 0.743179
0.072872 1.14395 0.95964
```

(Macbook pro M1 max) (no inbounds check)

2 Likes

jmair
July 5, 2023, 8:29am
9
Not sure if this is much better in terms of speed, but it has fewer allocations:

```
using BenchmarkTools
using StaticArrays
function correct_fast(points, correction)
output = similar(points)
@inbounds for m in 1:size(points, 1)
for n in 1:size(points, 2)
c_mn = correction[n, size(correction, 2)]
for k in 1:size(points, 2)
c_mn += points[m, k] * correction[k, n]
end
output[m, n] = c_mn
end
end
return output
end
```

This could be still be improved, but the results are:

```
julia> points = rand(Float64,(100000,3));
julia> correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1];
julia> @btime correct($points, $correction)
1.577 ms (8 allocations: 9.16 MiB)
julia> @btime correct_fast($points, $(SMatrix{4,4}(correction)))
590.079 μs (2 allocations: 2.29 MiB)
julia> correct(points, correction) ≈ correct_fast(points, (SMatrix{4,4}(correction)))
true
```

You can probably use @LaurentPlagne ’s suggestion with `LoopVectorization.jl`

to make this even faster.

EDIT: Pasted the right code this time

1 Like

laborg
July 5, 2023, 8:41am
10
Your

jmair:

`correct_fast`

doesn’t work for me:

```
julia> correct(points, correction) ≈ correct_fast(points, SMatrix{4,4}(correction))
false
```

Where is the `correct_fast_no_t()`

?

Dan
July 5, 2023, 9:50am
11
How about:

```
function correct2(points, correction)
res = points * (@view correction[1:3, 1:3])'
res .+= (@view correction[1:3, 4])'
return res
end
```

about 4x faster (and 4x less allocation) than original, and perhaps even easier to understand code (??).

```
julia> @btime correct($points,$correction);
2.867 ms (8 allocations: 9.16 MiB)
julia> @btime correct2($points,$correction);
726.598 μs (2 allocations: 2.29 MiB)
```

If the original `points`

matrix is not needed afterwards, there is a bit more optimization to squeeze by in-place calculation of correction.

4 Likes

Elegant indeed. Note that this version involving BLAS gemm is multithreaded and slower than the loop on my machine.

jmair
July 5, 2023, 2:08pm
13
I pasted the wrong function, my bad!