Help me improve this simple function

Hi,
The following function is used to apply a calibration captured in correction to a matrix of 3D sample points.

using BenchmarkTools
function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)

Result:

julia> @btime correct(points,correction)
  1.239 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
  0.773386   0.983891  1.03759
  0.237514   1.34664   1.31222
  0.080002   0.822563  1.03253
  0.11178    0.729039  1.40625
  0.14609    1.10849   1.07391
...

I get the feeling that this is far from optimal, so I conjure the Julia hive mind to come up with a faster solution if possible.

Thanks,
Gerhard

What happens if you interpolate the variables in your benchmark?
I.e.
@btime correct($points,$correction)
In general though I don’t think there is too much room for improvement. On first glance it seems that your code will spend a large majority of the time in the matrix multiplication which will call optimized BLAS routines.

1 Like

There’s no change noticeable change if I interpolate the variables:

  1.276 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:

In the meantime I found a related SO entry:

I might try to do this manually…

If the correction has always the same structure with 0 and 1, you do a lot of unnecessary calculations and the following will provide the same result

function correctb(points, correction)
    
    myv=correction[1:3,4]'
    return   points .+ myv
    
end
using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
    result = similar(points)
    nx,ny=size(points)
    transposed_epoints = hcat(points, ones(size(points, 1)))'
    transposed_correction = correction'
    @inbounds for i ∈ 1:ny
        @turbo for j ∈ 1:nx
            sum=0.0
            for k ∈ 1:4
                sum+=transposed_correction[k,i]*transposed_epoints[k,j]
            end
            result[j,i]=sum
        end
    end
    result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)


@btime correct_loop(points,correction)
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
  220.417 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
 0.853402   0.893323  0.662189
 0.337794   1.45835   0.739322
 0.0468517  0.901759  0.844845
 0.851418   1.23569   0.666883
 0.559083   1.11187   1.3632
 0.721222   0.641998  0.685349

Good point, my example of correction isn’t representative. All fields can be different from 0 and 1.
Fastest version so far is by adding a @view:

function correctc(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return @view (correction * epoints')'[:, 1:3]
end

Result:

  842.103 μs (6 allocations: 6.87 MiB)
100000×3 view(adjoint(::Matrix{Float64}), :, 1:3) with eltype Float64:

Nice!

  709.565 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
    result = similar(points)
    nx,ny=size(points)
    transposed_epoints = hcat(points, ones(size(points, 1)))'
    transposed_correction = correction'
    @turbo for i ∈ 1:ny
        for j ∈ 1:nx
            sum=0.0
            for k ∈ 1:4
                sum+=transposed_correction[k,i]*transposed_epoints[k,j]
            end
            result[j,i]=sum
        end
    end
    result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)

@btime correct($points,$correction);
@btime correct_loop($points,$correction);
julia> include("src/toto.jl")
norm(ref .- res) = 0.0
  557.417 μs (8 allocations: 9.16 MiB) # original correct
  187.333 μs (6 allocations: 6.10 MiB) # correct_loop
100000×3 Matrix{Float64}:
  0.57574    0.64486   1.20726
  0.439505   1.32538   0.792494
  0.24606    1.11793   0.880008
  0.665634   0.67445   0.743179
  0.072872   1.14395   0.95964

(Macbook pro M1 max) (no inbounds check)

2 Likes

Not sure if this is much better in terms of speed, but it has fewer allocations:

using BenchmarkTools
using StaticArrays
function correct_fast(points, correction)
    output = similar(points)
    @inbounds for m in 1:size(points, 1)
        for n in 1:size(points, 2)
            c_mn = correction[n, size(correction, 2)]
            for k in 1:size(points, 2)
                c_mn += points[m, k] * correction[k, n]
            end
            output[m, n] = c_mn
        end
    end
    return output
end

This could be still be improved, but the results are:

julia> points = rand(Float64,(100000,3));
julia> correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1];
julia> @btime correct($points, $correction)
  1.577 ms (8 allocations: 9.16 MiB)
julia> @btime correct_fast($points, $(SMatrix{4,4}(correction)))
  590.079 μs (2 allocations: 2.29 MiB)

julia> correct(points, correction) ≈ correct_fast(points, (SMatrix{4,4}(correction)))
true

You can probably use @LaurentPlagne’s suggestion with LoopVectorization.jl to make this even faster.

EDIT: Pasted the right code this time :slight_smile:

1 Like

Your

doesn’t work for me:

julia> correct(points, correction) ≈ correct_fast(points, SMatrix{4,4}(correction))
false

Where is the correct_fast_no_t() ?

How about:

function correct2(points, correction)
    res = points * (@view correction[1:3, 1:3])'
    res .+= (@view correction[1:3, 4])'
    return res
end

about 4x faster (and 4x less allocation) than original, and perhaps even easier to understand code (??).

julia> @btime correct($points,$correction);
  2.867 ms (8 allocations: 9.16 MiB)

julia> @btime correct2($points,$correction);
  726.598 μs (2 allocations: 2.29 MiB)

If the original points matrix is not needed afterwards, there is a bit more optimization to squeeze by in-place calculation of correction.

4 Likes

Elegant indeed. Note that this version involving BLAS gemm is multithreaded and slower than the loop on my machine.

I pasted the wrong function, my bad!