Help me improve this simple function

laborg · July 5, 2023, 7:00am

Hi,
The following function is used to apply a calibration captured in correction to a matrix of 3D sample points.

using BenchmarkTools
function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)

Result:

julia> @btime correct(points,correction)
  1.239 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:
  0.773386   0.983891  1.03759
  0.237514   1.34664   1.31222
  0.080002   0.822563  1.03253
  0.11178    0.729039  1.40625
  0.14609    1.10849   1.07391
...

I get the feeling that this is far from optimal, so I conjure the Julia hive mind to come up with a faster solution if possible.

Thanks,
Gerhard

Salmon · July 5, 2023, 7:35am

What happens if you interpolate the variables in your benchmark?
I.e.
@btime correct($points,$correction)
In general though I don’t think there is too much room for improvement. On first glance it seems that your code will spend a large majority of the time in the matrix multiplication which will call optimized BLAS routines.

laborg · July 5, 2023, 7:42am

There’s no change noticeable change if I interpolate the variables:

  1.276 ms (8 allocations: 9.16 MiB)
100000×3 Matrix{Float64}:

In the meantime I found a related SO entry:

I might try to do this manually…

JM_Beckers · July 5, 2023, 7:46am

laborg:

using BenchmarkTools
function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end
points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]
@btime correct(points,correction)

If the correction has always the same structure with 0 and 1, you do a lot of unnecessary calculations and the following will provide the same result

function correctb(points, correction)
    
    myv=correction[1:3,4]'
    return   points .+ myv
    
end

LaurentPlagne · July 5, 2023, 7:50am

using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
    result = similar(points)
    nx,ny=size(points)
    transposed_epoints = hcat(points, ones(size(points, 1)))'
    transposed_correction = correction'
    @inbounds for i ∈ 1:ny
        @turbo for j ∈ 1:nx
            sum=0.0
            for k ∈ 1:4
                sum+=transposed_correction[k,i]*transposed_epoints[k,j]
            end
            result[j,i]=sum
        end
    end
    result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)


@btime correct_loop(points,correction)

julia> include("src/toto.jl")
norm(ref .- res) = 0.0
  220.417 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:
 0.853402   0.893323  0.662189
 0.337794   1.45835   0.739322
 0.0468517  0.901759  0.844845
 0.851418   1.23569   0.666883
 0.559083   1.11187   1.3632
 0.721222   0.641998  0.685349

laborg · July 5, 2023, 7:52am

Good point, my example of correction isn’t representative. All fields can be different from 0 and 1.
Fastest version so far is by adding a @view:

function correctc(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return @view (correction * epoints')'[:, 1:3]
end

Result:

  842.103 μs (6 allocations: 6.87 MiB)
100000×3 view(adjoint(::Matrix{Float64}), :, 1:3) with eltype Float64:

laborg · July 5, 2023, 7:54am

Nice!

  709.565 μs (6 allocations: 6.10 MiB)
100000×3 Matrix{Float64}:

LaurentPlagne · July 5, 2023, 8:00am

using BenchmarkTools
using LinearAlgebra
using LoopVectorization

function correct(points, correction)
    epoints = hcat(points, ones(size(points, 1)))
    return (correction * epoints')'[:, 1:3]
end

function correct_loop(points, correction)
    result = similar(points)
    nx,ny=size(points)
    transposed_epoints = hcat(points, ones(size(points, 1)))'
    transposed_correction = correction'
    @turbo for i ∈ 1:ny
        for j ∈ 1:nx
            sum=0.0
            for k ∈ 1:4
                sum+=transposed_correction[k,i]*transposed_epoints[k,j]
            end
            result[j,i]=sum
        end
    end
    result
end

points = rand(Float64,(100000,3))
correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1]

ref = correct(points,correction)
res = correct_loop(points,correction)

@show norm(ref .- res)

@btime correct($points,$correction);
@btime correct_loop($points,$correction);

julia> include("src/toto.jl")
norm(ref .- res) = 0.0
  557.417 μs (8 allocations: 9.16 MiB) # original correct
  187.333 μs (6 allocations: 6.10 MiB) # correct_loop
100000×3 Matrix{Float64}:
  0.57574    0.64486   1.20726
  0.439505   1.32538   0.792494
  0.24606    1.11793   0.880008
  0.665634   0.67445   0.743179
  0.072872   1.14395   0.95964

(Macbook pro M1 max) (no inbounds check)

jmair · July 5, 2023, 8:29am

Not sure if this is much better in terms of speed, but it has fewer allocations:

using BenchmarkTools
using StaticArrays
function correct_fast(points, correction)
    output = similar(points)
    @inbounds for m in 1:size(points, 1)
        for n in 1:size(points, 2)
            c_mn = correction[n, size(correction, 2)]
            for k in 1:size(points, 2)
                c_mn += points[m, k] * correction[k, n]
            end
            output[m, n] = c_mn
        end
    end
    return output
end

This could be still be improved, but the results are:

julia> points = rand(Float64,(100000,3));
julia> correction = Float64[1 0 0 -0.049191; 0 1 0 0.534935; 0 0 1 0.53503; 0 0 0 1];
julia> @btime correct($points, $correction)
  1.577 ms (8 allocations: 9.16 MiB)
julia> @btime correct_fast($points, $(SMatrix{4,4}(correction)))
  590.079 μs (2 allocations: 2.29 MiB)

julia> correct(points, correction) ≈ correct_fast(points, (SMatrix{4,4}(correction)))
true

You can probably use @LaurentPlagne’s suggestion with LoopVectorization.jl to make this even faster.

EDIT: Pasted the right code this time

laborg · July 5, 2023, 8:41am

Your

doesn’t work for me:

julia> correct(points, correction) ≈ correct_fast(points, SMatrix{4,4}(correction))
false

Where is the correct_fast_no_t() ?

Dan · July 5, 2023, 9:50am

How about:

function correct2(points, correction)
    res = points * (@view correction[1:3, 1:3])'
    res .+= (@view correction[1:3, 4])'
    return res
end

about 4x faster (and 4x less allocation) than original, and perhaps even easier to understand code (??).

julia> @btime correct($points,$correction);
  2.867 ms (8 allocations: 9.16 MiB)

julia> @btime correct2($points,$correction);
  726.598 μs (2 allocations: 2.29 MiB)

If the original points matrix is not needed afterwards, there is a bit more optimization to squeeze by in-place calculation of correction.

LaurentPlagne · July 5, 2023, 1:08pm

Elegant indeed. Note that this version involving BLAS gemm is multithreaded and slower than the loop on my machine.

jmair · July 5, 2023, 2:08pm

I pasted the wrong function, my bad!

Topic		Replies	Views
How to accelerate matrix operations(multiplication, add, inverse) in a for loop? Performance performance , matlab	23	6691	September 2, 2018
Fastest way to index matrices Performance	6	1012	March 3, 2019
How to improve the performance while do matrix operation Performance	13	818	August 27, 2018
Element by element wise multiplication and addition/subtraction of 3d matrices Numerics question	19	1021	June 14, 2023
Speed up applying a function to matrices Performance	8	620	August 13, 2021

Help me improve this simple function

Related topics