3D point rotation

I have a 2D array x of size (r, c), where c = 3*nm, i.e. nm stacked 3D points per row. I can rotate it around an axis by some angle θ, but this code is not very fast. I tried using quaternions for this but didn’t find an efficient way either. Any ideas? Thanks!

θ = (θ*π)/180
r, c = size(x)
nm = div(c, 3)
x3 = reshape(x, r, 3, nm)
point = nanmean.([x3[:,k,:] for k in 1:3])
R = AngleAxis(θ, axis...)
r3 = similar(x3)
point_reshaped = reshape(point, 1, 3)
for j in 1:nm
    r3[:,:,j] = (R * (x3[:,:,j] .- point_reshaped)')' .+ point'
end
x2 = reshape(r3, r, c)

Hi, welcome! I’m not familiar with AngleAxis and related functions, and I think it would be beneficial to turn your example into an MWE with imports etc. so others can chime in with more detailed tips. In the meantime, I wanted to point you to the performance tips: Performance Tips · The Julia Language.

In particular here: use views instead of slice indexing, and use .= when writing to r3

1 Like

Did you try Rotations.jl with StaticArrays.jl? It is usually better to store each vector as a static vector, and collect all vectors into a dynamic vector.

By the way, I also see a double transpose in there: with a bit of linear algebra you can transpose R, point and point_reshaped instead, and avoid transposing the data altogether.
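For example, since (R * A')' == A * R' for any real matrix A, the per-slice update can be written without transposing the point data. A minimal sketch, using a plain 3×3 rotation matrix as a stand-in for AngleAxis (which is not loaded here):

```julia
θ = π / 2
R = [1 0 0; 0 cos(θ) -sin(θ); 0 sin(θ) cos(θ)]  # rotation about the x-axis
A = randn(5, 3)   # 5 points stored as rows
p = randn(1, 3)   # centroid as a row vector

v1 = (R * (A .- p)')' .+ p   # double-transpose form from the question
v2 = (A .- p) * R' .+ p      # equivalent, transpose-free form
v1 ≈ v2                      # true
```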

Thanks. Here’s a MWE:

using Rotations, NaNStatistics, BenchmarkTools
function rotate(x, θ, axis)
    θ = (θ*π)/180
    r, c = size(x)
    nm = div(c, 3)
    x3 = reshape(x, r, 3, nm)
    point = nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    point_reshaped = reshape(point, 1, 3)
    for j in 1:nm
        r3[:,:,j] = (R * (x3[:,:,j] .- point_reshaped)')' .+ point'
    end
    x2 = reshape(r3, r, c)
end
@btime rotate(randn(20700,63), 90, [1 0 0])

68.665 ms (279 allocations: 71.94 MiB)
I tried StaticArrays.jl, Quaternions.jl (Rotate a vector with a quaternion), @inbounds, and @views, but the code above was the fastest. Moar performance is needed!

OK, this is a bit faster already:

using Rotations, NaNStatistics, BenchmarkTools, LinearAlgebra
function rotate(x,θ,axis)
    θ = (θ*π)/180
    r,c = size(x)
    nm = div(c,3)
    x3 = reshape(x,r,3,nm)
    point = nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    point_reshaped = reshape(point, 1, 3)
    tmp = Matrix{Float64}(undef,r,3)
    @views for j in 1:nm
        tmp = x3[:,:,j] .- point_reshaped
        mul!(r3[:,:,j], tmp, R')
        r3[:,:,j] .+= point_reshaped
    end
    x2 = reshape(r3,r,c)
end
@btime rotate(randn(20700,63),90,[1 0 0])

4.902 ms (90 allocations: 40.44 MiB)

Presumably you want .= here.


If you go full-on SVector (or Point3f), representing your matrix of 3D points simply using a Matrix{SVector{3, Float64}}, the code becomes much cleaner in my opinion, but apparently slower for reasons I don’t understand.

function rotate_sv(x, θ, axis)
    θ = deg2rad(θ)
    mean_pt = nanmean(x)
    R = AngleAxis(θ, axis...)
    return (Ref(R) .* (x .- Ref(mean_pt))) .+ Ref(mean_pt)
end
julia> θ = 90; axis = SA[1., 0., 0.];

julia> @benchmark rotate(x, $θ, $axis) setup=(x = randn(20700, 63))  # with .=
BenchmarkTools.Trial: 404 samples with 1 evaluation per sample.
 Range (min … max):  5.542 ms … 20.808 ms  ┊ GC (min … max):  0.00% … 72.45%
 Time  (median):     8.297 ms              ┊ GC (median):    30.26%
 Time  (mean ± σ):   8.256 ms ±  1.467 ms  ┊ GC (mean ± σ):  27.20% ± 11.48%

  ▃▃                 ▂██▆▄▁
  ███▇▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁██████▅▇▄▇▅▇▄▅▄▁▁▄▁▁▄▁▁▁▄▄▁▁▁▁▄▁▁▁▁▄▄▅▆ ▇
  5.54 ms      Histogram: log(frequency) by time     13.1 ms <

 Memory estimate: 20.37 MiB, allocs estimate: 21.

julia> @benchmark rotate_sv(x, $θ, $axis) setup=(x = randn(SVector{3, Float64}, 20700, 21))
BenchmarkTools.Trial: 249 samples with 1 evaluation per sample.
 Range (min … max):  12.265 ms … 21.694 ms  ┊ GC (min … max): 0.00% … 11.86%
 Time  (median):     12.631 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.424 ms ±  1.329 ms  ┊ GC (mean ± σ):  4.85% ±  6.81%

  ▄█▇▆▃▂                  ▁▅▃▄▁
  ███████▅▅▆▆▆▇▁▆▅▅▁▅▆▁▅▁▅█████▇▁▇▅▆▇▇▆▅▅▅▁▅▁▁▅▅▁▅▁▁▁▁▁▅▁▁▁▁▅ ▆
  12.3 ms      Histogram: log(frequency) by time      17.4 ms <

 Memory estimate: 9.95 MiB, allocs estimate: 3.

Here I used BLAS.set_num_threads(1) for a fair comparison, but it doesn’t seem to make much of a difference.

1 Like

Some small improvements:

function rotate(x,θ,axis)
    θ = (θ*π)/180
    r,c = size(x)
    nm = div(c,3)
    x3 = reshape(x,r,3,nm)
    point = @views nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    point_reshaped = reshape(point, 1, 3)
    tmp = Matrix{Float64}(undef,r,3)
    @views for j in axes(x3, 3)
        tmp .= x3[:,:,j] .- point_reshaped
        mul!(r3[:,:,j], tmp, R')
        r3[:,:,j] .+= point_reshaped
    end
    x2 = reshape(r3,r,c)
end

It’s also a confounding factor to include the construction of the input array in the benchmark, so it’s better to break it out and use @btime interpolation. Another small improvement is to pass the axis as a tuple.

x=randn(20700, 63);
@btime rotate($x, 90, (1, 0, 0));
1 Like

It seems that *(::AngleAxis, ::SVector{3}) is considerably slower than *(::SMatrix{3, 3}, ::SVector{3}). (Also, using an explicit loop instead of broadcasting seems to be some 10% faster.)

function rotate_sv2(x, θ, axis)
    θ = deg2rad(θ)
    mean_pt = nanmean(x)
    R = SMatrix(AngleAxis(θ, axis...))
    ret = similar(x)
    @inbounds for I = eachindex(x)
        ret[I] = R * (x[I] .- mean_pt) .+ mean_pt
    end
    return ret
end
julia> @benchmark rotate_sv2(x, $θ, $axis) setup=(x = randn(SVector{3, Float64}, 20700, 21))
BenchmarkTools.Trial: 477 samples with 1 evaluation per sample.
 Range (min … max):  2.961 ms … 18.198 ms  ┊ GC (min … max):  0.00% … 83.15%
 Time  (median):     3.078 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.949 ms ±  1.520 ms  ┊ GC (mean ± σ):  18.68% ± 19.35%

  █▅                 ▅▃
  ██▇▆▁▁▁▁▄▅▁▄▁▁▁▁▁▁███▇▄▁▆▇▄▅▇▁▁▁▄▁▅▄▁▁▁▁▁▁▄▁▁▁▁▄▁▄▁▅▁▁▁▁▄▅ ▆
  2.96 ms      Histogram: log(frequency) by time      9.5 ms <

 Memory estimate: 9.95 MiB, allocs estimate: 3.
1 Like

I had a few minutes so I played around a little bit: you can remove some redundant computations by doing some linear algebra (see rotate3 below), but overall performance didn’t change on my device. You can probably squeeze out more by using more efficient memory access (you are iterating over the third axis now).

using Rotations, NaNStatistics, BenchmarkTools, LinearAlgebra, Random
Random.seed!(42) 

function rotate(x,θ,axis)
    θ = (θ*π)/180
    r,c = size(x)
    nm = div(c,3)
    x3 = reshape(x,r,3,nm)
    point = nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    point_reshaped = reshape(point, 1, 3)
    tmp = Matrix{Float64}(undef,r,3)
    @views for j in 1:nm
        tmp = x3[:,:,j] .- point_reshaped
        mul!(r3[:,:,j], tmp, R')
        r3[:,:,j] .+= point_reshaped
    end
    x2 = reshape(r3,r,c)
end
function rotate2(x,θ,axis)
    θ = (θ*π)/180
    r,c = size(x)
    nm = div(c,3)
    x3 = reshape(x,r,3,nm)
    point = @views nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    point_reshaped = reshape(point, 1, 3)
    tmp = Matrix{Float64}(undef,r,3)
    @views for j in axes(x3, 3)
        tmp .= x3[:,:,j] .- point_reshaped
        mul!(r3[:,:,j], tmp, R')
        r3[:,:,j] .+= point_reshaped
    end
    x2 = reshape(r3,r,c)
end
function rotate3(x,θ,axis)
    θ = (θ*π)/180
    r,c = size(x)
    nm = div(c,3)
    x3 = reshape(x,r,3,nm)
    point = @views nanmean.([x3[:,k,:] for k in 1:3])
    R = AngleAxis(θ, axis...)
    r3 = similar(x3)
    tmp = Matrix{Float64}(undef,r,3)
    @views for j in axes(x3, 3)
        tmp .= x3[:,:,j]
        mul!(r3[:,:,j], tmp, R')
    end
    r3 .+= reshape(point, 1, 3) - reshape(point, 1, 3) * R'
    x2 = reshape(r3,r,c)
end

function run_test()
    x = randn(20700,63)

    y1 = rotate(x, 90, [1 0 0]) # warmup
    y2 = rotate2(x, 90, [1 0 0]) # warmup
    y3 = rotate3(x, 90, [1 0 0]) # warmup
    @assert all(y1 .≈ y2 .≈ y3)
    display("all versions match")
    display(@benchmark rotate($x, 90, [1 0 0]))
    display(@benchmark rotate2($x, 90, [1 0 0]))
    display(@benchmark rotate3($x, 90, [1 0 0]))
end
run_test()
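The linear-algebra step that rotate3 exploits is just distributivity: R(x − p) + p = Rx + (p − Rp), so the centering and un-centering collapse into one constant shift applied once, outside the loop. A quick sanity check, with a plain rotation matrix standing in for AngleAxis:

```julia
θ = π / 3
R = [cos(θ) -sin(θ) 0; sin(θ) cos(θ) 0; 0 0 1]  # rotation about the z-axis
x = randn(3)   # a point
p = randn(3)   # the rotation center
shift = p - R * p                # constant, independent of x
R * (x - p) + p ≈ R * x + shift  # true
```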
1 Like

Can you paste a MWE too? Thanks!

This is ideal, considering the slices x3[:,:,j] and r3[:,:,j] used in the matrix multiplication are then contiguous in memory.

Except for the imports from your code and an extra using StaticArrays, I’m not sure what you’re missing for my code above to already constitute a MWE?

I’m not able to run that code. Is x supposed to be a 2D matrix? nanmean(x) gives a scalar.

Well, x is a Matrix{SVector{3, Float64}}, like the ones created in the @benchmark’s setup:

x = randn(SVector{3, Float64}, 20700, 21)

Note in particular that the memory order is different between your x in rotate and mine in rotate_sv/rotate_sv2 (i.e.

julia> size(reinterpret(Float64, x))  # with x the `Matrix{SVector}` above
(62100, 21)

julia> size(reinterpret(reshape, Float64, x))
(3, 20700, 21)

compared to (20700, 63) and (20700, 3, 21) in your code).

Do you mean that this is contiguous because of the reshape? Usually indexing over the third index is not contiguous, right?

No, in general slicing along the last dimensions is in memory order for column-major ordering. For example,

julia> x = reshape(1:6, 2, 3)
2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:
 1  3  5
 2  4  6

julia> x[:, 2]  # contiguous
2-element Vector{Int64}:
 3
 4

julia> x[2, :]  # non-contiguous
3-element Vector{Int64}:
 2
 4
 6

It is true that, as a consequence, in nested loops you should normally go ‘from right to left’

for k in axes(r3, 3)
    for j in axes(r3, 2)
        for i in axes(r3, 1)
            do_stuff_with(r3[i, j, k])
        end
    end
end

so that the most rapidly changing variable (i) is the left-most index in the array. If you want, you can interpret the mul!(r3[:,:,j], tmp, R') call as an inner loop over the first two dimensions.

1 Like

Thanks! It seems that rotate3 and rotate_sv2 are the fastest options. So @eldee, how can I convert an existing 2D matrix (say 20700x21) into a format that can be used with that function?

Happy new year! :slightly_smiling_face:

You could convert an x::Matrix{Float64} of size (r, 3nm) into xsv::Matrix{SVector{3, Float64}} of size (r, nm) via e.g.

xsv = SVector{3, Float64}.(eachslice(reshape(x, size(x, 1), 3, :), dims=(1, 3)))

but note that this conversion does not come for free:

julia> x = randn(20700, 63);

julia> @benchmark SVector{3, Float64}.(eachslice(reshape($x, size($x, 1), 3, :), dims=(1, 3)))
BenchmarkTools.Trial: 1780 samples with 1 evaluation per sample.
 Range (min … max):  1.910 ms … 17.696 ms  ┊ GC (min … max):  0.00% … 88.78%
 Time  (median):     2.047 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.800 ms ±  1.324 ms  ┊ GC (mean ± σ):  26.49% ± 25.22%

  █▇▆▄▃▂▁                              ▁▃▁▁   ▃▃▃▂▁▁▁        ▁
  ████████▇▇▇█▇▅▆▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇█████▇████████▇▇▇▇▆▆▆ █
  1.91 ms      Histogram: log(frequency) by time     5.59 ms <

 Memory estimate: 9.95 MiB, allocs estimate: 4.

(Constructing xsv by explicitly looping is equally fast, while using permutedims and reinterpret is slower.)

So I would advise you to try to work with this representation from the start, if possible. Most of the time it also makes the most sense to put the xyz-axis in the first dimension, either explicitly as a (3, r, nm) Array or implicitly via SVector{3}s (which is exactly the same in memory, just one reinterpret away).
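To illustrate the “one reinterpret away” remark, here is a base-Julia sketch using NTuple{3, Float64} as a stand-in for SVector{3, Float64} (the two share the same memory layout, so StaticArrays isn’t needed for the demonstration):

```julia
# A (3, r, nm) array: xyz varies fastest, here with r = 3 points and nm = 2
a = reshape(collect(1.0:18.0), 3, 3, 2)

# View the leading length-3 dimension as one 3-tuple per point,
# giving an (r, nm) matrix of points without copying
t = reinterpret(reshape, NTuple{3, Float64}, a)

size(t)   # (3, 2)
t[1, 1]   # (1.0, 2.0, 3.0), the xyz of the first point
```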

You could of course also use JADekker’s trick with SVectors, but also here I don’t see any performance improvement.

Exploiting distributivity
julia> function rotate_sv3(x, θ, axis)
           θ = deg2rad(θ)
           mean_pt = nanmean(x)
           R = SMatrix(AngleAxis(θ, axis...))
           total_mean_pt_shift = mean_pt .- R * mean_pt
           ret = similar(x)
           @inbounds for I = eachindex(x)
               ret[I] = R * x[I] .+ total_mean_pt_shift
           end
           return ret
       end
rotate_sv3 (generic function with 1 method)

julia> @benchmark rotate_sv3(x, $θ, $axis) setup=(x = randn(SVector{3, Float64}, 20700, 21))
BenchmarkTools.Trial: 481 samples with 1 evaluation per sample.
 Range (min … max):  2.920 ms … 11.639 ms  ┊ GC (min … max):  0.00% … 71.51%
 Time  (median):     3.085 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.722 ms ±  1.069 ms  ┊ GC (mean ± σ):  17.26% ± 18.56%

  ██▅▄▂▃▁                         ▅▄▃▂▁▁
  ███████▆▇▄▁▄▅▄▁▅▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄██████▇▇▇▄▁▅▁▄▁▆▄▁▄▁▁▅▄▄▁▄ ▇
  2.92 ms      Histogram: log(frequency) by time     6.42 ms <

 Memory estimate: 9.95 MiB, allocs estimate: 3.

Happy New Year!

1 Like

Indeed, the difference in performance between rotate_sv2 and rotate_sv3 is negligible. I’m sticking with the 2D array for pedagogical reasons.

Thanks to all for the great advice! :slightly_smiling_face: