Performance optimization on lots of small linear algebra operations

vfonov · July 6, 2022, 6:48pm

I stumble on a problem with execution speed when I run lots of simple linear algebra operations in a loop, where using general Arrays performs faster then StaticArrays, but still not as fast as a pure C implementation.
The whole version of the code is just performing resampling of 3D volume using affine transformation, the time to execute equivalent C code is about 0.5 sec.

Here is (simplified version of my code) (version using Array):


struct AffineTransform
    rot::Matrix{Float64}
    shift::Vector{Float64}
end

function AffineTransform()
    AffineTransform([1.0 0.0 0.0 ;0.0 1.0 0.0 ;0.0 0.0 1.0 ],[0.0,0.0,0.0])
end

function AffineTransform(mat::Matrix{Float64})
    AffineTransform(mat[1:3,1:3],mat[1:3,4])
end

function transform_point(tfm::AffineTransform, 
    p::Vector{Float64})::Vector{Float64}

    (p' * tfm.rot)' + tfm.shift
end

Executing it in a loop with @timev:

  9.634290 seconds (104.20 M allocations: 7.794 GiB, 11.55% gc time, 7.36% compilation time)
elapsed time (ns): 9634290191
gc time (ns):      1112299564
bytes allocated:   8368819384
pool allocs:       104201914
non-pool GC allocs:512
malloc() calls:    1
GC pauses:         72

When executed with --track-allocation=user Coverage shows that most of memory is allocated in transform_point function

Second version, using StaticArrays:

using StaticArray

struct AffineTransform
    rot::SMatrix{3,3,Float64}
    shift::SVector{3,Float64}
end

function AffineTransform()
    AffineTransform(SA_F64[1.0 0.0 0.0 ;0.0 1.0 0.0 ;0.0 0.0 1.0 ],SA_F64[0.0,0.0,0.0])
end

function AffineTransform(mat::Matrix{Float64})
    AffineTransform(mat[1:3,1:3],mat[1:3,4])
end

function transform_point(tfm::AffineTransform, 
    p::SVector{3,Float64})::SVector{3,Float64}

    (p' * tfm.rot)' + tfm.shift
end

Executing with @timev shows:

 15.742327 seconds (131.19 M allocations: 4.041 GiB, 4.12% gc time, 7.11% compilation time)
elapsed time (ns): 15742327259
gc time (ns):      649276769
bytes allocated:   4338777545
pool allocs:       131187797
non-pool GC allocs:728
malloc() calls:    67
realloc() calls:   8
GC pauses:         36

Again, running with --track-allocation=user shows that memory is mostly allocated in transform_point

So, what am I doing wrong ?

stevengj · July 6, 2022, 7:29pm

Abstractly typed

vfonov · July 6, 2022, 7:31pm

so, replacing it to rot::SMatrix{3,3,Float64,9} speeds things up:

2.134522 seconds (3.27 M allocations: 232.767 MiB, 2.67% gc time, 72.24% compilation time)
elapsed time (ns): 2134522290
gc time (ns):      57068381
bytes allocated:   244073420
pool allocs:       3270803
non-pool GC allocs:740
malloc() calls:    68
GC pauses:         3

lmiq · July 6, 2022, 7:53pm

It will be easier for people to help if you put an exact running version of what you are benchmarking.

roflmaostc · July 6, 2022, 7:54pm

As the result indicates, it measures a lot compilation time.

Try to use BenchmarkTools.jl which avoids effects from global variables and compilation:

julia> x = randn((3,))
3-element Vector{Float64}:
  2.0329917778217044
 -1.7537450740457912
 -0.7547097484781418

julia> @btime transform_point(AffineTransform(), $x)
  303.315 ns (5 allocations: 448 bytes)
3-element Vector{Float64}:
  2.0329917778217044
 -1.7537450740457912
 -0.7547097484781418

DNF · July 6, 2022, 7:54pm

What are you executing? This is just a bunch of definitions.

vfonov · July 6, 2022, 7:58pm

So, my full version is a little bit more complicated:

Version with Array: Minc2.jl/resample.jl at main · vfonov/Minc2.jl · GitHub
Version with StaticArrays: Minc2.jl/resample.jl at StaticArrays · vfonov/Minc2.jl · GitHub

It has lots of other dependecies, so I tried to show only the parts, that are important (I think)

DNF · July 7, 2022, 7:17am

This is a complicated piece of code, at least for quick message board feedback.

If you are interested in getting your code closer to C performance, maybe you could extract and post a minimal working example, that people can just cut and paste straight into their REPL? That code should include the major hot loops (and no argparse, or file reading, or anything like that, make sure to include code to generate random points).

From glancing at your code, I see there is a lot of unnecessary complexity, lots of creating Arrays that then get converted to StaticArrays. And, everywhere, overcomplicated type signatures and assertions, that are simultaneously restrictive and inefficient.

Here’s an example of how to make your struct definition of AffineTransform much more efficient and more general with less code:

struct AffineTransform{T} 
    rot::SMatrix{3,3,T,9}
    shift::SVector{3,T}
end
function AffineTransform(rot, shift)
    ind = SA[1, 2, 3]
    return AffineTransform(rot[ind, ind], shift[ind])
end
function AffineTransform(rot)
    ind = SA[1, 2, 3]
    return AffineTransform(rot[ind, ind], rot[ind, 4])
end

These constructors will work for both regular Array and StaticArray inputs (and many other AbstractArrays), and are 20-40x faster than your version (though, I guess this is not really your bottleneck).

(The key here, is to realize that when you index into an array with a StaticArray, you get a StaticArray:

julia> x = rand(5); # regular array

julia> x[SA[1,2,3]]
3-element SVector{3, Float64} with indices SOneTo(3):
 0.1896515325909145
 0.41007884427545227
 0.3283262466574635

efficiently, and with zero allocations)

vfonov · July 7, 2022, 2:09pm

Thank you!

Current version of the code with StaticArrays is already pretty close to the original C , after compilation. But simplifying the code with indexing will help too!

vfonov · July 7, 2022, 3:53pm

I created a minimally working example: Minimally working example of AffineTransform and transform_point · GitHub

vfonov · July 7, 2022, 3:54pm

Results of running on my laptop with julia 1.6.6 :

  1.619801 seconds (1.53 M allocations: 88.766 MiB, 43.37% compilation time)
elapsed time (ns): 1619800566
bytes allocated:   93077492
pool allocs:       1534106
non-pool GC allocs:235

stillyslalom · July 7, 2022, 7:50pm

The first run of a given function includes compilation time, and that’s much of what you’re reporting here. From the performance tips:

On the first call (…) the function gets compiled. (If you’ve not yet used @time in this session, it will also compile functions needed for timing.) You should not take the results of this run seriously.

On my machine, the second call takes less than half as long as the first:

julia> @time resample_volume(out_vol,i2w,w2i,tfm,itp)
  0.649933 seconds (1.61 M allocations: 82.148 MiB, 55.32% compilation time)

julia> @time resample_volume(out_vol,i2w,w2i,tfm,itp)
  0.267742 seconds

You may also want to take a look at CoordinateTransformations.jl instead of rolling your own.

vfonov · July 8, 2022, 12:10pm

Thank you for the suggestion, perhaps I will integrate my approach wih the CoordinateTransformations. In reality I need non-linear transformations, such as a free-form non-linear (already implemented) where transformations are represented as a field of 3d displacement vectors : https://github.com/vfonov/Minc2.jl/blob/main/src/geo_tools.jl#L147
Perhaps you could recommend some package for that too?

GregVernon · July 8, 2022, 9:11pm

I don’t know if this will be helpful, but several years ago I got some fantastic help for what seems to be at least tangentially related to what you’re doing – so perhaps you’ll find this helpful:

Topic		Replies	Views
Are StaticArrays always faster? Performance matrices , staticarrays	2	552	January 1, 2024
StaticArrays compilation performance bug Performance staticarrays	4	443	July 18, 2021
Strange allocation Performance	2	366	July 17, 2019
Performance details of StaticArray Performance	5	3070	June 7, 2018
Code becomes really slow when I combine StaticArrays and DoubleFloats Performance staticarrays , doublefloats	18	583	September 23, 2022

Performance optimization on lots of small linear algebra operations

Related topics