I've run into a problem with execution speed when I run lots of simple linear algebra operations in a loop: plain Arrays turn out faster than StaticArrays, but still not as fast as a pure C implementation.
The full code just resamples a 3D volume using an affine transformation; the equivalent C code executes in about 0.5 sec.
Here is a simplified version of my code (the version using Array):
struct AffineTransform
    rot::Matrix{Float64}
    shift::Vector{Float64}
end

function AffineTransform()
    AffineTransform([1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0], [0.0, 0.0, 0.0])
end

function AffineTransform(mat::Matrix{Float64})
    AffineTransform(mat[1:3, 1:3], mat[1:3, 4])
end

function transform_point(tfm::AffineTransform,
                         p::Vector{Float64})::Vector{Float64}
    (p' * tfm.rot)' + tfm.shift
end
Executing it in a loop with @timev:
9.634290 seconds (104.20 M allocations: 7.794 GiB, 11.55% gc time, 7.36% compilation time)
elapsed time (ns): 9634290191
gc time (ns): 1112299564
bytes allocated: 8368819384
pool allocs: 104201914
non-pool GC allocs:512
malloc() calls: 1
GC pauses: 72
When executed with --track-allocation=user, Coverage shows that most of the memory is allocated in the transform_point function.
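The loop itself is not shown here. As a rough sketch only, a timing loop of this kind for the Array version might look like the following. The function name resample_grid, the 50×50×50 grid size, and the accumulator standing in for the interpolation step are all placeholders for illustration, not the real code:

```julia
# Hypothetical timing harness for the Array version.
# resample_grid, the grid size, and the accumulator are illustrative placeholders.
struct AffineTransform
    rot::Matrix{Float64}
    shift::Vector{Float64}
end

AffineTransform() = AffineTransform([1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0],
                                    [0.0, 0.0, 0.0])

function transform_point(tfm::AffineTransform,
                         p::Vector{Float64})::Vector{Float64}
    (p' * tfm.rot)' + tfm.shift
end

# Visit every voxel of an n×n×n grid and transform its coordinates.
function resample_grid(tfm::AffineTransform, n::Int)
    acc = 0.0
    for k in 1:n, j in 1:n, i in 1:n
        q = transform_point(tfm, [Float64(i), Float64(j), Float64(k)])
        acc += q[1] + q[2] + q[3]   # stand-in for the interpolation step
    end
    return acc
end

tfm = AffineTransform()
resample_grid(tfm, 2)           # warm-up call, so compilation is mostly excluded
@timev resample_grid(tfm, 50)   # timed run
```

In a loop shaped like this, every call builds a fresh coordinate vector and transform_point returns a new vector, which would be consistent with the allocations being reported inside transform_point.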
Second version, using StaticArrays:
using StaticArrays

struct AffineTransform
    rot::SMatrix{3,3,Float64}
    shift::SVector{3,Float64}
end

function AffineTransform()
    AffineTransform(SA_F64[1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0], SA_F64[0.0, 0.0, 0.0])
end

function AffineTransform(mat::Matrix{Float64})
    AffineTransform(mat[1:3, 1:3], mat[1:3, 4])
end

function transform_point(tfm::AffineTransform,
                         p::SVector{3,Float64})::SVector{3,Float64}
    (p' * tfm.rot)' + tfm.shift
end
Executing with @timev shows:
15.742327 seconds (131.19 M allocations: 4.041 GiB, 4.12% gc time, 7.11% compilation time)
elapsed time (ns): 15742327259
gc time (ns): 649276769
bytes allocated: 4338777545
pool allocs: 131187797
non-pool GC allocs:728
malloc() calls: 67
realloc() calls: 8
GC pauses: 36
Again, running with --track-allocation=user shows that memory is mostly allocated in transform_point.
So, what am I doing wrong?