It is possible in Julia:
using BenchmarkTools, LoopVectorization, PaddedMatrices # v0.2.1
@noinline function dostuff(A, B)
    C = A * B
    s = zero(eltype(C))
    @avx for i in eachindex(C)
        s += C[i]
    end
    s
end
function main_v1()
    A = @StrideArray rand(8,8)
    B = @StrideArray rand(8,8)
    dostuff(A, B)
end
function main_v2()
    A = @StrideArray rand(8,8)
    B = @StrideArray rand(8,8)
    @gc_preserve dostuff(A, B)
end
@benchmark main_v1()
@benchmark main_v2()
Because a StrideArray is your typical mutable array:
julia> A = @StrideArray rand(2,5)
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
0.429701 0.318488 0.842704 0.0217103 0.212563
0.82351 0.245693 0.890502 0.941539 0.626707
julia> A[1,3] = 8;
julia> A
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
0.429701 0.318488 8.0 0.0217103 0.212563
0.82351 0.245693 0.890502 0.941539 0.626707
main_v1 of course causes allocations:
julia> @benchmark main_v1()
BenchmarkTools.Trial:
memory estimate: 1.06 KiB
allocs estimate: 2
--------------
minimum time: 71.279 ns (0.00% GC)
median time: 91.891 ns (0.00% GC)
mean time: 101.286 ns (10.74% GC)
maximum time: 924.190 ns (77.74% GC)
--------------
samples: 10000
evals/sample: 974
But main_v2 stack allocates them:
julia> @benchmark main_v2()
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 40.155 ns (0.00% GC)
median time: 40.206 ns (0.00% GC)
mean time: 40.346 ns (0.00% GC)
maximum time: 57.046 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 990
Also, for good measure, immutable structs like StaticArrays' SMatrix will normally be stack allocated:
using StaticArrays
function main_v3()
    A = @SMatrix rand(8,8)
    B = @SMatrix rand(8,8)
    dostuff(A, B)
end
@benchmark main_v3()
yielding
julia> @benchmark main_v3()
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 186.780 ns (0.00% GC)
median time: 187.906 ns (0.00% GC)
mean time: 188.042 ns (0.00% GC)
maximum time: 203.646 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 891
The macro PaddedMatrices.@gc_preserve works by:
- using GC.@preserve on all arguments to the call, and
- replacing AbstractArray arguments with a PtrArray that holds a pointer to the original array, along with its size, strides, and offsets.

GC.@preserve protects the memory from being collected, but the backing buffer can still be stack allocated because it cannot escape; the array itself is replaced with the PtrArray. Of course, you as the user have to guarantee that the PtrArray doesn't escape, since it is only valid for as long as GC.@preserve protects the data.
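Conceptually, the call in main_v2 ends up looking something like the sketch below. This is only a rough approximation of what the macro generates, and I'm assuming here that a PtrArray can be constructed directly from an array argument; check the package source for the exact expansion:

function main_v2_manual()
    A = @StrideArray rand(8,8)
    B = @StrideArray rand(8,8)
    GC.@preserve A B begin
        # Assumption: pointer-backed wrappers built from each array argument;
        # the real macro does this for every AbstractArray in the call.
        pA = PaddedMatrices.PtrArray(A)
        pB = PaddedMatrices.PtrArray(B)
        dostuff(pA, pB)   # pA and pB must not escape the @preserve block
    end
end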
You could take a similar approach with whatever data structures you need.
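For example, here is a minimal sketch of the same idea with a plain Vector, using only Base's GC.@preserve and raw pointers (my own illustration, not part of the package):

function sum_through_pointer(buf::Vector{Float64})
    s = 0.0
    GC.@preserve buf begin          # keep buf rooted while we use its raw pointer
        p = pointer(buf)
        for i in eachindex(buf)
            s += unsafe_load(p, i)  # raw loads are only safe while buf is preserved
        end
    end
    return s
end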
Also, you should be able to avoid all dynamic dispatches. Try branching at the point where a value could take more than one type, and immediately calling into another function from each branch (a function barrier). That is, instead of
function foo(args...)
    # do stuff
    if hit_metallic_object
        thing_its_reflecting_off_of = MetalType()
    elseif #...
        thing_its_reflecting_off_of = #...
    else #...
        # ...
    end
    # computations continue
end
do something like
function foo(args...)
    # do stuff
    if hit_metallic_object
        foo_continued(MetalType(), args...)
    elseif #...
        # ...
    else #...
        # ...
    end
end
function foo_continued(thing_its_reflecting_off_of, args...)
    # computations continue
end
I don’t know what your code looks like, but you should be able to restructure/organize it in a way that avoids dynamic dispatches.
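As a concrete (and entirely hypothetical) illustration of that pattern: the branch picks a concrete type, and the continuation function is compiled separately for each one, so everything after the branch is statically dispatched.

struct Metal end
struct Glass end

reflectance(::Metal) = 0.9    # placeholder values, just for illustration
reflectance(::Glass) = 0.04

function trace(hit_metallic_object::Bool, intensity)
    if hit_metallic_object
        trace_continued(Metal(), intensity)
    else
        trace_continued(Glass(), intensity)
    end
end

# Specialized on the concrete material type, so no dynamic dispatch inside.
trace_continued(material, intensity) = intensity * reflectance(material)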