It is possible in Julia:
```julia
using BenchmarkTools, LoopVectorization, PaddedMatrices # v0.2.1

@noinline function dostuff(A, B)
    C = A * B
    s = zero(eltype(C))
    @avx for i in eachindex(C)
        s += C[i]
    end
    s
end

function main_v1()
    A = @StrideArray rand(8, 8)
    B = @StrideArray rand(8, 8)
    dostuff(A, B)
end

function main_v2()
    A = @StrideArray rand(8, 8)
    B = @StrideArray rand(8, 8)
    @gc_preserve dostuff(A, B)
end

@benchmark main_v1()
@benchmark main_v2()
```
This works because a `StrideArray` is your typical mutable array:
```julia
julia> A = @StrideArray rand(2,5)
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
 0.429701  0.318488  0.842704  0.0217103  0.212563
 0.82351   0.245693  0.890502  0.941539   0.626707

julia> A[1,3] = 8;

julia> A
2×5 StrideMatrix{Tuple{StaticInt{2}, StaticInt{5}}, (true, true), Float64, 1, 0, (1, 2), Tuple{StaticInt{8}, StaticInt{16}}, Tuple{StaticInt{1}, StaticInt{1}}, PaddedMatrices.MemoryBuffer{10, Float64}} with indices StaticInt{1}():StaticInt{1}():StaticInt{2}()×StaticInt{1}():StaticInt{1}():StaticInt{5}():
 0.429701  0.318488  8.0       0.0217103  0.212563
 0.82351   0.245693  0.890502  0.941539   0.626707
```
`main_v1` of course causes allocations:
```julia
julia> @benchmark main_v1()
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  2
  --------------
  minimum time:     71.279 ns (0.00% GC)
  median time:      91.891 ns (0.00% GC)
  mean time:        101.286 ns (10.74% GC)
  maximum time:     924.190 ns (77.74% GC)
  --------------
  samples:          10000
  evals/sample:     974
```
But `main_v2` stack allocates them:
```julia
julia> @benchmark main_v2()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.155 ns (0.00% GC)
  median time:      40.206 ns (0.00% GC)
  mean time:        40.346 ns (0.00% GC)
  maximum time:     57.046 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990
```
Also, for good measure: immutable structs will normally be stack allocated:
```julia
using StaticArrays

function main_v3()
    A = @SMatrix rand(8, 8)
    B = @SMatrix rand(8, 8)
    dostuff(A, B)
end

@benchmark main_v3()
```
yielding
```julia
julia> @benchmark main_v3()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     186.780 ns (0.00% GC)
  median time:      187.906 ns (0.00% GC)
  mean time:        188.042 ns (0.00% GC)
  maximum time:     203.646 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     891
```
The macro `PaddedMatrices.@gc_preserve` works by:

- using `GC.@preserve` on all arguments to the call, and
- trying to replace `AbstractArray`s with a `PtrArray` that holds a pointer to the original array, along with its size, strides, and offsets.
`GC.@preserve` protects the memory from getting collected, but the array can still be stack allocated because it cannot escape; within the call, the array itself is replaced with the `PtrArray`. Of course, you as the user have to guarantee that the `PtrArray` doesn't escape, as it is only valid for as long as `GC.@preserve` protects the data.
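To make the mechanism concrete, here is a minimal hand-written sketch of roughly what the macro arranges. It assumes a `PtrArray(A)` wrapper constructor; this is illustrative, not the macro's actual expansion:

```julia
# Sketch only: roughly the transformation @gc_preserve performs.
function main_v2_manual()
    A = @StrideArray rand(8, 8)
    B = @StrideArray rand(8, 8)
    GC.@preserve A B begin
        # While A and B are preserved, pass pointer-backed views instead
        # of the arrays themselves, so neither argument escapes the call.
        dostuff(PtrArray(A), PtrArray(B))  # PtrArray(A) constructor assumed
    end
end
```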
You could take a similar approach with whatever data structures you need.
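For instance, here is a hedged sketch of the same idea applied to a plain `Vector`, using a hypothetical `UnsafeVec` wrapper defined just for this illustration:

```julia
# Hypothetical minimal wrapper: a raw pointer plus a length.
struct UnsafeVec{T}
    ptr::Ptr{T}
    len::Int
end
Base.getindex(v::UnsafeVec, i::Int) = unsafe_load(v.ptr, i)
Base.length(v::UnsafeVec) = v.len

function sum_unsafe(v::UnsafeVec{T}) where {T}
    s = zero(T)
    for i in 1:length(v)
        s += v[i]
    end
    s
end

function demo()
    x = rand(100)
    # x cannot be collected inside the preserved region, so the raw
    # pointer stays valid; the UnsafeVec must not outlive it.
    GC.@preserve x sum_unsafe(UnsafeVec(pointer(x), length(x)))
end
```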
Also, you should be able to avoid all dynamic dispatches. Try creating branches at the point where a type could take more than one value, and immediately call into those functions. That is, instead of
```julia
function foo(args...)
    # do stuff
    if hit_metalic_object
        thing_its_reflecting_off_of = MetalType()
    elseif # ...
        thing_its_reflecting_off_of = # ...
    else
        # ...
    end
    # computations continue
end
```
do something like
```julia
function foo(args...)
    # do stuff
    if hit_metalic_object
        foo_continued(MetalType(), args...)
    elseif # ...
        # ...
    else
        # ...
    end
end

function foo_continued(thing_its_reflecting_off_of, args...)
    # computations continue
end
```
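As a concrete, runnable toy version of that pattern (all names here are made up for illustration):

```julia
# Toy materials; the concrete type is decided inside the branch.
struct Metal end
struct Glass end
reflectance(::Metal, intensity) = 0.9 * intensity
reflectance(::Glass, intensity) = 0.1 * intensity

function trace(hit_metal::Bool, intensity)
    if hit_metal
        trace_continued(Metal(), intensity)  # type is concrete in this branch
    else
        trace_continued(Glass(), intensity)
    end
end

# Each branch hands the compiler a concrete type, so trace_continued is
# specialized per material and the follow-on computation runs without
# dynamic dispatch.
trace_continued(material, intensity) = reflectance(material, intensity)
```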
I don’t know what your code looks like, but you should be able to restructure it in a way that avoids dynamic dispatches.