using StrideArrays, LoopVectorization
function test2(n)
a = ones(Int64, n, n)
a .+= 2
b = sum(a, dims=2)
b .*= 3
b .+= 0
return b
end
function test3(::Val{N}) where {N}
_a = StrideArray{Int64}(undef, (StaticInt(N),StaticInt(N)))
b = StrideArrays.preserve_buffer(_a)
a = PtrArray(_a)
GC.@preserve begin b
@avx a .= 1
@avx a .+= 2
b = Matrix{Int}(undef, size(a,1), 1)
@avx for j in axes(a,2)
bi = 0
for i in axes(a,1)
bi += a[i,j]
end
b[i] = bi
end
end # GC.@preserve
@avx b .*= 3
@avx b .+= 0
return b
end
I get
julia> @btime test2(100)'
10.006 ÎĽs (7 allocations: 79.16 KiB)
1Ă—100 adjoint(::Matrix{Int64}) with eltype Int64:
900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 … 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900
julia> @btime test3(Val(100))'
3.413 ÎĽs (1 allocation: 896 bytes)
1Ă—100 adjoint(::Matrix{Int64}) with eltype Int64:
900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 … 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900 900
About 3x faster, and a tiny fraction of the memory use.
This will only work up to a certain size. You probably can’t stack allocate a 1000x1000 array in Julia – the stack isn’t big enough, so you’ll probably get a crash.
It also has to be statically sized.
In Fortran, it’s easy to set a bigger stack size and to stack allocate dynamically sized arrays. Hopefully we’ll get there sooner rather than later.
I was a bit heavy handed with test3
. Something wasn’t working right somewhere, so I’ll have to take more of a look at it later.
StrideArrays is far from mature – I haven’t used it much myself yet – so it needs a lot of ironing out.
If you pin down any problematic functions that don’t work (and which aren’t solved with @gc_preserve
, please file an issue.
sum(a,dims=2)
apparently isn’t type stable when used with StrideArray
s, for example…
EDIT:
There aren’t any semantic problems. It’s just that (currently – will probably be fixed eventually) allocating base arrays is opaque to the compiler. mutable struct
s aren’t, so you can just create a mutable struct
holding an NTuple
for memory, and the compiler can optimize it, like in the above example with just 1 allocation.
But it’s fairly easy to mess up at the moment.