Hello, I have 3 arrays (x,y,w)
where I want elementwise y += x * w
but the specific element indexes are stored in 3 separate vectors of SArrays, x_ids
, y_ids
and w_ids
.
The problem is that the operation hugely allocates, and I wander if there are better ways without allocation.
Here is a MWE… I also added a version where the indexes are SVector of SVector, but (1) the computation is even slower, (2) I don’t know the length of the indexes a priori (but when the positional indexed are computed they don’t change anymore), (3) creating the object with the index is very slow and I believe I risk out of memory errors with slightly larger index vectors .
using StaticArrays, BenchmarkTools
mutable struct ConvLayer
x_ids::Array{SVector{3,Int32}}
y_ids::Array{SVector{3,Int32}}
w_ids::Array{SVector{4,Int32}}
end
mutable struct ConvLayerSV # an all-static alternative
x_ids::SVector{3000, SVector{3, Int32}}
y_ids::SVector{3000, SVector{3, Int32}}
w_ids::SVector{3000, SVector{4, Int32}}
end
# Data generation for the MWE...
x = rand(64,64,3)
y = zeros(32,32,5)
w = rand(4,4,3,5)
N = 3000
x_ids = [SVector{3,Int32}([rand(1:id) for id in size(x)]...) for n in 1:N]
y_ids = [SVector{3,Int32}([rand(1:id) for id in size(y)]...) for n in 1:N]
w_ids = [SVector{4,Int32}([rand(1:id) for id in size(w)]...) for n in 1:N]
layer = ConvLayer(x_ids, y_ids, w_ids)
layerSV = ConvLayerSV(x_ids, y_ids, w_ids) # long time here!
function compute!(y,l,x)
for i in 1:length(l.y_ids)
y[l.y_ids[i][1],l.y_ids[i][2],l.y_ids[i][3]] +=
x[l.x_ids[i][1],l.x_ids[i][2],l.x_ids[i][3]] *
w[l.w_ids[i][1],l.w_ids[i][2],l.w_ids[i][3],l.w_ids[i][4]]
end
return nothing
end
# The computation that I care...
@btime compute!($y,$layer,$x) # 7.542 ms (117890 allocations: 2.71 MiB)
@btime compute!($y,$layerSV,$x) # 42.124 ms ( 63000 allocations: 1.42 MiB)
Obviously, the context is of a convolution, and I am aware of packages like FourierTools.jl or Stencils.jl, but I was wandering if one could optimize this approach without allocation.
The idea is to store information about boundaries, padding, stride, etc. in a “precompile” step where the indexes are first created and then “just” loop over these indexes…