I would focus on the desired output and try to achieve that as straightforwardly as possible, rather than trying to port some other implementation designed for a system with different performance characteristics. In this case it’s all a matter of filling an array with entries from a vector, so just do a double loop:
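```julia
function buffer(X, n, p)
    0 <= p < n || error("You must have 0 <= p < n.")
    m = cld(length(X), n - p)       # number of columns (frames)
    out = zeros(eltype(X), n, m)    # slots with no source entry stay zero
    for j = 1:m
        for i = 1:n
            # Index into X for row i of frame j; frames advance by n - p,
            # so consecutive frames share p entries.
            k = (j - 1) * (n - p) + i - p
            if k in eachindex(X)
                out[i, j] = X[k]
            end
        end
    end
    return out
end
```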
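As a quick sanity check of the index formula, here is a small worked case. So that the snippet runs on its own, it restates the same mapping as a comprehension under the made-up name `buffer_sketch`, and uses `get` with a default value in place of the explicit bounds check. Treat it as a sketch for illustration, not a replacement for the loop version.

```julia
# Hypothetical compact restatement of the same index mapping, for illustration only.
function buffer_sketch(X::AbstractVector, n, p)
    0 <= p < n || error("You must have 0 <= p < n.")
    step = n - p                    # how far each frame advances
    m = cld(length(X), step)        # number of frames (columns)
    # get(X, k, default) returns X[k] when k is in bounds, otherwise the default.
    return [get(X, (j - 1) * step + i - p, zero(eltype(X))) for i in 1:n, j in 1:m]
end

X = collect(1:10)
A = buffer_sketch(X, 4, 1)
# With n = 4 and p = 1, each column repeats the last entry of the previous one,
# the first column starts with p zeros, and the last column is zero-padded:
# A == [0 3 6  9;
#       1 4 7 10;
#       2 5 8  0;
#       3 6 9  0]
```

With `p = 0` there is no overlap and the vector is simply cut into columns of length `n`, with zeros filling out the last column if needed.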