Looking at a few examples, I have figured out a way to do this for square matrices.
Python examples:
In [28]: n = 4; Z = np.reshape(np.arange(n**2), (n, n));
In [29]: k = 2;
In [30]: np.add.reduceat(np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0), np.arange(0, Z.shape[1], k), axis=1)
Out[30]:
array([[10, 18],
       [42, 50]])
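For readers unfamiliar with `reduceat`: each index in the second argument starts a new segment, and the ufunc is reduced over each segment along the given axis. A minimal 1-D sketch:

```python
import numpy as np

# Each index in the second argument starts a segment that runs up to
# (but not including) the next index; the last segment runs to the end.
x = np.arange(8)                  # [0 1 2 3 4 5 6 7]
np.add.reduceat(x, [0, 2, 4, 6])
# → array([ 1,  5,  9, 13])  i.e. sums of x[0:2], x[2:4], x[4:6], x[6:8]
```

Applying this along axis 0 and then axis 1 with stride-k indices is what turns it into a block sum.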
In [31]: k = 3;
In [32]: np.add.reduceat(np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0), np.arange(0, Z.shape[1], k), axis=1)
Out[32]:
array([[45, 21],
       [39, 15]])
In [33]: n = 5; Z = np.reshape(np.arange(n**2), (n, n));
In [34]: k = 2;
In [35]: np.add.reduceat(np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0), np.arange(0, Z.shape[1], k), axis=1)
Out[35]:
array([[12, 20, 13],
       [52, 60, 33],
       [41, 45, 24]])
In [36]: k = 3;
In [37]: np.add.reduceat(np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0), np.arange(0, Z.shape[1], k), axis=1)
Out[37]:
array([[ 54,  51],
       [111,  84]])
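Nothing in the NumPy version actually requires a square matrix: the two `reduceat` calls use `Z.shape[0]` and `Z.shape[1]` independently. The pattern can be wrapped in a small helper (hypothetical name `block_sum`) that also handles rectangular input:

```python
import numpy as np

def block_sum(Z, k):
    """Sum over k-by-k blocks (hypothetical helper name); trailing
    blocks are smaller when a dimension is not divisible by k."""
    rows = np.arange(0, Z.shape[0], k)
    cols = np.arange(0, Z.shape[1], k)
    return np.add.reduceat(np.add.reduceat(Z, rows, axis=0), cols, axis=1)

Z = np.arange(12).reshape(3, 4)   # a rectangular example
block_sum(Z, 2)
# → array([[10, 18],
#          [17, 21]])
```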
In Julia:
julia> function addreduceat(Z, k)
           inds = Iterators.partition(axes(Z, 1), k)
           [sum(@view Z[CartesianIndices((I, J))]) for I in inds, J in inds]
       end
addreduceat (generic function with 2 methods)
julia> n = 4; Z = collect(reshape(0:n^2-1, n, n)');
julia> addreduceat(Z, 2)
2×2 Matrix{Int64}:
10 18
42 50
julia> addreduceat(Z, 3)
2×2 Matrix{Int64}:
45 21
39 15
julia> n = 5; Z = collect(reshape(0:n^2-1, n, n)');
julia> addreduceat(Z, 2)
3×3 Matrix{Int64}:
12 20 13
52 60 33
41 45 24
julia> addreduceat(Z, 3)
2×2 Matrix{Int64}:
54 51
111 84
Interestingly, here are the execution times (both for the n = 5, k = 3 case):
In [38]: %timeit np.add.reduceat(np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0), np.arange(0, Z.shape[1], k), axis=1)
7.04 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
julia> @btime addreduceat($Z, 3);
243.859 ns (1 allocation: 112 bytes)
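As an aside (not something the snippets above use), when k divides both dimensions the same block sum can be computed in NumPy with a reshape and a sum over the intra-block axes, which is a common alternative to `reduceat`:

```python
import numpy as np

n, k = 4, 2
Z = np.arange(n * n).reshape(n, n)

# Split each axis into (number of blocks, block size) and sum over the
# two block-size axes; requires k to divide both dimensions.
Z.reshape(n // k, k, n // k, k).sum(axis=(1, 3))
# → array([[10, 18],
#          [42, 50]])
```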