I’m trying to translate into Julia the doubleCenter function from this article. The function takes a matrix and doubly centers it: from each element it subtracts the mean of that element’s row and the mean of its column, and finally adds back the mean of all the entries.
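In other words, the target is B[i,j] = M[i,j] - rowmean(i) - colmean(j) + grandmean. A direct baseline with Statistics.mean and broadcasting (useful for checking correctness, not for speed; the name double_center_ref is just mine):

```julia
using Statistics

# Reference implementation of double centering.
# mean(M, dims=2) is an m×1 column of row means, broadcast down each row;
# mean(M, dims=1) is a 1×n row of column means, broadcast across each column.
double_center_ref(M) = M .- mean(M, dims=2) .- mean(M, dims=1) .+ mean(M)
```

A doubly centered matrix has (numerically) zero row and column sums, which makes it easy to sanity-check any faster version.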
So far, I came up with this implementation, but I need to make it faster since it is in a critical loop:

using Distributions, BenchmarkTools
# pre allocate the array where the result will be stored
double_centered = Array{Float64,2}(undef, 1000, 1000)
# create fake data
const S = rand(Uniform(1.0, 10.0), 1000, 1000)
# function to be optimized
function double_center!(double_centered::Array{Float64,2}, M::Array{Float64,2})
    # element-wise mean of the row slices, i.e. the column means (length = size(M, 2))
    @views col_means = mean([M[i,:] for i in 1:size(M, 1)])
    # element-wise mean of the column slices, i.e. the row means (length = size(M, 1))
    @views row_means = mean([M[:,j] for j in 1:size(M, 2)])
    # evaluate the grand mean of the matrix
    grand_mean::Float64 = mean(M)
    # perform double centering (explicit dots: @. would also dot the adjoint ')
    @views double_centered .= M .- col_means' .- row_means .+ grand_mean
end
# benchmark
@btime double_center!(double_centered, S);

2.775 ms (2006 allocations: 15.59 MiB)

I did my best, but couldn’t really make it much faster than this. Could you please help me? It would be great to gain at least an order of magnitude or two.

Multi-threading is nice, but it depends on how many cores you have available. We can also get rid of an entire iteration by computing both means jointly (and the grand mean from one of them):

julia> function double_center!5(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           rowmeans = zeros(Float64, m)
           colmeans = zeros(Float64, n)
           for j in 1:n, i in 1:m
               rowmeans[i] += M[i,j]
               colmeans[j] += M[i,j]
           end
           rowmeans ./= n   # turn the accumulated sums into means
           colmeans ./= m
           grandmean = mean(rowmeans)
           double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!5 (generic function with 1 method)
julia> function double_center!6(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           rowmeans = zeros(Float64, m)
           colmeans = zeros(Float64, n)
           for (j, col) in enumerate(eachcol(M))
               rowmeans .+= col
               colmeans[j] = mean(col)
           end
           grandmean = mean(rowmeans)
           double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!6 (generic function with 1 method)

For comparison:

julia> @btime double_center!4(double_centered, S);
2.788 ms (73 allocations: 21.09 KiB)
julia> @btime double_center!(double_centered, S);
11.016 ms (2006 allocations: 15.59 MiB)
julia> @btime double_center!2(double_centered, S);
4.047 ms (18 allocations: 17.14 KiB)
julia> @btime double_center!4(double_centered, S);
2.789 ms (73 allocations: 21.09 KiB) # because my laptop has only 4 threads :(
julia> @btime double_center!5(double_centered, S);
5.571 ms (2 allocations: 15.88 KiB)
julia> @btime double_center!6(double_centered, S);
2.771 ms (2 allocations: 15.88 KiB)

I think double_center!6 may need rowmeans ./= n, and could use @inbounds. In all of these, most of the time seems to be spent on the last line:

julia> @btime double_center!6(double_centered, S);
1.947 ms (2 allocations: 15.88 KiB)
julia> @btime double_center!6_minus(double_centered, S); # without the last line
409.064 μs (2 allocations: 15.88 KiB)
julia> @btime double_center!4_minus(double_centered, S); # without the last line
358.813 μs (24 allocations: 18.41 KiB)
julia> @btime mean($S, dims=2); # just one mean, the slower direction
322.528 μs (11 allocations: 8.61 KiB)
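For concreteness, here is double_center!6 with both of those suggestions applied (the name double_center!6_fixed is mine, and this is a sketch rather than the poster’s exact code):

```julia
using Statistics

# double_center!6 with the two fixes above: normalize the accumulated
# row sums (./= n) and mark the hot loop @inbounds.
function double_center!6_fixed(double_centered::Matrix{Float64}, M::Matrix{Float64})
    m, n = size(M)
    rowmeans = zeros(Float64, m)
    colmeans = zeros(Float64, n)
    @inbounds for (j, col) in enumerate(eachcol(M))
        rowmeans .+= col          # accumulate row sums, one column at a time
        colmeans[j] = mean(col)   # column means come out directly
    end
    rowmeans ./= n                # row sums -> row means
    grandmean = mean(rowmeans)
    double_centered .= M .- rowmeans .- colmeans' .+ grandmean
    return double_centered
end
```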

Yes, but it’s tripping over its own feet here, I think. Setting Tullio.TILE[] = 10_000 to disable its attempt at tiled memory access (an internal option which might change) brings this down to 387.131 μs, about half the @avx time, on a 2-core computer.

Well spotted! I had thought about that and forgot to add it, sorry. @inbounds is a good idea too; I may as well try adding multi-threading on top of that and see what happens.

I tried to combine the iterate-only-once approach with multi-threading using FLoops.jl:

julia> using FLoops

julia> function dc!7minus(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           @floop for col in eachcol(M)
               local cm = mean(col)  # renamed from `m`, which shadowed size(M)'s m
               @reduce() do (rowmeans = zeros(Float64, m); col), (colmeans = Float64[]; cm)
                   # append! works both for a single column mean and for merging
                   # another task's colmeans vector during the reduction
                   colmeans = append!(colmeans, cm)
                   rowmeans .+= col
               end
           end
           grandmean = mean(colmeans)
           rowmeans ./= n
           (rowmeans, colmeans, grandmean)
       end
dc!7minus (generic function with 1 method)

These are all as above, with the final line removed each time, as suggested by @mcabbott. Combining this with one of the fast versions of the last line should give a nice overall result at last. I also just discovered that I ran with nthreads == 1, which explains the missing speedup; will retry.

EDIT: Using 4 threads instead of 1, and fixing the mutation in the last line to use @strided with a reshaped rather than transposed rowmeans, all approaches seem to be more or less on par. Someone with a less pitiful setup than mine may want to look at the scaling with more cores, but it seems fine to me as is.

julia> @btime double_center!(double_centered, S);
14.453 ms (2058 allocations: 15.60 MiB)
julia> @btime double_center!6(double_centered, S);
2.110 ms (50 allocations: 19.38 KiB)
julia> @btime double_center!4(double_centered, S);
2.041 ms (89 allocations: 22.53 KiB)
julia> @btime double_center!7(double_centered, S);
2.041 ms (285 allocations: 77.53 KiB)
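Putting the pieces together, here is a sketch of the combined approach using only the Threads standard library (the name double_center_threaded! and the exact loop structure are my own, not taken from this thread): one pass over M for both means, then a threaded write for the final line that dominated the timings above.

```julia
using Statistics
using Base.Threads

# Single pass over M accumulates row sums and column means (column-major
# order, so the inner loop walks down each column); the final write is
# parallelized over columns.
function double_center_threaded!(out::Matrix{Float64}, M::Matrix{Float64})
    m, n = size(M)
    rowmeans = zeros(Float64, m)
    colmeans = zeros(Float64, n)
    @inbounds for j in 1:n
        s = 0.0
        for i in 1:m
            x = M[i, j]
            rowmeans[i] += x
            s += x
        end
        colmeans[j] = s / m
    end
    rowmeans ./= n                     # row sums -> row means
    grandmean = mean(colmeans)
    @threads for j in 1:n
        cj = colmeans[j] - grandmean   # fold the grand mean into the column term
        @inbounds for i in 1:m
            out[i, j] = M[i, j] - rowmeans[i] - cj
        end
    end
    return out
end
```

Start Julia with julia -t 4 (or similar) for the @threads loop to actually use multiple cores; with one thread it falls back to the serial timings seen above.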