# Help with increasing performance of specific function

Hello,

I’m trying to translate the `doubleCenter` function from this article into Julia. The function takes a matrix as its argument and doubly centers it, meaning that it subtracts from each element the mean of its row and the mean of its column, and finally adds back the mean of all the entries.
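For concreteness, here is the operation on a tiny example (a naive sketch using `mean` from the Statistics stdlib, not the optimized version; the real input is 1000×1000):

``````julia
using Statistics

A = [1.0 2.0 6.0;
     3.0 5.0 7.0]

# subtract each element's row mean and column mean, then add back the grand mean
B = A .- mean(A, dims=2) .- mean(A, dims=1) .+ mean(A)

# after double centering, every row and every column of B sums to zero
``````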
So far, I came up with this implementation, but I need to make it faster since it is in a critical loop:

``````julia
using Distributions, BenchmarkTools

# preallocate the array where the result will be stored
double_centered = Array{Float64,2}(undef, 1000, 1000)

# create fake data
const S = rand(Uniform(1.0, 10.0), 1000, 1000)

# function to be optimized
function double_center!(double_centered::Array{Float64,2}, M::Array{Float64,2})

    # row means: averaging the column slices elementwise gives a length-m vector
    @views row_means = mean([M[:, j] for j in 1:size(M, 2)])

    # column means: averaging the row slices elementwise gives a length-n vector
    @views col_means = mean([M[i, :] for i in 1:size(M, 1)])

    # mean of all the entries
    grand_mean = mean(M)

    # double centering: subtract row and column means, add back the grand mean
    @. double_centered = M - row_means - col_means' + grand_mean

end

# benchmark
@btime double_center!(double_centered, S);
``````
``````
2.775 ms (2006 allocations: 15.59 MiB)
``````

I did my best, but couldn’t really make it much faster than this. Could you please help me? It would be great to gain at least an order of magnitude or two.


Using `mean(M, dims=1)` and `mean(M, dims=2)` makes this about 2x faster on my computer.

``````julia
function double_center!2(double_centered::Array{Float64,2}, M::Array{Float64,2})
    @views double_centered .= (M .- mean(M, dims = 1)) .- mean(M, dims = 2) .+ mean(M)
end
``````

yields:

``````
9.021 ms (2006 allocations: 15.59 MiB) # original
3.409 ms (18 allocations: 17.14 KiB) # update
``````

Edit: Yes, what Oscar says.

You can save a little more by computing the overall mean from one of the others. And you could try multi-threading:

``````julia
julia> @btime double_center!(double_centered, S);
5.081 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
2.433 ms (18 allocations: 17.14 KiB)

julia> function double_center!3(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m1 = mean(M, dims = 1)
           m_all = mean(m1)
           double_centered .= M .- m1 .- mean(M, dims = 2) .+ m_all
       end;

julia> @btime double_center!3(double_centered, S);
2.138 ms (18 allocations: 17.14 KiB)

julia> using Strided

julia> function double_center!4(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m1 = mean(M, dims = 1)
           m2 = mean(M, dims = 2)
           m_all = mean(m1)
           @strided double_centered .= M .- m1 .- m2 .+ m_all
       end;

julia> @btime double_center!4(double_centered, S);
928.991 μs (92 allocations: 24.34 KiB)
``````

Is that a use case for Tullio?
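A minimal sketch of what that could look like, assuming the means are precomputed as plain vectors (untested here):

``````julia
using Statistics, Tullio

S = rand(1000, 1000)
out = similar(S)
rowmeans = vec(mean(S, dims=2))   # length-1000 vector of row means
colmeans = vec(mean(S, dims=1))   # length-1000 vector of column means
g = mean(colmeans)                # grand mean

# one fused pass over the matrix; Tullio multi-threads large loops by default
@tullio out[i, j] = S[i, j] - rowmeans[i] - colmeans[j] + g
``````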


Multi-threading is nice, but it depends on how many cores you have available. We can get rid of an entire pass over the matrix by computing both means jointly (and the grand mean from one of them):

``````julia
julia> function double_center!5(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           rowmeans = zeros(Float64, m)
           colmeans = zeros(Float64, n)
           for j in 1:n, i in 1:m
               rowmeans[i] += M[i,j]
               colmeans[j] += M[i,j]
           end
           rowmeans ./= n
           colmeans ./= m
           grandmean = mean(rowmeans)
           double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!5 (generic function with 1 method)

julia> function double_center!6(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           rowmeans = zeros(Float64, m)
           colmeans = zeros(Float64, n)
           for (j, col) in enumerate(eachcol(M))
               rowmeans .+= col
               colmeans[j] = mean(col)
           end
           grandmean = mean(rowmeans)
           double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!6 (generic function with 1 method)
``````

For comparison:

``````julia
julia> @btime double_center!4(double_centered, S);
2.788 ms (73 allocations: 21.09 KiB)

julia> @btime double_center!(double_centered, S);
11.016 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
4.047 ms (18 allocations: 17.14 KiB)

julia> @btime double_center!4(double_centered, S);
2.789 ms (73 allocations: 21.09 KiB) # because my laptop has only 4 threads :(

julia> @btime double_center!5(double_centered, S);
5.571 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6(double_centered, S);
2.771 ms (2 allocations: 15.88 KiB)
``````

I think `double_center!6` may need `rowmeans ./= n`, and could use `@inbounds`. In all of these, most of the time seems to be spent on the last line:

``````julia
julia> @btime double_center!6(double_centered, S);
1.947 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6_minus(double_centered, S); # without the last line
409.064 μs (2 allocations: 15.88 KiB)

julia> @btime double_center!4_minus(double_centered, S);  # without the last line
358.813 μs (24 allocations: 18.41 KiB)

julia> @btime mean($S, dims=2);  # just one mean, the slower direction
322.528 μs (11 allocations: 8.61 KiB)
``````
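With both of those changes applied, a sketch of the corrected version (the name `double_center6_fixed!` is mine):

``````julia
using Statistics

function double_center6_fixed!(out::Matrix{Float64}, M::Matrix{Float64})
    m, n = size(M)
    rowmeans = zeros(m)
    colmeans = zeros(n)
    @inbounds for (j, col) in enumerate(eachcol(M))
        rowmeans .+= col          # row *sums* here, normalized below
        colmeans[j] = mean(col)
    end
    rowmeans ./= n                # the missing normalization
    grandmean = mean(rowmeans)
    out .= M .- rowmeans .- colmeans' .+ grandmean
end
``````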

Last line alone:

``````julia
julia> v1 = rand(1000); v2 = rand(1000); v3 = rand();

julia> @btime $double_centered .= $S .- $v1' .- $v2 .+ $v3;
1.372 ms (0 allocations: 0 bytes)

julia> using Strided, LoopVectorization, Tullio

julia> @btime @strided $double_centered .= $S .- $v1' .- $v2 .+ $v3;
1.527 ms (4 allocations: 160 bytes)

julia> @btime @avx $double_centered .= $S .- $v1' .- $v2 .+ $v3;
605.135 μs (0 allocations: 0 bytes)

julia> @btime @tullio $double_centered[i,j] = $S[i,j] - $v1[j] - $v2[i] + $v3;
524.520 μs (49 allocations: 4.78 KiB)

julia> v1r = reshape(v1, 1, :);  # strangely, this affects @strided strongly

julia> @btime @strided $double_centered .= $S .- $v1r .- $v2 .+ $v3;
501.149 μs (54 allocations: 4.56 KiB)
``````

Yes, but it’s tripping over its own feet here, I think. Setting `Tullio.TILE[] = 10_000` to disable its attempt at tiled access (an internal option which might change) brings this down to `387.131 μs`, about half the `@avx` time, on a 2-core computer.
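For reference, the internal knob mentioned above is used like this (it is not part of Tullio’s public API, so it may change between versions):

``````julia
using Tullio

S = rand(1000, 1000); v1 = rand(1000); v2 = rand(1000); v3 = rand()
double_centered = similar(S)

Tullio.TILE[] = 10_000   # a huge tile size effectively disables tiled access
@tullio double_centered[i, j] = S[i, j] - v1[j] - v2[i] + v3
``````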


Well spotted! I totally thought about that and forgot to add it, sorry. `@inbounds` is a good idea; might as well try adding multi-threading on top of that and see what happens.

I tried to combine the iterate-only-once approach with multi-threading using FLoops.jl:

``````julia
julia> using FLoops

julia> function dc!7minus(double_centered::Matrix{Float64}, M::Matrix{Float64})
           m, n = size(M)
           @floop for col in eachcol(M)
               cm = mean(col)
               @reduce() do (rowmeans = zeros(Float64, m); col), (colmeans = Float64[]; cm)
                   colmeans = append!(colmeans, cm)
                   rowmeans .+= col
               end
           end
           grandmean = mean(colmeans)
           rowmeans ./= n
           (rowmeans, colmeans, grandmean)
       end
``````

For the modest result of:

``````julia
julia> @btime dc!minus(double_centered, S);
8.447 ms (2006 allocations: 15.59 MiB)

julia> @btime dc!4minus(double_centered, S);
1.051 ms (26 allocations: 17.70 KiB)

julia> @btime dc!6minus(double_centered, S);
668.100 μs (2 allocations: 15.88 KiB)

julia> @btime dc!7minus(double_centered, S);
806.000 μs (26 allocations: 24.92 KiB)
``````

These are all the versions from above with the final line removed each time, as suggested by @mcabbott. Combining this with one of the fast versions of the last line should finally give a nice overall result. I also just discovered that I ran with `nthreads() == 1`, which explains the missing speedup; will retry.

EDIT: Using 4 threads instead of 1, and fixing the mutation in the last line to use `@strided` with a reshaped rather than a transposed rowmean, all approaches seem to be more or less on par. Someone with a less pitiful setup than mine may want to take a look at the scaling with more cores, but it seems fine to me as is.

``````julia
julia> @btime double_center!(double_centered, S);
14.453 ms (2058 allocations: 15.60 MiB)

julia> @btime double_center!6(double_centered, S);
2.110 ms (50 allocations: 19.38 KiB)

julia> @btime double_center!4(double_centered, S);
2.041 ms (89 allocations: 22.53 KiB)

julia> @btime double_center!7(double_centered, S);
2.041 ms (285 allocations: 77.53 KiB)
``````
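Putting the pieces together, here is one possible combined version for anyone landing here later: single-pass means with `@inbounds`, the grand mean reused from the row means, and the final broadcast handed to LoopVectorization’s `@avx` (the name is mine, timings will vary by machine; newer LoopVectorization versions spell the macro `@turbo`):

``````julia
using Statistics, LoopVectorization

function double_center_combined!(out::Matrix{Float64}, M::Matrix{Float64})
    m, n = size(M)
    rowmeans = zeros(m)
    colmeans = zeros(n)
    # one pass over M computes both sets of means
    @inbounds for (j, col) in enumerate(eachcol(M))
        rowmeans .+= col
        colmeans[j] = mean(col)
    end
    rowmeans ./= n
    grandmean = mean(rowmeans)    # grand mean recovered from the row means
    # vectorized final line, as benchmarked above
    @avx out .= M .- rowmeans .- colmeans' .+ grandmean
    return out
end
``````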