Help with increasing performance of specific function


I’m trying to translate the doubleCenter function from this article into Julia. The function takes a matrix and doubly centers it: from each element it subtracts the mean of that element’s row and the mean of its column, and then adds back the mean of all the entries.
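A quick way to sanity-check the definition on a small matrix (every row and every column of a doubly centered matrix should sum to zero; the matrix here is just made-up sample data):

```julia
using Statistics  # for mean

# B[i, j] = M[i, j] - (mean of row i) - (mean of column j) + (grand mean)
M = [1.0 2.0; 3.0 5.0]
B = M .- mean(M, dims = 2) .- mean(M, dims = 1) .+ mean(M)
# every row and column of B now sums to (numerically) zero
```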
So far, I came up with this implementation, but I need to make it faster since it is in a critical loop:

using Distributions, BenchmarkTools

# pre allocate the array where the result will be stored
double_centered = Array{Float64,2}(undef, 1000, 1000)

# create fake data
const S = rand(Uniform(1.0, 10.0), 1000, 1000)

# function to be optimized
function double_center!(double_centered::Array{Float64,2}, M::Array{Float64,2})

    # evaluate the averages of all rows
    @views row_means = mean([M[i,:] for i in 1:size(M, 1) ])

    # evaluate the averages of all columns
    @views col_means =  mean([M[:,j] for j in 1:size(M, 2) ])

    # evaluate the mean of the matrix
    grand_mean::Float64 = mean(M)

    # perform double centering
    @. @views double_centered = ((M - row_means') - col_means) + grand_mean

    return double_centered
end


# benchmark
@btime double_center!(double_centered, S);
2.775 ms (2006 allocations: 15.59 MiB)

I did my best, but couldn’t really make it much faster than this. Could you please help me? It would be great to gain at least an order of magnitude or two.

Thanks in advance


Using mean(M, dims=1) and mean(M, dims=2) makes this about 2x faster on my computer:

function double_center!2(double_centered::Array{Float64,2}, M::Array{Float64,2})
    @views double_centered .= ((M .- mean(M, dims = 1)) .- mean(M, dims = 2)) .+ mean(M)
end


9.021 ms (2006 allocations: 15.59 MiB) # original
3.409 ms (18 allocations: 17.14 KiB) # update

Edit: Yes, what Oscar says :smiley:


You can save a little more by computing the overall mean from one of the others. And you can try multi-threading:

julia> @btime double_center!(double_centered, S);
  5.081 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
  2.433 ms (18 allocations: 17.14 KiB)

julia> function double_center!3(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m1 = mean(M, dims = 1)
           m_all = mean(m1)
           double_centered .= M .- m1 .- mean(M, dims = 2) .+ m_all
       end

julia> @btime double_center!3(double_centered, S);
  2.138 ms (18 allocations: 17.14 KiB)

julia> using Strided

julia> function double_center!4(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m2 = nothing
           task = Threads.@spawn m2 = mean(M, dims = 2)
           m1 = mean(M, dims = 1)
           m_all = mean(m1)
           wait(task)  # make sure the spawned mean(M, dims = 2) has finished before m2 is used
           @strided double_centered .= M .- m1 .- m2 .+ m_all
       end

julia> @btime double_center!4(double_centered, S);
  928.991 μs (92 allocations: 24.34 KiB)

Is that a use case for Tullio?


Multi-threading is nice, but it depends on how many cores you have available. We can get rid of an entire pass over the matrix by computing both means jointly (and the grand mean from one of them):

julia> function double_center!5(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         rowmeans = zeros(Float64, m)
         colmeans = zeros(Float64, n)
         for j in 1:n, i in 1:m
           rowmeans[i] += M[i,j]
           colmeans[j] += M[i,j]
         end
         rowmeans ./= n
         colmeans ./= m
         grandmean = mean(rowmeans)
         double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!5 (generic function with 1 method)

julia> function double_center!6(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         rowmeans = zeros(Float64, m)
         colmeans = zeros(Float64, n)
         for (i,col) in enumerate(eachcol(M))
           rowmeans .+= col
           colmeans[i] = mean(col)
         end
         grandmean = mean(rowmeans)
         double_centered .= M .- rowmeans .- colmeans' .+ grandmean
       end
double_center!6 (generic function with 1 method)


julia> @btime double_center!4(double_centered, S);
  2.788 ms (73 allocations: 21.09 KiB)

julia> @btime double_center!(double_centered, S);
  11.016 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
  4.047 ms (18 allocations: 17.14 KiB)

julia> @btime double_center!4(double_centered, S);
  2.789 ms (73 allocations: 21.09 KiB) # because my laptop has only 4 threads :(

julia> @btime double_center!5(double_centered, S);
  5.571 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6(double_centered, S);
  2.771 ms (2 allocations: 15.88 KiB)

I think double_center!6 may need rowmeans ./= n, and could use @inbounds. In all of these, most of the time seems to be on the last line:

julia> @btime double_center!6(double_centered, S);
  1.947 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6_minus(double_centered, S); # without the last line
  409.064 μs (2 allocations: 15.88 KiB)

julia> @btime double_center!4_minus(double_centered, S);  # without the last line
  358.813 μs (24 allocations: 18.41 KiB)

julia> @btime mean($S, dims=2);  # just one mean, the slower direction
  322.528 μs (11 allocations: 8.61 KiB)

Last line alone:

julia> v1 = rand(1000); v2 = rand(1000); v3 = rand();

julia> @btime $double_centered .= $S .- $v1' .- $v2 .+ $v3;
  1.372 ms (0 allocations: 0 bytes)

julia> using Strided, LoopVectorization, Tullio

julia> @btime @strided $double_centered .= $S .- $v1' .- $v2 .+ $v3;
  1.527 ms (4 allocations: 160 bytes)

julia> @btime @avx $double_centered .= $S .- $v1' .- $v2 .+ $v3;
  605.135 μs (0 allocations: 0 bytes)

julia> @btime @tullio $double_centered[i,j] = $S[i,j] - $v1[j] - $v2[i] + $v3;
  524.520 μs (49 allocations: 4.78 KiB)

julia> v1r = reshape(v1,1,:); # strangely affects @strided strongly

julia> @btime @strided $double_centered .= $S .- $v1r .- $v2 .+ $v3;
  501.149 μs (54 allocations: 4.56 KiB)

Yes, but it’s tripping over its own feet here, I think. Adjusting Tullio.TILE[] = 10_000 to disable its attempt at tiled access (via an internal option which might change) brings this to 387.131 μs. About half the @avx time, on a 2-core computer.
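Concretely, that adjustment can be tried like this (TILE is an internal, undocumented knob, as noted, and may change between Tullio versions; the sizes here are illustrative):

```julia
using Tullio

# Internal option: a huge tile size effectively disables Tullio's
# tiled memory access for arrays of this size.
Tullio.TILE[] = 10_000

S = rand(100, 100); double_centered = similar(S)
v1 = rand(100); v2 = rand(100); v3 = rand()
@tullio double_centered[i, j] = S[i, j] - v1[j] - v2[i] + v3
```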


Well spotted, I totally thought about that and forgot to add it, sorry! @inbounds is a good idea; might as well add multithreading on top of that and see what happens :slight_smile:


I tried to combine the iterate-only-once approach with multi-threading using FLoops.jl:

julia> using FLoops

julia> function dc!7minus(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         @floop for col in eachcol(M)
           cm = mean(col)  # this column's mean; must not shadow m, the row count
           @reduce() do (rowmeans = zeros(Float64, m); col), (colmeans = Float64[]; cm)
             rowmeans .+= col
             append!(colmeans, cm)
           end
         end
         grandmean = mean(colmeans)
         rowmeans ./= n
         (rowmeans, colmeans, grandmean)
       end

For the modest result of:

julia> @btime dc!minus(double_centered, S);
  8.447 ms (2006 allocations: 15.59 MiB)

julia> @btime dc!4minus(double_centered, S);
  1.051 ms (26 allocations: 17.70 KiB)

julia> @btime dc!6minus(double_centered, S);
  668.100 μs (2 allocations: 15.88 KiB)

julia> @btime dc!7minus(double_centered, S);
  806.000 μs (26 allocations: 24.92 KiB)

These are all as above, with the final line removed each time as suggested by @mcabbott. Combining this with one of the fast versions of the last line should give a nice overall result at last. I also just discovered that I ran with nthreads == 1, which explains the missing speedup; will retry.

EDIT: Using 4 threads instead of 1 and fixing the mutation in the last line to use @strided and reshaped instead of transposed rowmean, all approaches seem to be more or less on par. Someone with a less pitiful setup than mine may take a look at the scaling with more cores, but it seems fine to me as is.

julia> @btime double_center!(double_centered, S);
  14.453 ms (2058 allocations: 15.60 MiB)

julia> @btime double_center!6(double_centered, S);
  2.110 ms (50 allocations: 19.38 KiB)

julia> @btime double_center!4(double_centered, S);
  2.041 ms (89 allocations: 22.53 KiB)

julia> @btime double_center!7(double_centered, S);
  2.041 ms (285 allocations: 77.53 KiB)
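For anyone landing here later, here is a sketch of what the combined approach could look like: the single-pass means from double_center!6 (with the ./= n fix) plus a @tullio kernel for the final line. The name and assembly are mine, not code from the thread:

```julia
using Statistics, Tullio

function double_center_fast!(out::Matrix{Float64}, M::Matrix{Float64})
    m, n = size(M)
    rowmeans = zeros(m)
    colmeans = zeros(n)
    # single pass over M computes both sets of means
    @inbounds for (j, col) in enumerate(eachcol(M))
        rowmeans .+= col
        colmeans[j] = mean(col)
    end
    rowmeans ./= n
    grandmean = mean(colmeans)
    # fused, multi-threaded final line
    @tullio out[i, j] = M[i, j] - rowmeans[i] - colmeans[j] + grandmean
    return out
end
```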