Help with increasing performance of specific function

claudio20497 · January 21, 2021, 6:07pm

Hello,

I’m trying to translate in Julia the doubleCenter function from this article. The function takes a matrix as an argument, and doubly centers it, meaning that it subtracts to each element the mean of its row and the mean of its column and finally adds the mean of all the entries.
So far, I came up with this implementation, but I need to make it faster since it is in a critical loop:

using Distributions, BenchmarkTools

# pre allocate the array where the result will be stored
double_centered = Array{Float64,2}(undef, 1000, 1000)

# create fake data
const S = rand(Uniform(1.0, 10.0), 1000, 1000)

# function to be optimized
function double_center!(double_centered::Array{Float64,2}, M::Array{Float64,2})

    # evaluate the averages of all rows
    @views row_means = mean([M[i,:] for i in 1:size(M, 1) ])

    # evaluate the averages of all columns
    @views col_means =  mean([M[:,j] for j in 1:size(M, 2) ])

    # evaluate the mean of the matrix
    grand_mean::Float64 = mean(M)

    # perform double centering
    @. @views double_centered = ((M - row_means' ) - col_means) + grand_mean

end

# benchmark
@btime double_center!(double_centered, S);

2.775 ms (2006 allocations: 15.59 MiB)

I did my best, but couldn’t really write it much faster than this. Could you please help me? It would be great to save at least an order or two of magnitude.

Thanks in advance

Oscar_Smith · January 21, 2021, 6:19pm

using mean(M, dims=1) and mean(M, dims=2) makes this about 2x faster on my computer.

nilshg · January 21, 2021, 6:27pm

function double_center!2(double_centered::Array{Float64,2}, M::Array{Float64,2})
    @views double_centered .= ((M .- mean(M, dims = 1) ) .- mean(M, dims = 2)) .+ mean(M)
end

yields:

9.021 ms (2006 allocations: 15.59 MiB) # original
3.409 ms (18 allocations: 17.14 KiB) # update

Edit: Yes, what Oscar says

mcabbott · January 21, 2021, 6:56pm

You can save a little more by computing the overall mean from one of the others. And can try multi-threading?

julia> @btime double_center!(double_centered, S);
  5.081 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
  2.433 ms (18 allocations: 17.14 KiB)

julia> function double_center!3(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m1 = mean(M, dims = 1)
           m_all = mean(m1)
           double_centered .= M .- m1 .- mean(M, dims = 2) .+ m_all
       end;

julia> @btime double_center!3(double_centered, S);
  2.138 ms (18 allocations: 17.14 KiB)

julia> using Strided

julia> function double_center!4(double_centered::Array{Float64,2}, M::Array{Float64,2})
           m2 = nothing
           task = Threads.@spawn m2 = mean(M, dims = 2)
           m1 = mean(M, dims = 1)
           m_all = mean(m1)
           wait(task)
           @strided double_centered .= M .- m1 .- m2 .+ m_all
       end;

julia> @btime double_center!4(double_centered, S);
  928.991 μs (92 allocations: 24.34 KiB)

lmiq · January 21, 2021, 7:06pm

Is that a use case for Tullio?

FPGro · January 21, 2021, 7:23pm

Multi-threaded is nice but depends on how many cores you have available. We can get rid of an entire iteration by computing both means jointly (and the grand mean from one of those):

julia> function double_center!5(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         rowmeans = zeros(Float64, m)
         colmeans = zeros(Float64, n)
         for j in 1:n, i in 1:m
           rowmeans[i] += M[i,j]
           colmeans[j] += M[i,j]
         end
         grandmean = mean(rowmeans)
         double_centered .= M .- rowmeans' .- colmeans .+ grandmean
       end
double_center!5 (generic function with 1 method)

julia> function double_center!6(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         rowmeans = zeros(Float64, m)
         colmeans = zeros(Float64, n)
         for (i,col) in enumerate(eachcol(M))
           rowmeans .+= col
           colmeans[i] = mean(col)
         end
         grandmean = mean(rowmeans)
         double_centered .= M .- rowmeans' .- colmeans .+ grandmean
       end
double_center!6 (generic function with 1 method)

For:

julia> @btime double_center!4(double_centered, S);
  2.788 ms (73 allocations: 21.09 KiB)

julia> @btime double_center!(double_centered, S);
  11.016 ms (2006 allocations: 15.59 MiB)

julia> @btime double_center!2(double_centered, S);
  4.047 ms (18 allocations: 17.14 KiB)

julia> @btime double_center!4(double_centered, S);
  2.789 ms (73 allocations: 21.09 KiB) # because my laptop has only 4 threads :(

julia> @btime double_center!5(double_centered, S);
  5.571 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6(double_centered, S);
  2.771 ms (2 allocations: 15.88 KiB)

mcabbott · January 21, 2021, 7:49pm

I think double_center!6 may need rowmeans ./= n, and could use @inbounds. In all of these, most of the time seems to be on the last line:

julia> @btime double_center!6(double_centered, S);
  1.947 ms (2 allocations: 15.88 KiB)

julia> @btime double_center!6_minius(double_centered, S); # without the last line
  409.064 μs (2 allocations: 15.88 KiB)

julia> @btime double_center!4_minus(double_centered, S);  # without the last line
  358.813 μs (24 allocations: 18.41 KiB)

julia> @btime mean($S, dims=2);  # just one mean, the slower direction
  322.528 μs (11 allocations: 8.61 KiB)

Last line alone:

julia> v1 = rand(1000); v2 = rand(1000); v3= rand();

julia> @btime $double_centered .= $S .- $v1 .- $v2 .+ $v3;
  1.372 ms (0 allocations: 0 bytes)

julia> using Strided, LoopVectorization, Tullio

julia> @btime @strided $double_centered .= $S .- $v1' .- $v2 .+ $v3;
  1.527 ms (4 allocations: 160 bytes)

julia> @btime @avx $double_centered .= $S .- $v1' .- $v2 .+ $v3;
  605.135 μs (0 allocations: 0 bytes)

julia> @btime @tullio $double_centered[i,j] = $S[i,j] - $v1[j] - $v2[i] + $v3;
  524.520 μs (49 allocations: 4.78 KiB)

julia> v1r = reshape(v1,1,:); # strangely affects @strided strongly

julia> @btime @strided $double_centered .= $S .- $v1r .- $v2 .+ $v3;
  501.149 μs (54 allocations: 4.56 KiB)

Yes, but it’s tripping over its own feet here, I think. Adjusting Tullio.TILE[] = 10_000 to disable its attempt at tiled access (via an internal option which might change) brings this to 387.131 μs. About half the @avx time, on a 2-core computer.

FPGro · January 21, 2021, 7:52pm

Well spotted, totally thought about that and forget to add it, sorry! @inbounds is a good idea, may as well try to add multithreading to the basis of that and see what happens

FPGro · January 21, 2021, 10:44pm

Tried to combine the iter-only-once approach with multi-threading with FLoops.jl:

julia> function dc!7minus(double_centered::Matrix{Float64}, M::Matrix{Float64})
         m,n = size(M)
         @floop for col in eachcol(M)
           local m = mean(col)
           @reduce() do (rowmeans=zeros(Float64,m); col), (colmeans=Float64[]; m)
           colmeans = append!(colmeans, m)
           rowmeans .+= col
           end
         end
         grandmean = mean(colmeans)
         rowmeans ./= n
         (rowmeans, colmeans, grandmean)
       end

For the modest result of:

julia> @btime dc!minus(double_centered, S);
  8.447 ms (2006 allocations: 15.59 MiB)

julia> @btime dc!4minus(double_centered, S);
  1.051 ms (26 allocations: 17.70 KiB)

julia> @btime dc!6minus(double_centered, S);
  668.100 μs (2 allocations: 15.88 KiB)

julia> @btime dc!7minus(double_centered, S);
  806.000 μs (26 allocations: 24.92 KiB)

These are all as above with the final line removed each time as suggested by @mcabbott. Combining this with one of the fast versions for the last line should give a nice overall result at last. I also just discovered that I ran with nthreads == 1 which explains the missing speedup, will retry.

EDIT: Using 4 threads instead of 1 and fixing the mutation in the last line to use @strided and reshaped instead of transposed rowmean, all approaches seem to be more or less on par. Someone with a less pitiful setup than mine may take a look at the scaling with more cores, but it seems fine to me as is.

julia> @btime double_center!(double_centered, S);
  14.453 ms (2058 allocations: 15.60 MiB)

julia> @btime double_center!6(double_centered, S);
  2.110 ms (50 allocations: 19.38 KiB)

julia> @btime double_center!4(double_centered, S);
  2.041 ms (89 allocations: 22.53 KiB)

julia> @btime double_center!7(double_centered, S);
  2.041 ms (285 allocations: 77.53 KiB)

Topic		Replies	Views
Improving Speed for a Novice Performance	7	2198	May 14, 2018
Speeding up my code Performance	7	754	February 23, 2022
C implementation of function being ~4 times faster even absence of allocs Performance ccall	28	1381	March 5, 2025
Speed up column looping in matrix Performance	6	756	July 26, 2021
How can i improve the performance of the code? Performance question , performance	9	368	March 28, 2023

Help with increasing performance of specific function

Related topics