Improve performance of matrix computation


#1

I’m implementing a clustering algorithm where N is the number of samples, D the number of features and K the number of clusters. For each cluster 1 \leq k \leq K, I have to compute a probability that an individual belongs to a specific cluster p(C_n = k | \dots). In a simplified formulation, this amounts to compute:

p(C_n = k | \dots) = \prod_{d = 1}^D p(x_{dn} | \theta_{dk})

where the features are indexed by d, x_{dn} is the data corresponding to the d-th feature of the n-th sample, and p is the probability density function of some distribution with \theta_{dk} the parameters associated to the distribution. To simplify, let’s say that we consider only normal distributions, this reduces to (taking the log):

\log p(C_n = k | \dots) = \sum_{d = 1}^D \log p_{\mathcal{N}(\mu_{dk}, \sigma_{dk})}(x_{dn} | \mu_{dk}, \sigma_{dk})

So my first try at implementing this was the following:

for k = 1:K
    for n = 1:N
        p[n,k] = sum(map(d -> log_normal_pdf(x[d,n], μ[d,k], σ[d,k]), 1:D))
    end
end

where x is a D\times N data matrix, μ and σ are D \times K matrices for the different mean and standard deviations vectors for each cluster (and log_normal_pdf computes the log-pdf of a normal distribution). Unfortunately this was very slow, so I rewrote the log-probability as follows:

\log p(C_n = k | \dots) = - \sum_{d = 1}^D \log \sigma_{dk} - \frac{D}{2} \log(2 \pi) - \frac{1}{2} \sum_{d=1}^D \left(\frac{x_{dn} - \mu_{dk}}{\sigma_{dk}} \right)^2

and made use of the mapslices function:

sum_log_sigma = vec(sum(log.(σ), 1))
for k = 1:K
    p[:,k] = - sum_log_sigma[k] - D / 2 * log(2 * π) - 1 / 2 * vec(sum(mapslices(z -> ((z - vec(μ[:,k])) ./ vec(σ[:,k])) .^ 2, x, 1), 1))
end

to perform vector operations for each cluster k for all the different samples 1 \leq n \leq N at once. It’s better than the previous formulation, in particular with large number of features D, but still too slow for what I’d like to do. I was therefore wondering if there were a cleverer implementation to perform such a task, involving different matrices and where it’s not a straightforward matrix product.

Would that be easy to parallelize the code w.r.t. the different clusters…? I don’t know much about parallelism in Julia I must say.

Anyway, thank you for taking the time to read all this and for any possible suggestions. If something is not clear in my message, I’d be glad to clarify it.

EDIT: following @kristoffer.carlsson’s sensible remark, I provide the following minimal example.
EDIT2: updated the code by also adding a parallel version.
EDIT3: added all the versions proposed in this discussion + another one based on @fastmath

using Base.Test

srand(10)

N = 100
D = 10000
K = 5
x = rand(D, N)
μ = rand(D, K)
σ = rand(D, K)
x0 = x[1:2,1:2]
μ0 = μ[1:2,1:2]
σ0 = σ[1:2,1:2]

log_normal_pdf(x, μ, σ) = - log(σ * sqrt(2 * π)) - (x - μ) ^ 2 / (2 * σ ^ 2)

function f1(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(N, K)
    for k = 1:K
        for n = 1:N
            p[n,k] = sum(map(d -> log_normal_pdf(x[d,n], μ[d,k], σ[d,k]), 1:D))
        end
    end
    return p
end

function f2(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(N, K)
    sum_log_sigma = vec(sum(log.(σ), 1))
    for k = 1:K
        p[:,k] = - sum_log_sigma[k] - D / 2 * log(2 * π) - 1 / 2 * vec(sum(mapslices(z -> ((z - vec(μ[:,k])) ./ vec(σ[:,k])) .^ 2, x, 1), 1))
    end
    return p
end

function f2_parallel(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = SharedArray{Float64}(N, K)
    sum_log_sigma = vec(sum(log.(σ), 1))
    @sync @parallel for k = 1:K
        p[:,k] = - sum_log_sigma[k] - D / 2 * log(2 * π) - 1 / 2 * vec(sum(mapslices(z -> ((z - vec(μ[:,k])) ./ vec(σ[:,k])) .^ 2, x, 1), 1))
    end
    return p
end

function f3(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(K, N)
    sum_log_sigma = vec(sum(log.(σ), 1))
    c = -sum_log_sigma - D / 2 * log(2 * π)
    for j = 1:N
        z = vec(sum(((x[:, j] .- μ) ./ σ) .^ 2, 1))
        p[:, j] .= c - 1 / 2 * z
    end
    return p'
end

function f3_threads(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(K, N)
    sum_log_sigma = vec(sum(log.(σ), 1))
    c = -sum_log_sigma - D / 2 * log(2 * π)
    Threads.@threads for j = 1:N
        z = vec(sum(((x[:, j] .- μ) ./ σ) .^ 2, 1))
        p[:, j] .= c - 1 / 2 * z
    end
    return p'
end

function f4b_mt(x::Matrix, μ::Matrix, σ::Matrix)
    DD, K = size(μ)
    D, N = size(x)
    assert(D == DD)
    p = zeros(N, K)
    cc = 0.5 * log(2.0 * pi) * D
    # Threads.@threads for k = 1:K
    for k = 1:K
        logσ = cc
        iσ = zeros(D)
        @simd for d = 1:D
            logσ += log(σ[d,k])
        end
        @simd for d = 1:D
            iσ[d] = 0.5 / (σ[d,k] * σ[d,k])
        end
        @inbounds for n = 1:N
            pnk = -logσ
            @simd for d = 1:D
                pnk -= (x[d,n] - μ[d,k]) * (x[d,n] - μ[d,k]) * iσ[d]
            end
            p[n,k] = pnk
        end
    end
    return p
end

function f4b_mt_threads(x::Matrix, μ::Matrix, σ::Matrix)
    DD, K = size(μ)
    D, N = size(x)
    assert(D == DD)
    p = zeros(N, K)
    cc = 0.5 * log(2.0 * pi) * D
    Threads.@threads for k = 1:K
        logσ = cc
        iσ = zeros(D)
        @simd for d = 1:D
            logσ += log(σ[d,k])
        end
        @simd for d = 1:D
            iσ[d] = 0.5 / (σ[d,k] * σ[d,k])
        end
        @inbounds for n = 1:N
            pnk = -logσ
            @simd for d = 1:D
                pnk -= (x[d,n] - μ[d,k]) * (x[d,n] - μ[d,k]) * iσ[d]
            end
            p[n,k] = pnk
        end
    end
    return p
end

function f4b_mt_threads_fast(x::Matrix, μ::Matrix, σ::Matrix)
    DD, K = size(μ)
    D, N = size(x)
    assert(D == DD)
    p = zeros(N, K)
    cc = 0.5 * log(2.0 * pi) * D
    Threads.@threads for k = 1:K
        logσ = cc
        iσ = zeros(D)
        @fastmath @simd for d = 1:D
            logσ += log(σ[d,k])
        end
        @fastmath @simd for d = 1:D
            iσ[d] = 0.5 / (σ[d,k] * σ[d,k])
        end
        @fastmath @inbounds for n = 1:N
            pnk = -logσ
            @simd for d = 1:D
                pnk -= (x[d,n] - μ[d,k]) * (x[d,n] - μ[d,k]) * iσ[d]
            end
            p[n,k] = pnk
        end
    end
    return p
end

p10 = f1(x0, μ0, σ0)
p20 = f2(x0, μ0, σ0)
p20_parallel = f2_parallel(x0, μ0, σ0)
p30 = f3(x0, μ0, σ0)
p3_threads0 = f3_threads(x0, μ0, σ0)
p4b_mt0 = f4b_mt(x0, μ0, σ0)
p4b_mt_threads0 = f4b_mt_threads(x0, μ0, σ0)
p4b_mt_threads_fast0 = f4b_mt_threads_fast(x0, μ0, σ0)

function print_name(name, space = 30)
    print(name)
    for i = 1:space-length(name) print(" ") end
end

print_name("f1"); @time p1 = f1(x, μ, σ)
print_name("f2"); @time p2 = f2(x, μ, σ)
print_name("f2_parallel"); @time p2_parallel = f2_parallel(x, μ, σ)
print_name("f3"); @time p3 = f3(x, μ, σ)
print_name("f3_threads"); @time p3_threads = f3_threads(x, μ, σ)
print_name("f4b_mt"); @time p4b_mt = f4b_mt(x, μ, σ)
print_name("f4b_mt_threads"); @time p4b_mt_threads = f4b_mt_threads(x, μ, σ)
print_name("f4b_mt_threads_fast"); @time p4b_mt_threads_fast = f4b_mt_threads_fast(x, μ, σ)

@test p1 ≈ p2 ≈ p2_parallel ≈ p3 ≈ p3_threads ≈ p4b_mt ≈ p4b_mt_threads ≈ p4b_mt_threads_fast

Redefining an integer makes computing time blow up
#2

Please provide enough code so that we can run it (including input arguments to functions). Also it would be good if you showed how you did you benchmarking.


#3

Indeed, I should have done it in the first place, thanks for the reminder! I edited my first message so that it be self-consistent.


#4

One simple way to parallelize it is using @parallel:

@parallel for k = 1:K
        p[:,k] = - sum_log_sigma[k] - D / 2 * log(2 * π) - 1 / 2 * vec(sum(mapslices(z -> ((z - vec(μ[:,k])) ./ vec(σ[:,k])) .^ 2, x, 1), 1))
    end

Don’t forget to add more processes using addprocs(<# of CPUs - 1>). For my quad-core machine it gives ~3.5 speedup (perhaps not 4x because of data transfer).


#5

And even a bit faster version (13.8ms vs 11.1ms on my machine) iterating over observations and vectorizeing over variables and clusters:

function f3(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(K, N)
    sum_log_sigma = vec(sum(log.(σ), 1))
    c = -sum_log_sigma - D / 2 * log(2 * π)
    for j = 1:N
        z = squeeze(sum((x[:, j] .- μ ./ σ) .^ 2, 1), 1)
        p[:, j] .= c - 1 / 2 * z
    end
    return p
end

Passing p from outside and reusing its memory gives ~10% improvement, but code becomes uglier.


#6

Thanks for the tip. If I understood correctly, I now need to define p as a SharedArray if I want it to be computed correctly. This is what I did in f2_parallel. When I run this code with julia -p 2 main.jl it works as expected and returns:

  2.720650 seconds (79.24 M allocations: 1.218 GiB, 1.73% gc time)
  0.110910 seconds (31.27 k allocations: 192.489 MiB, 16.70% gc time)
  0.026782 seconds (1.36 k allocations: 470.042 KiB)

So the first strange thing is that although I used only 2 processes, the time is divided by 4… But still, p1, p2 and p2_parallel are equal, as ensured by the test. How can this be explained? Is it truly the overall time spent on all workers that is displayed here?

Second, if I run this with julia -p 3 main.jl or whatever the number of workers greater than 3, the test fails:

  3.166441 seconds (79.24 M allocations: 1.218 GiB, 2.33% gc time)
  0.120809 seconds (31.27 k allocations: 192.489 MiB, 16.70% gc time)
  0.252574 seconds (2.28 k allocations: 525.289 KiB)
Test Failed
  Expression: p1 ≈ p2 ≈ p2_parallel

as now p2_parallel the first two columns (corresponding to k = 1, 2) are only zeroes, whereas the last three columns (corresponding to k = 3,4,5) are correctly computed. Did I do something wrong? I guess Julia’s parallelism is clever enough to figure out how to distribute the computations…

PS: I have an 8-core machine.


#7

Ah, my fault, I forgot that @parallel without reduction is actually asynchronous and you need @sync @parallel. With syncing, however, data transfer becomes a bottleneck and parallel version takes even longer than serial version.

One option is to try threading instead: add threads to Julia as (in terminal)

export JULIA_NUM_THREADS=4

And changing @parallel to Threads.@threads and SharedArray to a normal one.

However, globally I’d make a bet on correct memory layout (column-first arrays) and broadcasting which can often parallelize operations automatically. My latest example f3 seems to be broken right now (sorry for that, I’ll try to fix it a later today), but you should get the general idea.


#8

I tend to find optimization easier when just writing everything in clear loops first. For example, fnope with your definition of logpdf expanded; then, all transformations to obtain f4 become obvious:

function fnope(x::Matrix, mu::Matrix, sigma::Matrix)
    D, K = size(mu)
    D, N = size(x)
    p = zeros(N, K)
    for k = 1:K
        for n = 1:N
            for d = 1:D
                p[n,k] -= log(sigma[d,k]*sqrt(2*pi))
                p[n,k] -= (x[d,n]-mu[d,k])^2 / (2.0* sigma[d,k]^2)
            end
        end
    end
    return p
end
    
function f4(x::Matrix, mu::Matrix, sigma::Matrix)
    D, KK = size(mu)
    D, N = size(x)
    p = zeros(N, KK)
    isigma = zeros(D)
    cc = 0.5*log(2.0*pi)*D
    for k = 1:KK
        logsigma = cc
        @simd for d=1:D 
            logsigma += log(sigma[d,k])
        end
         @simd for d=1:D
            isigma[d] = 0.5 / (sigma[d,k]*sigma[d,k])
        end
        for n = 1:N
            p[n,k] -= logsigma    
           @simd for d = 1:D               
                p[n,k] -= (x[d,n]-mu[d,k])*(x[d,n]-mu[d,k]) * isigma[d]
            end
        end
    end
    return p
end

#post  warm-up
@time p1 = f1(x, μ, σ);
@time p2 = f2(x, μ, σ);
@time pnope = fnope(x, μ, σ); 
@time p4 = f4(x, μ, σ);  

  3.584316 seconds (79.24 M allocations: 1.218 GiB, 2.54% gc time)
  0.144864 seconds (63.48 k allocations: 384.598 MiB, 12.26% gc time)
  0.214383 seconds (5 allocations: 4.219 KiB)
  0.015099 seconds (7 allocations: 82.422 KiB)

That’s single-threaded. Feel free to use all your 8 cores to make it faster.


#9

Here’s a fixed version of f3 that takes 9.6 ms (for comparison, f2 takes 30.3 ms):

function f3(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(K, N)
    sum_log_sigma = vec(sum(log.(σ), 1))
    c = -sum_log_sigma - D / 2 * log(2 * π)
    for j = 1:N
        z = vec(sum(((x[:, j] .- μ) ./ σ) .^ 2, 1))
        p[:, j] .= c - 1 / 2 * z
    end
    return p'
end

and its threaded version (6.5 ms using 4 hyperthreads):

function f3(x::Matrix, μ::Matrix, σ::Matrix)
    D, K = size(μ)
    D, N = size(x)
    p = zeros(K, N)
    sum_log_sigma = vec(sum(log.(σ), 1))
    c = -sum_log_sigma - D / 2 * log(2 * π)
    Threads.@threads for j = 1:N
        z = vec(sum(((x[:, j] .- μ) ./ σ) .^ 2, 1))
        p[:, j] .= c - 1 / 2 * z
    end
    return p'
end

At the same time I like @foobar_lv2’s version since it allocates almost no memory. On my laptop f4 runs in 8.4 ms, but for whatever reason applying Threads.@thread to the outermost loop only hurts performance.


#10

Maybe we need to help the compiler a little and add some @inbounds and make more obvious what belongs into registers as compared to memory? Try the following f4b_mt.

f5 is probably out-of-scope, because I reformulated the problem into an actual matrix computation that uses the mighty GEMM, but this only becomes really worthwhile for larger K (try K=50). Optimizing is too much fun.

function f4b_mt(x::Matrix, mu::Matrix, sigma::Matrix)
    DD, K = size(mu)
    D, N = size(x)
    assert(D==DD)
    p = zeros(N, K)
    
    cc = 0.5*log(2.0*pi)*D
    Threads.@threads for k = 1:K
        logsigma = cc
        isigma = zeros(D)
        @simd for d=1:D 
            logsigma += log(sigma[d,k])
        end
         @simd for d=1:D
            isigma[d] = 0.5 / (sigma[d,k]*sigma[d,k])
        end
        @inbounds for n = 1:N
            pnk = -logsigma
               
           @simd for d = 1:D               
                pnk -= (x[d,n]-mu[d,k])*(x[d,n]-mu[d,k]) * isigma[d]
            end
            p[n,k] = pnk
        end
    end
    return p
end


function f5(x::Matrix, mu::Matrix, sigma::Matrix)
    DD, K = size(mu)
    D, N = size(x)
    DDD,KK = size(sigma)    
    assert(D==DD==DDD)
    assert(K==KK)
    p = zeros(N, K)
    
    cc = 0.5*log(2.0*pi)*D
    logsigma = zeros(K)
    isigma = 0.5 ./(sigma.*sigma)
    for k = 1:K
        logsigma[k] = cc
        @simd for d=1:D
           logsigma[k] += log(sigma[d,k])
           logsigma[k] += mu[d,k]*isigma[d,k]*mu[d,k]
        end
    end
    isigma_mu = isigma .* mu
    xmu = At_mul_B(x, isigma_mu) 
    xsq = x.*x
    xsq = At_mul_B(xsq, isigma)
    p .= -xsq .- logsigma' .+ 2.* xmu
    return p
end

#11

Thank you very much @foobar_lv2 and @dfdx for all your suggestions. It took me some time to reply as I was unfamiliar with Julia’s parallelism, I’ve taken the time to carefully read the documentation and understand what all the @parallel, @sync, Threads.@threads, @simd and @inbounds actually do. I’ve updated the code in my first message so that it contain all your updated versions. I’ve actually added a final version where I make use of the @fastmath macro, which further decreases the time by 50% while passing the test.

The final results on my machine look as follows (even though f2_parallel does not make much sense anymore):

f1                              3.045045 seconds (79.24 M allocations: 1.218 GiB, 1.54% gc time)
f2                              0.100420 seconds (31.27 k allocations: 192.489 MiB, 15.16% gc time)
f2_parallel                     0.068706 seconds (31.87 k allocations: 192.522 MiB, 15.29% gc time)
f3                              0.025778 seconds (1.11 k allocations: 46.231 MiB, 13.31% gc time)
f3_threads                      0.011847 seconds (1.10 k allocations: 45.314 MiB, 24.31% gc time)
f4b_mt                          0.007526 seconds (16 allocations: 395.281 KiB)
f4b_mt_threads                  0.001734 seconds (14 allocations: 160.984 KiB)
f4b_mt_threads_fast             0.001170 seconds (17 allocations: 395.359 KiB)

so all in all I’ve gained a factor of almost 100 compared to my first “clever” implementation, and a factor 3,000 compared to the dumb one.

I honestly did not expect to gain so much speed-up, and I learnt a lot with this topic, so thank you very much once again…!

PS: I did not consider your f5 function @foobar_lv2 since I’m always going to deal with K < 10 .