Improving Speed for a Novice


#1

Hi,
I am mostly an R user, but am currently developing a package whose primary functions are written in Julia. The package estimates very large (inverse) covariance matrices. I already see considerable speed gains compared to R, but as a novice I was looking for some input to make the Julia code even more efficient.

function col_means(x)
    mean(x, 1)
end

function center_data(x)
    x .- col_means(x)
end

function post_prob(x)
    ls_zero = mean(x .< 0)    # proportion of draws below zero
    grt_zero = 1 - ls_zero
    1 - min(ls_zero, grt_zero)
end

function BB_est(x, sims, prob)
    X = center_data(x)
    n = size(X, 1)
    p = size(X, 2)
    prob = convert(Float64, prob)
    offdiag = div(p * (p - 1), 2)   # number of strictly lower-triangular entries

    sims = convert(Int64, sims)
    exp_drw = rand(Exponential(1), sims, n)

    # normalize rows of exponential draws to get Bayesian-bootstrap weights
    dir_weights = exp_drw ./ sum(exp_drw, 2)

    A = zeros(sims, offdiag)
    B = zeros(1, offdiag)
    C = zeros(sims, p)
    mat_point = zeros(p, p)
    mat_sig = zeros(p, p)
    idx = tril(trues(size(mat_point)), -1)

    for i = 1:sims
        t = dir_weights[i, 1:n]
        d = X .* sqrt.(t)
        bb_cov = d' * d
        inv_cov = inv(bb_cov)
        inv_diag = diag(inv_cov)
        inv_cov[diagind(inv_cov)] = 0

        # idx already excludes the diagonal, so no zero-filtering is needed
        A[i, :] = inv_cov[idx]
        C[i, :] = inv_diag
    end

    for i in 1:offdiag
        temp1 = post_prob(A[:, i])
        B[1, i] = ifelse(temp1 > prob, 1, 0)
    end

    mat_sig[idx] = B
    mat_point[idx] = mean(A, 1)
    mat_point[diagind(mat_point)] = mean(C, 1)
    # force_sym is assumed to be defined elsewhere in the package
    force_sym(mat_sig), force_sym(mat_point), A, B
end


##########################
###### Run model ##########
##########################
using Distributions

p = 500
n = 20

my_mvn = MvNormal(eye(p))
X = rand(my_mvn, n)'

@time test = BB_est(X, 500, .95)

#2

Hi,

You’re much more likely to get good answers if you:

  • quote all your code using triple backticks

  • indent it properly

  • profile it first to get an idea of the bottlenecks, so that it is clearer which steps are limiting
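
For the third point, a minimal profiling sketch looks like this (with a hypothetical stand-in workload, not the code above):

```julia
using Profile, LinearAlgebra

# Hypothetical stand-in workload: repeated matrix inversions
work() = sum(tr(inv(rand(100, 100) + 100I)) for _ in 1:50)

work()                        # run once first so compilation doesn't pollute the profile
@profile work()
Profile.print(maxdepth = 8)   # text report; ProfileView.view() gives a flame graph instead
```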

From what I can see by looking at the code, you have explicit inv calls, which are usually better avoided, and MATLAB-style code with masks and index arrays, which is not very idiomatic in Julia and might be faster with explicit loops or views/dots. It all depends on where your bottleneck is.
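
For instance (a hypothetical sketch, not the original code), extracting the strictly lower triangle with an explicit loop instead of a boolean mask skips allocating both the mask and the temporary copy:

```julia
using LinearAlgebra

# Mask style (as in the original post) would be:
#     idx = tril(trues(p, p), -1);  vals = M[idx]

# Loop style: fills a preallocated vector in the same column-major order
function lower_tri!(out, M)
    p = size(M, 1)
    k = 0
    for j in 1:p-1, i in j+1:p
        k += 1
        out[k] = M[i, j]
    end
    out
end

M = rand(5, 5)
vals = lower_tri!(zeros(10), M)   # 10 = 5*4/2 strictly-lower entries
```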


#3

So here, profiling (@profile and the ProfileView package) shows two main bottlenecks: matrix inversion and type inference near temp1 = post_prob(A[:,i]). So first you need to figure out why post_prob is not type-stable (see the performance tips in the docs, and use the @code_warntype macro). Then, if you know d is full rank, so that d'd is SPD, you can use a more efficient Cholesky factorization to compute the inverse. If you can use the Cholesky factorization directly without computing the inverse explicitly, that’s even better.
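
A rough sketch of that last suggestion (hypothetical names; on recent Julia the function is cholesky, which plays the role of this thread's cholfact): whenever the inverse is only ever applied to a vector, factorize and solve instead of forming inv:

```julia
using LinearAlgebra

d = randn(20, 5)
S = d' * d                    # SPD when d has full column rank
b = randn(5)

x_slow = inv(S) * b           # forms the explicit inverse
F = cholesky(Symmetric(S))    # factorize once...
x_fast = F \ b                # ...then solve S * x = b without inv(S)
```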

edit: the fancy indexing also takes time. This kind of programming is really the only viable one in R/MATLAB, but that’s not the case in Julia; rewriting it with loops will speed things up.


#4

On the explicit inv: one simple trick for a positive definite matrix PD is to just use inv(cholfact(PD)) (there is also inv(Symmetric(PD)), but that doesn’t take advantage of positive definiteness and was slower in a basic test). This is equivalent to chol2inv(chol(PD)) in R.

julia> PD = randn(3000,2000) |> x -> x' * x;

julia> using BenchmarkTools

julia> @benchmark inv($PD)
BenchmarkTools.Trial: 
  memory estimate:  64.48 MiB
  allocs estimate:  13
  --------------
  minimum time:     596.901 ms (1.87% GC)
  median time:      702.844 ms (1.32% GC)
  mean time:        704.770 ms (5.65% GC)
  maximum time:     837.746 ms (17.26% GC)
  --------------
  samples:          8
  evals/sample:     1

julia> @benchmark inv(cholfact($PD))
BenchmarkTools.Trial: 
  memory estimate:  61.04 MiB
  allocs estimate:  11
  --------------
  minimum time:     365.522 ms (0.31% GC)
  median time:      458.306 ms (0.47% GC)
  mean time:        464.446 ms (5.83% GC)
  maximum time:     604.741 ms (22.65% GC)
  --------------
  samples:          11
  evals/sample:     1

Edit: if you want to go lower level in your LAPACK interface:

using Compat, Compat.LinearAlgebra
pdinv(PD) = pdinv!(copy(PD))
function pdinv!(PD)
    LAPACK.potrf!('U', PD)
    LAPACK.potri!('U', PD)
    Symmetric(PD)
end

and then:

julia> @benchmark pdinv($PD)
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  6
  --------------
  minimum time:     246.885 ms (0.13% GC)
  median time:      249.740 ms (0.25% GC)
  mean time:        255.489 ms (2.29% GC)
  maximum time:     353.000 ms (25.33% GC)
  --------------
  samples:          20
  evals/sample:     1

#5

Just some helpful links to follow up on what @antoine-levitt suggested:


#6

It’s interesting that your pdinv is faster than inv(cholfact()), although they both use the same potrf and potri routines. Profiling it, it looks like the copies do take some time, even on big matrices like this where the Cholesky factorization should dominate. There’s also an ishermitian check that takes a non-trivial amount of time. It’s pretty amazing, and a good demonstration of both the FLOPs/memory-bandwidth gap and the skill of BLAS programmers, that a full Cholesky factorization, which costs orders of magnitude more flops than a copy, is only about 3x slower.


#7

I am in the process of exploring these options.

I was curious if using @parallel would also be a viable option, since each bootstrap draw is independent.
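
For instance, would a threaded loop along these lines help? (A rough sketch only; per_draw is a hypothetical stand-in for the per-simulation work, and Julia needs to be started with multiple threads.)

```julia
using Base.Threads

# Hypothetical stand-in for the work done per bootstrap draw
per_draw(i) = sum(abs2, sin.(i .* (1:100)))

sims = 16
results = zeros(sims)
@threads for i in 1:sims     # iterations are independent, so no shared mutable state
    results[i] = per_draw(i)
end
```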


#8

I have updated the code. I have followed the performance tips but have not seen much improvement, which I am sure is due to my having zero Julia experience.