 # Improving Speed for a Novice

Hi,

I am mostly an R user, but I am currently developing a package whose primary functions are written in Julia. The package estimates very large (inverse) covariance matrices. I already see considerable speed gains compared to R, but as a novice I was looking for input on making the Julia code even more efficient.

```julia
using Distributions

# Column means (Julia 0.6 syntax; on 1.0+ this is mean(x, dims = 1)).
function col_means(x)
    mean(x, 1)
end

# Subtract each column's mean.
function center_data(x)
    x .- col_means(x)
end

# Two-sided posterior probability that a coefficient differs from zero.
function post_prob(x)
    grt_zero = mean(repeat([0], inner = size(x, 1)) .> x)
    ls_zero = 1 - grt_zero
    1 - min(ls_zero, grt_zero)
end

# Mirror the lower triangle into the upper triangle. This is a stand-in for
# the package's force_sym, which is not shown in the post.
force_sym(M) = M + M' - diagm(diag(M))

function BB_est(x, sims, prob)
    X = center_data(x)
    n = size(X, 1)
    p = size(X, 2)
    prob = convert(Float64, prob)
    offdiag = div(p * (p - 1), 2)   # number of strictly lower-triangular entries

    sims = convert(Int64, sims)
    exp_drw = rand(Exponential(1), sims, n)

    # Normalize the exponential draws into Dirichlet weights (Bayesian bootstrap).
    dir_weights = exp_drw ./ sum(exp_drw, 2)

    A = zeros(sims, offdiag)
    B = zeros(1, offdiag)
    C = zeros(sims, p)
    mat_point = zeros(p, p)
    mat_sig = zeros(p, p)
    idx = tril(trues(size(mat_point)), -1)

    for i = 1:sims
        t = dir_weights[i, 1:n]
        d = X .* sqrt.(t)
        bb_cov = d' * d
        inv_cov = inv(bb_cov)
        d = diag(inv_cov)
        inv_cov[diagind(inv_cov)] = 0

        # idx already excludes the diagonal, so the zeros set above never enter
        # test; the filter only removes off-diagonal entries that are exactly zero.
        test = inv_cov[idx]
        rmv_0 = filter!(x -> x ≠ 0, test)
        A[i, :] = rmv_0
        C[i, :] = d
    end

    for i in 1:offdiag
        temp1 = post_prob(A[:, i])
        B[1, i] = ifelse(temp1 > prob, 1, 0)
    end

    mat_sig[idx] = B
    mat_point[idx] = mean(A, 1)
    mat_point[diagind(mat_point)] = mean(C, 1)
    force_sym(mat_sig), force_sym(mat_point), A, B
end

##########################
####### Run model ########
##########################
p = 500
n = 20

my_mvn = MvNormal(eye(p))
X = rand(my_mvn, n)'

@time test = BB_est(X, 500, .95)
```

Hi,

You’re much more likely to get good answers if you:

- quote all your code using triple backticks,
- indent it properly, and
- profile it first to get an idea of the bottlenecks, so that it is clearer which steps are limiting.

From what I can see by looking at the code, you have explicit calls to `inv`, which are usually better avoided, and Matlab-style code with masks and index arrays, which is not very idiomatic in Julia and might be faster with explicit loops or views/dots. It all depends on where your bottleneck is.
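
For instance, a first profiling pass might look like this (a sketch, assuming `BB_est` from the post above is already defined; on Julia 1.0+ you would also need `using Profile`):

```julia
using Distributions

p = 500
n = 20
X = rand(MvNormal(eye(p)), n)'

BB_est(X, 100, .95)             # warm-up call so compilation is not profiled

Profile.clear()
@profile BB_est(X, 100, .95)
Profile.print(format = :flat)   # or: using ProfileView; ProfileView.view()
```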


So here, it appears from profiling (with `@profile` and the `ProfileView` package) that there are two main bottlenecks: the matrix inversion and type inference around `temp1 = post_prob(A[:,i])`. So first you need to figure out why `post_prob` is not type-stable (see the performance tips in the docs, and use the `@code_warntype` macro). Then, if you know `d` is full rank, so that `d'd` is symmetric positive definite (SPD), you can use a more efficient Cholesky factorization to compute the inverse. If you can use the Cholesky factorization directly, without computing the inverse explicitly, that’s even better.
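
Concretely, the checks and the Cholesky route might look like this (a sketch, assuming `post_prob` from the original post is defined; the sizes and the ridge term are only there to keep the toy example SPD):

```julia
# Type-stability check on a representative input (look for Any in the output):
x = randn(500)
@code_warntype post_prob(x)

# Cholesky-based inverse of an SPD matrix:
d = randn(100, 20)
bb_cov = d' * d + 0.01 * eye(20)   # small ridge keeps the toy example SPD
inv_cov = inv(cholfact(bb_cov))    # exploits SPD structure, unlike plain inv()

# Better yet, solve against the factorization instead of forming the inverse:
F = cholfact(bb_cov)
b = randn(20)
y = F \ b                          # equivalent to inv(bb_cov) * b, but cheaper
```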

Edit: the fancy indexing also takes time. That style of programming is really the only viable one in R/Matlab, but that’s not the case in Julia; rewriting it with explicit loops will speed things up.
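
For illustration, the masked extraction `inv_cov[idx]` inside the loop could be replaced by an explicit loop over the strictly lower triangle; `fill_lower!` below is a hypothetical helper, not code from the thread:

```julia
# Copy the strictly lower-triangular entries of M into row i of A,
# in the same column-major order as M[tril(trues(size(M)), -1)],
# without allocating a mask or a temporary vector.
function fill_lower!(A, M, i)
    p = size(M, 1)
    k = 0
    for j in 1:p-1          # columns
        for l in j+1:p      # rows below the diagonal
            k += 1
            A[i, k] = M[l, j]
        end
    end
    A
end
```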


On the explicit `inv`, one simple trick for positive definite matrices `PD` is to just use `inv(cholfact(PD))` (there is also `inv(Symmetric(PD))`, but that doesn’t take advantage of positive definiteness, and was slower in a basic test). This is equivalent to `chol2inv(chol(PD))` in `R`.

```julia
julia> PD = randn(3000, 2000) |> x -> x' * x;

julia> using BenchmarkTools

julia> @benchmark inv($PD)
BenchmarkTools.Trial:
  memory estimate:  64.48 MiB
  allocs estimate:  13
  --------------
  minimum time:     596.901 ms (1.87% GC)
  median time:      702.844 ms (1.32% GC)
  mean time:        704.770 ms (5.65% GC)
  maximum time:     837.746 ms (17.26% GC)
  --------------
  samples:          8
  evals/sample:     1

julia> @benchmark inv(cholfact($PD))
BenchmarkTools.Trial:
  memory estimate:  61.04 MiB
  allocs estimate:  11
  --------------
  minimum time:     365.522 ms (0.31% GC)
  median time:      458.306 ms (0.47% GC)
  mean time:        464.446 ms (5.83% GC)
  maximum time:     604.741 ms (22.65% GC)
  --------------
  samples:          11
  evals/sample:     1
```

Edit: if you want to go lower level and call the LAPACK routines directly:

```julia
using Compat, Compat.LinearAlgebra

pdinv(PD) = pdinv!(copy(PD))

function pdinv!(PD)
    LAPACK.potrf!('U', PD)   # in-place Cholesky factorization (upper triangle)
    LAPACK.potri!('U', PD)   # inverse from the Cholesky factor, upper triangle only
    Symmetric(PD)            # wrap: the lower triangle is stale after potri!
end
```

and then:

```julia
julia> @benchmark pdinv($PD)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  6
  --------------
  minimum time:     246.885 ms (0.13% GC)
  median time:      249.740 ms (0.25% GC)
  mean time:        255.489 ms (2.29% GC)
  maximum time:     353.000 ms (25.33% GC)
  --------------
  samples:          20
  evals/sample:     1
```

It’s interesting that your `pdinv` is faster than `inv(cholfact())`, even though they both use the same `potrf` and `potri` routines. Profiling it, it looks like the copies do take some time, even on big matrices like this where the Cholesky factorization should dominate. There’s also an `ishermitian` check that takes a non-trivial amount of time. It’s pretty amazing, and a good demonstration of both the flops/memory-bandwidth gap and the skill of BLAS programmers, that a full Cholesky factorization and inversion, which should cost roughly 2k floating-point operations per matrix element more than a copy, is only about 3x slower.
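
To see that gap directly, one could time the copy and the factorization separately (a sketch in the same setup as above; no results shown since the numbers are machine-dependent):

```julia
using BenchmarkTools

# Same construction as the earlier benchmarks: a 2000x2000 SPD matrix.
PD = randn(3000, 2000) |> x -> x' * x;

@benchmark copy($PD)       # pure memory traffic, no floating-point work
@benchmark cholfact($PD)   # adds ~n^3/3 flops on top of similar memory traffic
```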