# Why is my code very slow

My code attached is long because I couldn’t figure out which part should be improved. I guess the bottleneck is function rls_multi which implements a recursive OLS. Note that this function was translated from C++ (armadillo) code in R package exuber:

I have read the performance tips thoroughly. But it is still very slow. Running the same task in R takes less than 10 seconds, but running the following Julia code takes 3 minutes. Any suggestion would be greatly appreciated.

``````using LinearAlgebra
using Statistics
using Random

function default_win_size(obs::Int64)
r0 = 0.01 + 1.8 / sqrt(obs)
return floor(Int64, r0 * obs)
end

function get_reg(y::AbstractVector{Float64}, lag::Int64,  null_hyp=false) #where {F<:Float64}

T = length(y)

lag_y = @view(y[lag+1:T-1])

dy=diff(y)
dy01 = @view(dy[lag+1:end])  #              @view(y[lag+2:end]) - @view(y[lag+1:end-1])

start_col=ifelse(null_hyp, 1, 2)
reg = Matrix{Float64}(undef, size(dy01,1), start_col + lag)
reg[:, start_col] .=1
if null_hyp == false
reg[:,1] =lag_y
end

@inbounds @simd for i in 1:lag
reg[:, i+start_col] = @view(dy[lag+1-i:end-i])
end

return reg, dy01
end

obs = size(y,1)
x_width = size(x, 2)  # the number of independent variables

t_x = x'

local g,b  # g is a square matrix of size x_width, b is a column vector of size x_width

@inbounds for r2 in min_win:obs
@inbounds for r1 in r2-min_win+1:-1:1
win_size = r2 - r1 + 1
tsx = @view t_x[:, r1:r2]
sx = tsx'
sy = @view y[r1:r2]
if r1 == r2 - min_win + 1
g = inv(tsx * sx)
b = g * (tsx * sy)
else
t_new_x = @view tsx[:, ]
new_x = @view sx[, :]
gx = g * t_new_x
the_factor = 1 / (1 + (new_x*gx))
g -= the_factor * (gx * gx')
b -= g * t_new_x * (dot(b, t_new_x) - sy)
end

res = sy - sx*b
sqres = (res'res)

vares::Float64 = sqres / (win_size - x_width)
#sb_1::Float64 = sqrt(vares * diag(g))
sb_1::Float64 = sqrt(vares * g[1,1])
temp_t[r1] = (b) / sb_1
if r1==1
end

end
end
end

obs_ori = length(y)

obs = obs_ori - 1 - adflag    # length(dy01)

max_start::Int64 = obs_ori - adflag - min_win   #max_starting

rls_zero(dy01, x[:,1], max_start, min_win, obs)
else
end

end

# MC simulation to get critical values
min_window=ifelse(min_window==0, default_win_size(T), min_window)
qe = [0.90 0.95 0.99]

dim = T - adflag - min_window
MPSY = Matrix{Float64}(undef, dim, nboot)
tbr = Matrix{Float64}(undef, dim, length(qe))

seed = 123

Random.seed!(seed)
e = randn(T, nboot)
a = T^(-1)

y = cumsum(e .+ a, dims=1)

@inbounds @simd  for iter ∈ 1:nboot
if mod(iter, 100) == 0
println(iter)
end
end

@inbounds @simd for i in 1:dim
@inbounds for j in eachindex(qe)
tbr[i, j] = quantile(@view(MPSY[i, :]), qe[j])
end
end
return (tbr)
end

start_time = now()
tbr=CV_PSY(300, 1000, 0, 1)
println("Elapsed time: ", now() - start_time)
``````

That is a lot of code! Have you tried profiling it, to a) identify which parts dominate the runtime, and b) to look for runtime dispatch? If not, see GitHub - timholy/ProfileView.jl: Visualization of Julia profiling data

Also, are you making sure to run the function more than once, so as to not include the first-time-only compilation time? E.g. using `@time` wil report the time to run, but also the percentage of time spent compiling vs running the code.

3 Likes

You might have to rewrite instead of literal line by line translation from C++.

This is from just using GLM package.

``````@time for j = 1:100
for i = 1:1000
m = lm(randn(300,2), rand(300))
coef(m) ./ stderror(m)
end
end

0.728653 seconds (3.10 M allocations: 2.494 GiB, 22.46% gc time)
``````

Have you tried the code without views? It could improve speed if the views lead to non contiguous memory access in the operations (I don’t fully understand the code so not sure if that’s the case)

What’s your intention with this notation? Don’t you want just `sx[1,:]`? (Is it something reminiscent from MATLAB?)

These operations, among others, are creating intermediate arrays, and are within the loop, thus probably causing a lot of unnecessary allocations.

Same for many Matrix products there, using views doesn’t help much if the result of the operations with them have to be allocated as new arrays.

I have the impression that the code could be much faster, but that it would require a bit of work inside the double loop to preallocate all intermediate arrays, and then use in place computations.

Thank you for your suggestions. I ran the profile tool and found that matrix multiplications cost a lot of time. So, I made the following changes:

(1) use matmul function from package Octavian to do multiplications.
(2) rotate the matrix of independent variables by 90 degrees, so that each time period is a column rather than a row. Because in recursive regressions, data are cut based on time periods.
(3) when calling a function, make sure to pass matrix/vector, rather than abstract matrix or abstract vector. This means when calculating the corresponding matrix or vector, do NOT use @view.
(4) replace the following line:

``````@inbounds @simd for iter ∈ 1:nboot
``````

to

``````    Threads.@threads for iter ∈ 1:nboot
``````

These changes reduce total running time from 180 seconds to 40 seconds, which is good enough. I also tried to write function rls_gsadf using Eigen C++, and then call it through CxxWrap. I haven’t done code optimization but the total running time has been reduced to 6 seconds. I will try to improve the Julia version of rls_gsadf to close to 10 seconds.

1 Like

Thank you. But there is no package to do recursive regression. It works like this: suppose you have 100 rows of data from t=1 to t=100 and t represents month. Suppose you choose a minimum window of 10. Then you run a sequence of regressions and each involves a window with at least 10 observations. The first regression uses data from t=1 to t=10 (10 observations). The second regression is from t=1 to t=11, then the 3rd from t=2 to t=11. Regression four is from t=1 to t=12, then from t=2 to t=12, so on and so forth. The last regression is from t=91 to t=100. In total, there are 4186 (=1+2+3+…+91) regressions. Doing 4186 regressions individually would not be efficient.

1 Like

Yes. I changed some of the code. For example, I changed

``````dy01=@view(dy[lag+1:end])
return reg, dy01
``````

to

``````return reg, dy[lag+1:end]
``````

At the beginning, I wrote sx[1,:] and then multiplied it by a matrix. I got an error because though sx is a matrix, sx[1,:] is “downgraded” to a vector. I have to write sx[,:] to make it still a matrix.

For

``````            res = sy - sx*b
sqres = (res'res)
``````

, do you mean it is better to change it to:

``````sqres=dot(sy-sx*b, sy-sx*b)
``````

? but in such as case wouldn’t sy-sx*b be calculated twice before the dot product?

No, not really. I’m trying to say (maybe with a bad example, I thought that `sx*b` was a matrix product there, but it isn’t, that you should try, if it is worth the hassle, to make your operations in-place, and preallocate intermediate arrays. The performance there is suboptimal because of allocations:

With a smaller problem and commenting the `print`:

``````julia> using BenchmarkTools

julia> @btime tbr=CV_PSY(30, 1000, 0, 1);
646.794 ms (4015073 allocations: 10.68 GiB)
``````

You see that the code allocates millions of things to up to 10GiB of memory. There is where the perfomance problem lies.

To improve on that, you should check whether these allocations happen, starting from the most inner operations.

Then I made this simple modification, allocating `g` and `b` arrays outside the inner double loop:

``````    local g = zeros(x_width,x_width)
local b = zeros(x_width)
# g is a square matrix of size x_width, b is a column vector of size x_width
``````

which then are updated in-place:

``````                g .= inv(tsx * sx)
b .= g * (tsx * sy)
...
g .-= the_factor * (gx * gx')
b .-= g * t_new_x * (dot(b, t_new_x) - sy)
``````

and then we get:

``````julia> @btime CV_PSY(30, 1000, 0, 1);
491.181 ms (3124073 allocations: 10.63 GiB)
``````

Note that we have improved the time in ~25% and reduced a million of allocations. This is by avoiding the allocation of new arrays on every iteration of the most inner loop.

If you manage to remove all allocations from that loop, performance will be probably much better.

But that will give you some work, since, for instance, something like `tsx * sx` above is allocating a new intermediate array, and that should be replaced by `LinearAlgebra.mul!(output, tsx, sx)` with a preallocated `output` array. And this is one example of many that would have to be dealt with there to improve, a lot, the performance. As I said, the code would result different from what you have written, and it depends on how fast that has to be.

Do you mean you need a 1xN matrix?
You can get that with `sx[1:1,:]`

It may be (from quick look at the code), that the OP wants a row-vector because later it will be used to compute a dot product. If that is the case, he should just use `sx[1,:]` and compute the dot product with `sx' * y`, or `dot(sx, y)`, instead of using `sx * y`.

`@simd, @inbounds` is pretty useless in the outer loops if all the time is spent inside `rls_gsadf, accumulate, quantile`. And it makes sense why using `Threads.@threads` instead gives a speed up, because you divide that work between threads then.

1 Like

Even if you decide not to go with calling GLM.jl methods, still worth looking at what they do. I naively thought I could make a faster & lightweight version at one point – I couldn’t. It is just that fast.

Thank you for your suggestions. The performance statistics of my initial code are:
232.916165 seconds (773.03 M allocations: 2.112 TiB, 26.80% gc time)

In the new version, they are improved to:
45.155297 seconds (387.93 M allocations: 193.650 GiB, 37.09% gc time, 2.48% compilation time)

``````using LinearAlgebra
using Statistics
using Random
using Octavian

function default_win_size(obs::Int64)
r0 = 0.01 + 1.8 / sqrt(obs)
return floor(Int64, r0 * obs)
end

@inbounds function get_reg(y::Vector{F}, lag::Int64, null_hyp=false) where {F<:Float64}
#each row is a variable and each col is an observation

dy::Vector{F} = diff(y)

start_row = ifelse(null_hyp, 1, 2)

reg = Matrix{F}(undef, start_row + lag, length(y) -1  - lag)  #column-major order

for i in eachcol(reg)    # eachc1:ncol   #column

if null_hyp == false
reg[1, i] = y[lag+i]
end
reg[start_row, i] = 1

for j in 1:lag
reg[j+start_row, i] = dy[lag+i-j]  #y[lag+1+i-j] - y[lag+i-j]
end
end

return reg, dy[lag+1:end]
end

@inbounds function get_reg(y::Vector{F}, lag::Int64, null_hyp=false) where {F<:Float64}
#each row is a variable and each col is an observation

dy::Vector{F} = diff(y)

start_row = ifelse(null_hyp, 1, 2)
ncol = length(y) - 1 - lag
reg = Matrix{F}(undef, start_row + lag, ncol)  #column-major order

for i in 1:ncol  #eachcol(reg)    # eachc1:ncol   #column

if null_hyp == false
reg[1, i] = y[lag+i]
end
reg[start_row, i] = 1

for j in 1:lag
reg[j+start_row, i] = dy[lag+i-j]  #y[lag+1+i-j] - y[lag+i-j]
end
end

return reg, dy[lag+1:end]
end

obs = length(y)
x_width = size(t_x, 1)  # the number of independent variables

#local g, b  # g is a square matrix of size x_width, b is a column vector of size x_width
local g = Matrix{F}(undef, x_width, x_width)
local b = Vector{F}(undef, x_width)
local gx = Matrix{F}(undef, x_width, 1)

@inbounds for r2 in min_win:obs
@inbounds @simd for r1 in r2-min_win+1:-1:1
win_size = r2 - r1 + 1
tsx = t_x[:, r1:r2]
#sx = tsx'
sy = y[r1:r2]
if win_size == min_win

g .= inv(matmul(tsx, tsx')) #inv(tsx * sx) # tsx')
b .= matmul(g, matmul(tsx, sy))
else
t_new_x = t_x[:, r1:r1] # an x_width by 1 matrix
gx .= matmul(g, t_new_x)
the_factor = 1 / (1 + matmul(t_new_x', gx))
g .-= the_factor * matmul(gx,gx')
b .-= matmul(g, t_new_x) * (dot(b, t_new_x) - y[r1])
end

res = sy - matmul(tsx', b)  # b' * @view(t_x[:, r1:r2])   #res = sy - sx * b
sqres = dot(res,res)    #dot(res, res) #

vares::F = sqres / (win_size - x_width)

temp_t[r1] = (b) / (sqrt(vares * g[1, 1]))
if r1 == 1
end

end
end
end

obs_ori = length(y)

max_start::Int64 = obs_ori - adflag - min_win   #max_starting

end

@inbounds function CV_PSY(T::Int64, nboot::Int64, min_window::Int64, adflag::Int64)

min_window= ifelse(min_window == 0, default_win_size(T), min_window)
qe = [0.90 0.95 0.99]

dim::Int64 = T - adflag - min_window
MPSY = Matrix{Float64}(undef, dim, nboot)
tbr = Matrix{Float64}(undef, dim, length(qe))

seed = 123

Random.seed!(seed)

e::Vector{Float64} = randn(T - adflag) .+ 1.0 / T
y::Vector{Float64} = cumsum(e)

if mod(iter, 100) == 0
println(iter)
end
end

for j in eachindex(qe)
tbr[i, j] = quantile((MPSY[i, :]), qe[j])
end
end

return (tbr)
end

@time tbr = CV_PSY(300, 1000, 0, 1)

``````
1 Like

I think the number of allocations is still very high and the GC time seems to indicate there is a lot of churn there.

Sometimes a code allocate a lot of memory and that is impossible to reduce, but to allocated such memory in hundreds of millions of small allocations is a code smell, and the GC time kinda confirms it. If ~30% improvement is still relevant, probably it would be worth to pin down what is allocating so many times and try to preallocate it.

I tried GLM you suggested. Code will be easier to read:

``````    @inbounds for r2 in min_win:obs
@inbounds  for r1 in r2-min_win+1:-1:1

sx= x[r1:r2, :]
sy = y[r1:r2]
m = lm(sx, sy)

temp_t[r1] = coef(m) ./ stderror(m)
if r1 == 1