Why is my code so slow?

The attached code is long because I couldn’t figure out which part needs improvement. I suspect the bottleneck is the function rls_multi, which implements a recursive OLS. Note that this function was translated from the C++ (Armadillo) code in the R package exuber.

I have read the performance tips thoroughly, but the code is still very slow. Running the same task in R takes less than 10 seconds, while the following Julia code takes about 3 minutes. Any suggestions would be greatly appreciated.

using LinearAlgebra
using Statistics
using Random
using Dates   # needed for now() in the timing at the bottom

function default_win_size(obs::Int64)
    r0 = 0.01 + 1.8 / sqrt(obs)
    return floor(Int64, r0 * obs)
end

function get_reg(y::AbstractVector{Float64}, lag::Int64,  null_hyp=false) #where {F<:Float64}

    T = length(y)    
    
    lag_y = @view(y[lag+1:T-1])  
    
    dy=diff(y)
    dy01 = @view(dy[lag+1:end])  #              @view(y[lag+2:end]) - @view(y[lag+1:end-1])
    
    start_col=ifelse(null_hyp, 1, 2)
    reg = Matrix{Float64}(undef, size(dy01,1), start_col + lag)
    reg[:, start_col] .=1
    if null_hyp == false
        reg[:,1] =lag_y
    end       
    
    @inbounds @simd for i in 1:lag        
        reg[:, i+start_col] = @view(dy[lag+1-i:end-i])
    end

    return reg, dy01
end

function rls_multi(y::AbstractVector{Float64}, x::Matrix{Float64}, badf::Vector{Float64}, bsadf::Vector{Float64},  min_win::Int64)    #where {F<:Float64}

    obs = size(y,1)
    x_width = size(x, 2)  # the number of independent variables

    t_x = x'       

    temp_t=Vector{Float64}(undef, length(bsadf))

    local g,b  # g is a square matrix of size x_width, b is a column vector of size x_width

    @inbounds for r2 in min_win:obs        
        @inbounds for r1 in r2-min_win+1:-1:1      
            win_size = r2 - r1 + 1      
            tsx = @view t_x[:, r1:r2]
            sx = tsx'
            sy = @view y[r1:r2]
            if r1 == r2 - min_win + 1
                g = inv(tsx * sx)
                b = g * (tsx * sy)
            else
                t_new_x = @view tsx[:, [1]]
                new_x = @view sx[[1], :]
                gx = g * t_new_x
                the_factor = 1 / (1 + (new_x*gx)[1])
                g -= the_factor * (gx * gx')                
                b -= g * t_new_x * (dot(b, t_new_x) - sy[1])                                               
            end            

            res = sy - sx*b
            sqres = (res'res)[1]

            vares::Float64 = sqres / (win_size - x_width)
            #sb_1::Float64 = sqrt(vares * diag(g)[1])
            sb_1::Float64 = sqrt(vares * g[1,1])            
            temp_t[r1] = (b[1]) / sb_1  
            if r1==1 
                badf[r2-min_win+1] = temp_t[r1]
            end

        end
        bsadf[r2-min_win+1] = maximum(@view temp_t[1:r2-min_win+1])#[1]
    end
end

function rls_gsadf(y::AbstractVector{Float64}, adflag::Int64, min_win::Int64) 
    obs_ori = length(y)    
    
    obs = obs_ori - 1 - adflag    # length(dy01)    

    max_start::Int64 = obs_ori - adflag - min_win   #max_starting 

    x, dy01=get_reg(y, adflag)

    badf=Vector{Float64}(undef, max_start)
    bsadf=Vector{Float64}(undef, max_start)

    if adflag == 0
       
       rls_zero(dy01, x[:,1], max_start, min_win, obs) 
    else                      
       rls_multi(dy01, x, badf, bsadf,  min_win)
    end    

    return (badf, bsadf)

end


function CV_PSY(T::Int, nboot::Int, min_window::Int, adflag::Int)
    # MC simulation to get critical values
    min_window=ifelse(min_window==0, default_win_size(T), min_window)
    qe = [0.90 0.95 0.99]

    dim = T - adflag - min_window
    MPSY = Matrix{Float64}(undef, dim, nboot)
    tbr = Matrix{Float64}(undef, dim, length(qe))

    seed = 123
    
    Random.seed!(seed)
    e = randn(T, nboot)
    a = T^(-1)

    y = cumsum(e .+ a, dims=1)
    

    @inbounds @simd  for iter ∈ 1:nboot
        if mod(iter, 100) == 0
            println(iter)
        end                       
        badf, _ =rls_gsadf(@view(y[:,iter]), adflag, min_window)  
        MPSY[:, iter] = accumulate(max, badf)
    end

    @inbounds @simd for i in 1:dim
        @inbounds for j in eachindex(qe)
            tbr[i, j] = quantile(@view(MPSY[i, :]), qe[j])
        end
    end
    return (tbr)
end

start_time = now()
tbr=CV_PSY(300, 1000, 0, 1)      
println("Elapsed time: ", now() - start_time)

That is a lot of code! Have you tried profiling it, to (a) identify which parts dominate the runtime, and (b) look for runtime dispatch? If not, see GitHub - timholy/ProfileView.jl: Visualization of Julia profiling data

Also, are you making sure to run the function more than once, so as not to include the first-time-only compilation time? E.g. using @time will report the run time, but also the percentage of time spent compiling vs running the code.
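
For example (with a smaller nboot so it runs quickly), calling it twice makes the difference obvious:

julia> @time CV_PSY(300, 10, 0, 1);   # first call: includes compilation time

julia> @time CV_PSY(300, 10, 0, 1);   # second call: essentially pure runtime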

You might have to rewrite it rather than do a literal line-by-line translation from C++.

This is just using the GLM package:

using GLM

@time for j = 1:100
    for i = 1:1000
        m = lm(randn(300,2), rand(300))
        coef(m) ./ stderror(m)
    end
end

0.728653 seconds (3.10 M allocations: 2.494 GiB, 22.46% gc time)

Have you tried the code without views? It could improve speed if the views lead to non-contiguous memory access in the operations (I don’t fully understand the code, so I’m not sure whether that’s the case).

https://docs.julialang.org/en/v1/manual/performance-tips/#Copying-data-is-not-always-bad

What’s your intention with the sx[[1], :] notation? Don’t you want just sx[1, :]? (Is it something reminiscent of MATLAB?)

Operations like res = sy - sx*b, among others, create intermediate arrays and sit inside the loop, thus probably causing a lot of unnecessary allocations.

The same goes for many of the matrix products there: using views doesn’t help much if the results of the operations on them still have to be allocated as new arrays.

I have the impression that the code could be much faster, but that it would require a bit of work inside the double loop to preallocate all intermediate arrays and then use in-place computations.

Thank you for your suggestions. I ran the profile tool and found that matrix multiplications cost a lot of time. So, I made the following changes:

(1) use matmul function from package Octavian to do multiplications.
(2) transpose the matrix of independent variables so that each time period is a column rather than a row, because in the recursive regressions the data are sliced by time period (see the sketch after this list).
(3) when calling a function, pass a concrete Matrix/Vector rather than an abstract matrix or vector; in other words, do NOT build the corresponding matrix or vector with @view.
(4) replace the following line:

@inbounds @simd for iter ∈ 1:nboot

with

    Threads.@threads for iter ∈ 1:nboot

These changes reduce the total running time from 180 seconds to 40 seconds, which is good enough. I also tried writing rls_gsadf in C++ with Eigen and calling it through CxxWrap; without any optimization, that version brings the total running time down to 6 seconds. I will try to get the Julia version of rls_gsadf close to 10 seconds.
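
To illustrate point (2): Julia arrays are column-major, so with observations stored as columns a window of consecutive time periods is one contiguous block of memory, whereas with observations stored as rows it is a strided slice. A minimal standalone sketch (the array names are made up):

using BenchmarkTools

X_rows = randn(1000, 3)     # observations in rows (the original layout)
X_cols = Matrix(X_rows')    # observations in columns (the rotated layout)

@btime sum(@view($X_rows[10:500, :]));  # strided access across the three columns
@btime sum(@view($X_cols[:, 10:500]));  # one contiguous block of memory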

Thank you. But there is no package that does recursive regression. It works like this: suppose you have 100 rows of data from t=1 to t=100, where t represents the month, and you choose a minimum window of 10. You then run a sequence of regressions, each on a window of at least 10 observations. The first regression uses data from t=1 to t=10 (10 observations). The second is from t=1 to t=11, the third from t=2 to t=11. Regression four is from t=1 to t=12, then regression five from t=2 to t=12, and so on. The last regression is from t=91 to t=100. In total there are 4186 (= 1 + 2 + 3 + … + 91) regressions, and running 4186 regressions individually would not be efficient.
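
Just to make the count concrete, a minimal sketch that enumerates the windows:

obs, min_win = 100, 10
windows = [(r1, r2) for r2 in min_win:obs for r1 in 1:(r2 - min_win + 1)]
length(windows)   # 4186 windows, i.e. 1 + 2 + … + 91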

Yes. I changed some of the code. For example, I changed

dy01=@view(dy[lag+1:end])
return reg, dy01

to

return reg, dy[lag+1:end]

At the beginning I wrote sx[1,:] and then multiplied it by a matrix. I got an error because, although sx is a matrix, sx[1,:] is “downgraded” to a vector. I had to write sx[[1],:] to keep it a matrix.
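
To make the difference concrete (a standalone sketch, not the actual data):

sx = randn(5, 3); g = randn(3, 3)

sx[1, :]        # a length-3 Vector: the singleton row dimension is dropped
sx[[1], :]      # a 1×3 Matrix
sx[1:1, :]      # also a 1×3 Matrix, without allocating the index vector [1]

sx[[1], :] * g  # works: 1×3 times 3×3 gives a 1×3 result
# sx[1, :] * g  # would error: a Vector cannot left-multiply a 3×3 matrix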

For

            res = sy - sx*b
            sqres = (res'res)[1]

, do you mean it is better to change it to:

sqres=dot(sy-sx*b, sy-sx*b)

? But in such a case, wouldn’t sy-sx*b be calculated twice before the dot product?

No, not really. I’m trying to say (maybe with a bad example; I thought sx*b was a matrix product there, but it isn’t) that you should try, if it is worth the hassle, to make your operations in-place and preallocate intermediate arrays. The performance there is suboptimal because of allocations:

With a smaller problem and the print commented out:

julia> using BenchmarkTools

julia> @btime tbr=CV_PSY(30, 1000, 0, 1);
  646.794 ms (4015073 allocations: 10.68 GiB)

You can see that the code performs millions of allocations, adding up to about 10 GiB of memory. That is where the performance problem lies.

To improve on that, you should check where these allocations happen, starting from the innermost operations.
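
As a quick check, @allocated on a single inner expression tells you whether it creates a new array (a minimal sketch with made-up sizes):

julia> tsx = randn(2, 30); sx = collect(tsx');

julia> @allocated tsx * sx    # positive: every such product allocates a fresh matrix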

Then I made this simple modification, allocating the g and b arrays outside the double loop:

    local g = zeros(x_width,x_width)
    local b = zeros(x_width)
    # g is a square matrix of size x_width, b is a column vector of size x_width

which then are updated in-place:

                g .= inv(tsx * sx)
                b .= g * (tsx * sy)
...
                g .-= the_factor * (gx * gx')
                b .-= g * t_new_x * (dot(b, t_new_x) - sy[1])

and then we get:

julia> @btime CV_PSY(30, 1000, 0, 1);
  491.181 ms (3124073 allocations: 10.63 GiB)

Note that we have improved the time by ~25% and removed almost a million allocations, simply by avoiding the allocation of new g and b arrays on every iteration of the innermost loop.

If you manage to remove all allocations from that loop, performance will probably be much better.

But that will take some work since, for instance, something like tsx * sx above allocates a new intermediate array and should be replaced by LinearAlgebra.mul!(output, tsx, sx) with a preallocated output array. And that is only one of many spots that would have to be dealt with to improve the performance a lot. As I said, the code would end up quite different from what you have written, and it depends on how fast it really has to be.
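
For concreteness, a minimal self-contained sketch of that pattern (the sizes and buffer names here are made up):

using LinearAlgebra

xwidth, win = 2, 30
tsx = randn(xwidth, win);  sx = collect(tsx');  sy = randn(win)

xtx = Matrix{Float64}(undef, xwidth, xwidth)   # preallocated once, outside the loops
xty = Vector{Float64}(undef, xwidth)
g   = Matrix{Float64}(undef, xwidth, xwidth)
b   = Vector{Float64}(undef, xwidth)

# inside the loop, instead of g = inv(tsx * sx); b = g * (tsx * sy):
mul!(xtx, tsx, sx)   # overwrites xtx with tsx * sx, no intermediate array
mul!(xty, tsx, sy)   # overwrites xty with tsx * sy
g .= inv(xtx)        # inv still allocates; an in-place factorization would remove even that
mul!(b, g, xty)      # overwrites b with g * xty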

Do you mean you need a 1xN matrix?
You can get that with sx[1:1,:]

It may be (from a quick look at the code) that the OP wants a row vector because it is later used to compute a dot product. If that is the case, they could just take v = sx[1,:] and compute the dot product with v' * y, or dot(v, y), instead of multiplying a 1×N matrix by y.

@simd and @inbounds are pretty useless on the outer loops if all the time is spent inside rls_gsadf, accumulate, and quantile. And it makes sense that using Threads.@threads there instead gives a speed-up, because then you divide that work between threads.

Even if you decide not to call GLM.jl methods, it is still worth looking at what they do. I naively thought I could make a faster and more lightweight version at one point; I couldn’t. It is just that fast.

Thank you for your suggestions. The performance statistics of my initial code are:
232.916165 seconds (773.03 M allocations: 2.112 TiB, 26.80% gc time)

In the new version, they are improved to:
45.155297 seconds (387.93 M allocations: 193.650 GiB, 37.09% gc time, 2.48% compilation time)

using LinearAlgebra
using Statistics
using Random
using Octavian

function default_win_size(obs::Int64)
    r0 = 0.01 + 1.8 / sqrt(obs)
    return floor(Int64, r0 * obs)
end

@inbounds function get_reg(y::Vector{F}, lag::Int64, null_hyp=false) where {F<:Float64}
    #each row is a variable and each col is an observation

    dy::Vector{F} = diff(y)

    start_row = ifelse(null_hyp, 1, 2)
    ncol = length(y) - 1 - lag
    reg = Matrix{F}(undef, start_row + lag, ncol)  #column-major order    

    for i in 1:ncol  #eachcol(reg)    # eachc1:ncol   #column

        if null_hyp == false
            reg[1, i] = y[lag+i]
        end
        reg[start_row, i] = 1

        for j in 1:lag
            reg[j+start_row, i] = dy[lag+i-j]  #y[lag+1+i-j] - y[lag+i-j]
        end
    end

    return reg, dy[lag+1:end]
end

function rls_multi!(y::Vector{F}, t_x::Matrix{F}, badf::Vector{F}, bsadf::Vector{F}, min_win::Int64) where {F<:Float64}

    obs = length(y)
    x_width = size(t_x, 1)  # the number of independent variables    

    temp_t = Vector{F}(undef, length(bsadf))

    #local g, b  # g is a square matrix of size x_width, b is a column vector of size x_width
    local g = Matrix{F}(undef, x_width, x_width)
    local b = Vector{F}(undef, x_width)
    local gx = Matrix{F}(undef, x_width, 1)

    @inbounds for r2 in min_win:obs
        @inbounds @simd for r1 in r2-min_win+1:-1:1
            win_size = r2 - r1 + 1
            tsx = t_x[:, r1:r2]
            #sx = tsx'
            sy = y[r1:r2]
            if win_size == min_win

                g .= inv(matmul(tsx, tsx')) #inv(tsx * sx) # tsx')
                b .= matmul(g, matmul(tsx, sy))
            else
                t_new_x = t_x[:, r1:r1] # an x_width by 1 matrix
                gx .= matmul(g, t_new_x)
                the_factor = 1 / (1 + matmul(t_new_x', gx)[1])
                g .-= the_factor * matmul(gx,gx')
                b .-= vec(matmul(g, t_new_x)) * (dot(b, t_new_x) - y[r1])  # vec so the x_width×1 result broadcasts into the vector b
            end

            res = sy - matmul(tsx', b)  # b' * @view(t_x[:, r1:r2])   #res = sy - sx * b
            sqres = dot(res,res)    #dot(res, res) #


            vares::F = sqres / (win_size - x_width)
            
            temp_t[r1] = (b[1]) / (sqrt(vares * g[1, 1]))
            if r1 == 1
                badf[r2-min_win+1] = temp_t[r1]
            end

        end
        bsadf[r2-min_win+1] = maximum(@view temp_t[1:r2-min_win+1])#[1]
    end
end

function rls_gsadf(y::Vector{F}, adflag::Int64, min_win::Int64) where {F<:Float64}
    
    obs_ori = length(y)   

    max_start::Int64 = obs_ori - adflag - min_win   #max_starting 

    badf = Vector{F}(undef, max_start)
    bsadf = Vector{F}(undef, max_start)

    x, dy01 = get_reg(y, adflag)
    rls_multi!(dy01, x, badf, bsadf, min_win)

    return (badf, bsadf)

end


@inbounds function CV_PSY(T::Int64, nboot::Int64, min_window::Int64, adflag::Int64)

    min_window= ifelse(min_window == 0, default_win_size(T), min_window)
    qe = [0.90 0.95 0.99]

    dim::Int64 = T - adflag - min_window
    MPSY = Matrix{Float64}(undef, dim, nboot)
    tbr = Matrix{Float64}(undef, dim, length(qe))
    badf = Matrix{Float64}(undef, dim, nboot)

    seed = 123

    Random.seed!(seed)

    Threads.@threads for iter ∈ 1:nboot
        e::Vector{Float64} = randn(T - adflag) .+ 1.0 / T
        y::Vector{Float64} = cumsum(e)

        badf[:,iter] , _= rls_gsadf(y, 0, min_window)
        
        if mod(iter, 100) == 0
            println(iter)
        end
    end
    
    accumulate!(max, MPSY, badf, dims=1)   
    
    Threads.@threads for i in 1:dim
        for j in eachindex(qe)
            tbr[i, j] = quantile((MPSY[i, :]), qe[j])
        end
    end
    
    return (tbr)
end

@time tbr = CV_PSY(300, 1000, 0, 1)

I think the number of allocations is still very high, and the GC time seems to indicate there is a lot of churn there.

Sometimes code allocates a lot of memory and that is impossible to reduce, but allocating that memory in hundreds of millions of small allocations is a code smell, and the GC time kind of confirms it. If a further ~30% improvement is still relevant, it is probably worth pinning down what is allocating so many times and preallocating it.
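
One way to pin that down is to benchmark the hot routine on a single replication and look at its allocation count (a sketch, following the call signatures in the code you posted):

using BenchmarkTools

y = cumsum(randn(300) .+ 1/300)
min_window = default_win_size(300)
@btime rls_gsadf($y, 0, $min_window);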

I tried the GLM approach you suggested. The code for the inner double loop is easier to read:

    @inbounds for r2 in min_win:obs
        @inbounds  for r1 in r2-min_win+1:-1:1
            
            sx= x[r1:r2, :]              
            sy = y[r1:r2]
            m = lm(sx, sy)
            
            temp_t[r1] = coef(m)[1] ./ stderror(m)[1]
            if r1 == 1
                badf[r2-min_win+1] = temp_t[r1]
            end

        end
        bsadf[r2-min_win+1] = maximum(@view temp_t[1:r2-min_win+1])#[1]
    end

But it takes about twice as much time as the recursive OLS version, because the most time-consuming part is computing the matrix inverse. In a data set with 100 observations and a minimum window size of 10, the matrix inverse is computed 4186 times, whereas with recursive OLS only 91 inversions are needed.
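
For reference, the recursive step is just the Sherman–Morrison identity: when one observation is added to a window for which (X'X)⁻¹ is already known, the new inverse is a rank-one update rather than a fresh inversion. A standalone sketch (not the exact code from above):

using LinearAlgebra

X = randn(20, 2)
g = inv(X' * X)              # inverse for the current window

x_new = randn(2)             # one new row of the design matrix
gx = g * x_new
g_new = g - (gx * gx') / (1 + dot(x_new, gx))

# check against a fresh inversion
X2 = [X; x_new']
maximum(abs.(g_new - inv(X2' * X2)))   # ≈ 0 up to rounding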