Extremely slow OLS simulation >1h

This may not be what you had in mind, but on my machine, using all physical cores (set with the command-line option `-t numberofphysicalcores`), the code below computes 10 million OLS regressions with on average 5000 observations each in about 50 seconds. There are undoubtedly further, more elegant optimizations to be had.

The main things I changed are:

  1. making p a constant
  2. not generating many dataframes
  3. simplifying the computation of the ols estimator
  4. not reallocating data many times
  5. some simple parallelization

(As an aside: for plotting you really don't need to compute the estimates for every single sample size.)
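For example, a few dozen logarithmically spaced sample sizes usually give a perfectly smooth convergence plot. A minimal sketch (the variable name `sizes` and the count of 50 are illustrative; the endpoints 300 and 10000 are taken from the `convergence` call below):

```julia
# ~50 log-spaced sample sizes instead of all 9701 between 300 and 10000
sizes = unique(round.(Int, exp10.(range(log10(300), log10(10000); length = 50))))
```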

Hth

using Distributions, Statistics  # Statistics provides cov and var

const p = 1/365  # such a small p will produce lots of Infs or NaNs (see below)
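To see why this `p` is problematic: whenever a sample happens to contain no successes at all, `var(x) == 0` and the estimator below returns `NaN` (from `0/0`) or `Inf`. The probability of an all-zero sample of size `n` is `(1 - p)^n`, which is far from negligible at the smaller sample sizes used here:

```julia
# chance that a sample of n = 300 Bernoulli(1/365) draws contains no successes
(1 - 1/365)^300   # ≈ 0.44, so nearly half of the n = 300 samples degenerate
```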

function gen_data!(x, y, α, β)
    for i ∈ eachindex(x)
        x[i] = rand(Binomial(1, p))
        y[i] = α + β * x[i] + randn()
    end
end
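As a further micro-optimization (not benchmarked here), the `Binomial(1, p)` draw can be replaced by comparing a single uniform draw against `p`, avoiding the distribution object entirely. The function name `gen_data_direct!` and the demo arrays are illustrative:

```julia
# variant of gen_data! drawing the Bernoulli directly from a uniform
function gen_data_direct!(x, y, α, β, p)
    for i ∈ eachindex(x)
        x[i] = rand() < p             # Bool promotes to 0/1 in the Int vector
        y[i] = α + β * x[i] + randn()
    end
end

xdemo = zeros(Int, 10_000); ydemo = zeros(10_000)
gen_data_direct!(xdemo, ydemo, 0.0, 1.0, 0.5)
```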


ols(y,x) = cov(x,y) / var(x)
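This one-liner is the textbook slope estimator for a regression with an intercept: `cov(x, y) / var(x)` equals the slope component of the full least-squares solve, so no matrix factorization is needed per simulation. A quick standalone check (the data and variable names are illustrative):

```julia
using Statistics

xs = randn(1_000)
ys = 2 .* xs .+ randn(1_000)
slope_lsq = ([ones(length(xs)) xs] \ ys)[2]   # slope from the explicit least-squares solve
slope_cov = cov(xs, ys) / var(xs)             # the one-liner above
```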


function simulation(x, y, α, β)
    gen_data!(x, y, α, β)
    return ols(y, x)
end

function simulations(m,n, α, β)
    mn = vr = 0.0 
    x = zeros( Int, n )
    y = zeros( n )
    for i in 1:m
        β̂ = simulation(x, y, α, β)  # one OLS estimate per sample; only running sums are kept, no vector is populated
        mn += β̂
        vr += β̂ * β̂
    end
    mn /= m
    vr = vr / m - mn^2
    mn, vr
end
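The running sums compute the mean and the uncorrected (divide-by-`m`) variance via the identity `Var = E[β̂²] - E[β̂]²`, which is why no per-simulation vector is needed. A standalone check against `Statistics` (illustrative data; for badly conditioned inputs Welford's algorithm would be numerically safer, but that is irrelevant at this scale):

```julia
using Statistics

v = randn(10_000)
mn = sum(v) / length(v)
vr = sum(abs2, v) / length(v) - mn^2   # E[v²] - mean², as in simulations above
```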

function convergence(a,b)
    expected_value=zeros(b-a+1)
    expected_var=zeros(b-a+1)
    Threads.@threads :dynamic for n in a:b
        expected_value[n-a+1], expected_var[n-a+1] = simulations(1000,n,0,1)
    end
    [expected_value, expected_var]
end
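Note that `Threads.@threads` only parallelizes anything if Julia was started with several threads (that is the `-t` flag mentioned at the top); otherwise the loop silently runs serially. You can verify before timing:

```julia
# number of threads the :dynamic scheduler can actually use
Threads.nthreads()
```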

@time result=convergence(300, 10000)  
# purists would want to run this twice to exclude compilation overhead from comparisons, but that is second order here