Extremely slow OLS simulation (>1h)

Hello,

I am new to Julia, so I am posting this here rather than in a category that might be more specific.
I am trying to run a standard OLS simulation: generate a data frame, run an OLS on it, repeat for N simulations, collect the estimates, and see how they evolve as the sample size grows. However, my code is extremely slow. About 100 minutes! Can someone please help me figure out what is causing the performance issues? Alternatively, what would be the “Julia way” of running this simulation?

P.S.: In convergence(a,b), a small (a,b), like (1, 10), is relatively fast. For my desired (a,b) = (1, 10^5), it takes almost 1h30 to generate the output vector!

using Distributions, Plots, StatsPlots, Random, GLM, DataFrames, Gadfly

p = 1/365
default(fmt=:png)
# _A: GENERATING THE DATA

function gen_data(n, α, β) #Takes 3 inputs, n sample size, α and β parameters of y= α+βx
                            #returns a dataframe of rand variables of interest

    x= rand(Binomial(1,p),n)  # x_i ~ Bernoulli(p) with p = 1/365
    ϵ=randn(n)              # Simulate ε_i~N(0,1)
    y= α .+ β*x + ϵ
    DataFrame(
    x=x,
    ϵ=ϵ,
    y=y
    )
end
#OLS
function ols(data)  #runs an OLS, returns β̂
    β̂= lm(@formula(y ~ x), data) #OLS
    coef(β̂)[2] #Returning β̂
end

# Repeating X times
function simulations(x,n, α, β)
    β̂_collect=zeros(x) # creates a vector of length x = number of simulations
    for i in 1:x
        data=gen_data(n, α, β)
        β̂_collect[i]= ols(data) # stores β̂ of simulation i at position i
    end
    β̂_collect
end

#Asymptotic Properties of E(β̂) and Var(β̂)
function convergence(a,b)
    expected_value=[]
    expected_var=[]
    for n in a:b
        exp=simulations(1000,n,0,1)
        push!(expected_value,mean(exp))
        push!(expected_var, var(exp))
    end
    [expected_value, expected_var]
end

result=convergence(300, 10^4)
plot(result)

Have you tried profiling to see where the time is actually spent? If you are in VS Code, use the @profview macro in front of the function call.
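
Outside VS Code, a minimal sketch with the standard library Profile works too (I shrank the workload here so it finishes quickly; the call to simulations is the one from your post):

using Profile
simulations(10, 1000, 0, 1)              # run once first so compilation isn't profiled
@profile simulations(100, 1000, 0, 1)    # collect samples on the hot path
Profile.print(format=:flat, sortedby=:count)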

I suspect you are allocating many many gigabytes, and that most of the time is spent allocating memory and garbage-collecting it.
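
A quick way to check that, assuming the functions from your post are already defined, is @time, which reports the allocation count and GC time alongside the runtime:

simulations(10, 1000, 0, 1)           # warm-up call so compilation isn't measured
@time simulations(1000, 5000, 0, 1)   # watch the allocations and "gc time" figures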

There are ways of speeding this up by an order of magnitude, but you realize that you’re running about 10 million OLS regressions, many of which will have a collinearity problem, right?
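
For intuition on how often that happens: with p = 1/365, the probability that a sample of size n contains no ones at all, making x constant and perfectly collinear with the intercept, is (1-p)^n:

(1 - 1/365)^300   # ≈ 0.44, so nearly half of the n = 300 samples are degenerate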


To get the OLS slope coefficient, you can just use the formula
(x.-mean(x))\y
which saves having to create a DataFrame and call lm(). It does require
using LinearAlgebra, Statistics
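
As a quick sanity check on some toy data (illustrative values only), the shortcut matches the slope from lm():

using LinearAlgebra, Statistics, GLM, DataFrames
x = rand(0:1, 1000)
y = 2.0 .+ 3.0 .* x .+ randn(1000)
β̂_direct = (x .- mean(x)) \ y                             # demeaned-x least squares
β̂_lm = coef(lm(@formula(y ~ x), DataFrame(x=x, y=y)))[2]
β̂_direct ≈ β̂_lm                                           # true up to floating point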


This may not be what you had in mind, but on my machine, using all cores (specified with the command-line option -t numberofphysicalcores), the code below computes 10 million OLS regressions, with on average about 5,000 observations each, in about 50 seconds. There are undoubtedly further and more elegant optimizations to be had.

The main things I changed are:

  1. making p a constant
  2. not generating many dataframes
  3. simplifying the computation of the ols estimator
  4. not reallocating data many times
  5. some simple parallelization

(As an aside, for plotting you really don’t need to compute the numbers for every sample size; a sketch follows.)
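
For example, a log-spaced grid of roughly 50 sample sizes (the count is arbitrary) could replace looping over every n in a:b:

ns = unique(round.(Int, exp.(range(log(300), log(10^4); length=50))))
# then loop over ns instead of a:b in convergence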

Hth

using Distributions, Statistics  # Statistics provides the cov and var used in ols below

const p=1/365  # with p this small, x is often all zeros, so var(x) = 0 and the slope is Inf or NaN

function gen_data!(x, y, α, β) 
    for i ∈ eachindex( x )
        x[i]= rand(Binomial(1,p))  
        y[i]= α + β*x[i] + randn()
    end
end


ols(y,x) = cov(x,y) / var(x)


function simulation( x, y, α, β) 
    gen_data!( x,y, α, β)
    return ols(y,x)
end

function simulations(m,n, α, β)
    mn = vr = 0.0 
    x = zeros( Int, n )
    y = zeros( n )
    for i in 1:m
        β̂ = simulation(x,y,α,β) # one simulated β̂; the preallocated buffers x and y are reused
        mn += β̂
        vr += β̂ * β̂
    end
    mn /= m
    vr = vr / m - mn^2
    mn, vr
end

function convergence(a,b)
    expected_value=zeros(b-a+1)
    expected_var=zeros(b-a+1)
    Threads.@threads :dynamic for n in a:b
        expected_value[n-a+1], expected_var[n-a+1] = simulations(1000,n,0,1)
    end
    [expected_value, expected_var]
end

@time result=convergence(300, 10000)  
# purists would like to run this twice to remove compilation overhead from comparisons, but second order here
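
# one simple way to remove that compilation overhead, for what it's worth:
convergence(300, 310)                   # small warm-up call to compile everything
@time result = convergence(300, 10000)  # now times only the actual work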

This was extremely helpful, thanks a lot! It also sent me down a multithreading rabbit hole. What a fascinating topic!
With some minor tweaks, I worked around the NaNs problem too.
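
For anyone who hits the same problem, a minimal sketch of one such guard (not necessarily the exact tweak): redraw x whenever it comes out constant, so that var(x) > 0 and the slope stays finite.

function gen_data!(x, y, α, β)
    while true
        for i in eachindex(x)
            x[i] = rand(Binomial(1, p))
        end
        0 < sum(x) < length(x) && break  # constant x ⇒ var(x) = 0 ⇒ Inf/NaN slope
    end
    for i in eachindex(x)
        y[i] = α + β*x[i] + randn()
    end
    # note: this conditions the simulation on non-degenerate samples
end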


Fair point; it’s not orthogonal to the purposes of my simulation, though.