Extremely slow OLS simulation (>1h)

Hello,

I am new to Julia, so I am posting this here rather than in a category that might be more specific.
I am trying to run a standard OLS simulation: generate a data frame, run an OLS on it, repeat for N simulations, collect the estimates, and see how they evolve as the sample size grows. However, my code is extremely slow. About 100 minutes! Can someone please help me figure out what is causing the performance issues? Alternatively, what would be the “Julia way” of running this simulation?

P.S.: In convergence(a,b), a small (a,b), like (1, 10), is relatively fast. For my desired (a,b) = (1, 10^5), it takes almost 1h30 to generate the output vector!

using Distributions, Plots, StatsPlots, Random, GLM, DataFrames, Gadfly

p = 1/365
default(fmt=:png)
# _A: GENERATING THE DATA

function gen_data(n, α, β) #Takes 3 inputs, n sample size, α and β parameters of y= α+βx
                            #returns a dataframe of rand variables of interest

    x= rand(Binomial(1,p),n)  # x_i ~ Bernoulli(p) with p = 1/365
    ϵ=randn(n)              # Simulate ε_i~N(0,1)
    y= α .+ β*x + ϵ
    DataFrame(
    x=x,
    ϵ=ϵ,
    y=y
    )
end
#OLS
function ols(data)  #runs an OLS, returns β̂
    β̂= lm(@formula(y ~ x), data) #OLS
    coef(β̂)[2] #Returning β̂
end

# Repeating X times
function simulations(x,n, α, β)
    β̂_collect=zeros(x) # creates a vector of length x = number of simulations
    for i in 1:x
        data=gen_data(n, α, β)
        β̂_collect[i]= ols(data) # stores β̂ of simulation i at position i
    end
    β̂_collect
end

#Asymptotic Properties of E(β̂) and Var(β̂)
function convergence(a,b)
    expected_value=[]
    expected_var=[]
    for n in a:b
        exp=simulations(1000,n,0,1)
        push!(expected_value,mean(exp))
        push!(expected_var, var(exp))
    end
    [expected_value, expected_var]
end

result=convergence(300, 10^4)
plot(result)

Have you tried profiling to see where the time is actually spent? If you are in VS Code, use the @profview macro in front of the function call.
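
Outside VS Code, a minimal sketch with the standard library Profile works too (I shrank the workload here so it finishes quickly; the call to simulations is the one from your post):

using Profile
simulations(10, 1000, 0, 1)              # run once first so compilation isn't profiled
@profile simulations(100, 1000, 0, 1)    # collect samples on the hot path
Profile.print(format=:flat, sortedby=:count)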

I suspect you are allocating many many gigabytes, and that most of the time is spent allocating memory and garbage-collecting it.
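
A quick way to check that, assuming the functions from your post are already defined, is @time, which reports the allocation count and GC time alongside the runtime:

simulations(10, 1000, 0, 1)           # warm-up call so compilation isn't measured
@time simulations(1000, 5000, 0, 1)   # watch the allocations and "gc time" figures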

There are ways of speeding this up by an order of magnitude, but you realize that you’re running about 10 million OLS regressions, many of which will have a collinearity problem, right?
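
For intuition on how often that happens: with p = 1/365, the probability that a sample of size n contains no ones at all, making x constant and perfectly collinear with the intercept, is (1-p)^n:

(1 - 1/365)^300   # ≈ 0.44, so nearly half of the n = 300 samples are degenerate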


To get the OLS slope coefficient, you can just use the formula
(x.-mean(x))\y
which saves having to create a DataFrame and call lm(). It does require
using LinearAlgebra, Statistics
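
As a quick sanity check on some toy data (illustrative values only), the shortcut matches the slope from lm():

using LinearAlgebra, Statistics, GLM, DataFrames
x = rand(0:1, 1000)
y = 2.0 .+ 3.0 .* x .+ randn(1000)
β̂_direct = (x .- mean(x)) \ y                             # demeaned-x least squares
β̂_lm = coef(lm(@formula(y ~ x), DataFrame(x=x, y=y)))[2]
β̂_direct ≈ β̂_lm                                           # true up to floating point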


This may not be what you had in mind, but on my machine, using all cores (specified with the command-line option -t numberofphysicalcores), the code below computes 10 million OLS regressions, with on average about 5,000 observations each, in about 50 seconds. There are undoubtedly further and more elegant optimizations to be had.

The main things I changed are:

  1. making p a constant
  2. not generating many dataframes
  3. simplifying the computation of the ols estimator
  4. not reallocating data many times
  5. some simple parallelization

(As an aside, for plotting you really don’t need to compute the numbers for every sample size; a sketch follows.)
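
For example, a log-spaced grid of roughly 50 sample sizes (the count is arbitrary) could replace looping over every n in a:b:

ns = unique(round.(Int, exp.(range(log(300), log(10^4); length=50))))
# then loop over ns instead of a:b in convergence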

Hth

using Distributions, Statistics  # Statistics provides the cov and var used in ols below

const p=1/365  # with p this small, x is often all zeros, so var(x) = 0 and the slope is Inf or NaN

function gen_data!(x, y, α, β) 
    for i ∈ eachindex( x )
        x[i]= rand(Binomial(1,p))  
        y[i]= α + β*x[i] + randn()
    end
end


ols(y,x) = cov(x,y) / var(x)


function simulation( x, y, α, β) 
    gen_data!( x,y, α, β)
    return ols(y,x)
end

function simulations(m,n, α, β)
    mn = vr = 0.0 
    x = zeros( Int, n )
    y = zeros( n )
    for i in 1:m
        β̂ = simulation(x,y,α,β) # one simulated β̂; the preallocated buffers x and y are reused
        mn += β̂
        vr += β̂ * β̂
    end
    mn /= m
    vr = vr / m - mn^2
    mn, vr
end

function convergence(a,b)
    expected_value=zeros(b-a+1)
    expected_var=zeros(b-a+1)
    Threads.@threads :dynamic for n in a:b
        expected_value[n-a+1], expected_var[n-a+1] = simulations(1000,n,0,1)
    end
    [expected_value, expected_var]
end

@time result=convergence(300, 10000)  
# purists would like to run this twice to remove compilation overhead from comparisons, but second order here
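
# one simple way to remove that compilation overhead, for what it's worth:
convergence(300, 310)                   # small warm-up call to compile everything
@time result = convergence(300, 10000)  # now times only the actual work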

This was extremely helpful, thanks a lot! It also sent me down a multithreading rabbit hole. What a fascinating topic!
With some minor tweaks, I worked around the NaNs problem too.
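
For anyone who hits the same problem, a minimal sketch of one such guard (not necessarily the exact tweak): redraw x whenever it comes out constant, so that var(x) > 0 and the slope stays finite.

function gen_data!(x, y, α, β)
    while true
        for i in eachindex(x)
            x[i] = rand(Binomial(1, p))
        end
        0 < sum(x) < length(x) && break  # constant x ⇒ var(x) = 0 ⇒ Inf/NaN slope
    end
    for i in eachindex(x)
        y[i] = α + β*x[i] + randn()
    end
    # note: this conditions the simulation on non-degenerate samples
end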


Fair point; it’s not orthogonal to the purposes of my simulation, though.