Simulation of regression coefficients

I’m still very much a novice in Julia, but I’m already using it productively for many tasks. In my research area, R and Stata are the main alternatives, so Julia really has a number of advantages in terms of speed and syntax. (Great job everyone!)

I wrote a short blog post as a response to this interesting post on coding up efficient simulations in R, in order to show how nicely Julia handles this task. Maybe some of you are interested in this, and I’d be interested in learning whether there are any obvious improvements to my code.

2 Likes

This is the fastest I can come up with on the CPU (before I go to sleep). You might be able to squeeze more out of it on a GPU.

On my computer, my version is 4x faster than the fastest version in your blog yet. :slight_smile:

But the idea is that you can do multi-threading. And you avoid memory allocating in gen_data by preallocating matrices and just re-using them.

But for multi-threading, you are better off creating multiple copies of the data so your threads can work concurrently and don’t have to share data (leading to false-sharing).

Also, you can save time by not going through with DataFrames and just create the matrix directly.

Also, if you use GPU you might want to turn off multi-threading.

function gen_data2!(X, y, x1_dmean, x2_dmean)
  ## total time periods in the the panel = 500
  tt = 500

  # x1 and x2 covariates
  x1_A = 1 .+ rand(std_normal, tt)
  x1_B = 1/4 .+ rand(std_normal, tt)
  x2_A = 1 .+ x1_A .+ rand(std_normal, tt)
  x2_B = 1 .+ x1_B .+ rand(std_normal, tt)

  # outcomes (notice different slope coefs for x2_A and x2_B)
  y_A = x1_A .+ 1*x2_A + rand(std_normal, tt)
  y_B = x1_B .+ 2*x2_B + rand(std_normal, tt)
    x1 = vcat(x1_A, x1_B)
    x2 = vcat(x2_A, x2_B)

    X[:, 2] .= vcat(fill(0, length(x1_A)), fill(1, length(x1_B)))
    X[:, 3] .= x1
    X[:, 4] .= x2
    X[:, 5] .= (x1 .* x2)

    y .= vcat(y_A, y_B)

    x1_dmean .= vcat(x1_A .- mean(x1_A), x1_B .- mean(x1_B))
    x2_dmean .= vcat(x2_A .- mean(x2_A), x2_B .- mean(x2_B))
end

function coefs_lm_formula4(X, y, x1_dmean, x2_dmean)
  mod_level = fastfit(X, y)
  @inbounds X[:, 5] .= x1_dmean .* x2_dmean
  mod_dmean = fastfit(X, y)
  (mod_level[end], mod_dmean[end])
end

function run_simulations5(nsim)
  sims = zeros(nsim, 2);
  X = [zeros(Float64, 1000, 5) for _ in 1:nthreads()]
  y = [zeros(Float64, 1000) for _ in 1:nthreads()]
  x1_dmean = [zeros(Float64, 1000) for _ in 1:nthreads()]
  x2_dmean = [zeros(Float64, 1000) for _ in 1:nthreads()]

  # set constant
  for i in 1:nthreads()
    X[i][:, 1] .= 1
  end

  @threads for i in 1:nsim
    gen_data2!(X[threadid()], y[threadid()], x1_dmean[threadid()], x2_dmean[threadid()])
    @inbounds sims[i, 1], sims[i, 2] = coefs_lm_formula4(X[threadid()], y[threadid()], x1_dmean[threadid()], x2_dmean[threadid()])
  end
  sims
end
2 Likes

Thanks, very interesting!
I hadn’t thought about preallocating the matrices and vectors this way, but that seems like a low hanging fruit. The main performance benefit of your version seems to come from the multithreading, though. I think that’s the part where the readability takes a relatively strong hit, but, of course, if time is of the essence, that’s an obvious way to go. A version with preallocation, but without multithreading, runs in about 1.4s on my machine, so it’s a small improvement over the last version on the blog.

I wonder if someone has made a macro for that.

also, the other issue is allocation. The code seems to be allocating a lot. So I’d look at these lines and see if allocation can be reduce by consolidating some of them

1 Like

Someone posted a version of Twitter that reduces allocations a lot if you’re interested: https://twitter.com/elbersb/status/1409454534666207232/retweets/with_comments

Nice. But I would say allocation seems to be main unresolved issue with the CPU version.

Wonder if anyone will attempt a GPU version?

Yeah, would be interesting. As I wrote in the blog post, I’m not so interested in getting the fastest version possible, although it’s fun to try to improve the speed of course :slight_smile:

1 Like

I get the following timings:

VERSION = v"1.6.1"
n = 20000

Original version (with DataFrame, single thread):
  1.836 s (2040002 allocations: 4.65 GiB)

Optimized version (with DataFrame, single thread):
  918.817 ms (160062 allocations: 5.94 MiB)

Multi-thread optimized version (with DataFrame): 
Threads.nthreads() = 12
  168.350 ms (160786 allocations: 7.56 MiB)

Multi-thread non-dataframe version (run_simulations5): 
Threads.nthreads() = 12
  714.319 ms (760130 allocations: 2.64 GiB)

Optimized non-dataframe version (single thread): 
  869.904 ms (160017 allocations: 5.90 MiB)

Multi-thread optimized non-dataframe version: 
Threads.nthreads() = 12
  161.730 ms (160246 allocations: 6.97 MiB)

Source code: Getting even faster by reducing allocations and by multi-threading.ipynb (nbviewer)

On my laptop computer, run_simulations5 with 12 threads is 2.5 times faster than the original.

The reducing-alloctions optimized version with single thread is 2 times faster than the original. Thus, in this case on my laptop computer, reducing allocations is almost as efficient as multi-threading.

The reducing-alloctions optimized version with 12 threads is more than 10 times faster than the original.

I enjoyed the Julia game of reducing allocations and multi-threading. It is very fun!

5 Likes