I’m new to Julia and to programming in general. I need to run many linear regressions on exactly the same large dataset (5 GB+). I’ve written code that runs each regression on a different worker. Essentially, I’m copying the dataframe to every worker and running the loop with `@parallel`. However, this consumes a lot of memory, and sometimes Julia runs out of memory.
Is there a way to load the dataframe on a single worker and allow other workers() to use it as read-only?
I’d appreciate any other tips on how to improve the code as well. Thanks a lot in advance.
My code is the following:
```
addprocs(12)
nprocs()
using JuliaDB, CSV
@everywhere using DataFrames, RDatasets, FixedEffectModels, JLD #, FileIO, StatFiles, DataFrames, CSV, JuliaDB
df1 = @time JLD.load("iv_sample_baseline_p2_julia.jld","df1")

# making df1 available in all processors
nprocs()
for i in 1:nprocs()
    @spawnat i df1
end

cd("C:/Users/admin-britto/Dropbox/Brazil_Emp_dur/Empirical/Program_State_9/Julia/paralell/project1")
@everywhere include("regression.jl") # define regression function
@everywhere include("definitions.jl") # define global variables used in each worker

# multi processor
nprocs()
res = SharedArray{Float64}(1000,6)
tic()
@parallel for n in 1:36
    i = mod(n-1,6)+1
    j = div(n-1,6)+1
    control = eval(parse("c$i"))
    w = eval(parse("w$j"))
    a = "FixedEffectModels.reg(df1, @model($dep ~ $w $control + (ben_inf ~ iv_pred_w), fe = yearfe , vcov = cluster(pisfe) ))"
    eval(parse("b = $a"))
    res[n,1] = round(coef(b)[length(coef(b))],8)
    res[n,2] = round(vcov(b)[length(coef(b))^2]^(1/2),8)
    res[n,3] = round(coef(b)[length(coef(b))]/(vcov(b)[length(coef(b))^2]^(1/2)),3)
    res[n,4] = nobs(b)
    res[n,5] = i
    res[n,6] = j
end
@everywhere fetch(res)
toc()
```
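To make the read-only sharing idea concrete, here is a minimal sketch of one possible direction, not a tested solution for my data. It assumes Julia 1.x (where `@parallel` became `@distributed` and `tic`/`toc` were removed), that all workers run on the same machine, and that the columns needed are plain `Float64` values that can be copied into a `SharedArray`. The data, the `specs` list and the `ols_slope` helper are hypothetical stand-ins, and ordinary least squares via `\` replaces `FixedEffectModels.reg` purely for illustration.

```julia
using Distributed
addprocs(2)                      # assumption: local workers on the same machine
@everywhere using SharedArrays

# Hypothetical stand-in data: column 1 is the outcome, columns 2-4 are regressors.
# The SharedArray is created after addprocs so every local worker is mapped to it.
n = 1_000
X = SharedArray{Float64}(n, 4)
X[:, 2:4] .= randn(n, 3)
X[:, 1] .= 1.0 .+ 2.0 .* X[:, 2] .+ 0.1 .* randn(n)

# Hypothetical specifications: each one is a set of regressor columns.
specs = [[2], [2, 3], [2, 3, 4]]

# Plain OLS on the selected columns; returns the coefficient on the first regressor.
@everywhere function ols_slope(data, cols)
    y = data[:, 1]                                   # copies only the columns it needs
    Z = hcat(ones(size(data, 1)), data[:, cols])     # intercept plus chosen regressors
    beta = Z \ y                                     # least-squares solution
    return beta[2]
end

# Sending X to a worker transfers a reference to the shared segment, not the data.
slopes = pmap(cols -> ols_slope(X, cols), specs)
println(slopes)
```

In this sketch the full matrix lives in memory once, and each task copies only the columns for its own specification. Whether the same approach can be combined with a `DataFrame` and `FixedEffectModels.reg` is exactly what I am unsure about.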