Hey everybody,
I want to run the code below on a relatively large dataset of daily firm-level, industry-level and market-level data, roughly 60 million rows in total. It takes a long time to run because it fits a simple regression for every firm-month.
```julia
using DataFrames, GLM

function CalcNonsynchronicityDaily!(df::DataFrame, df_monthly::DataFrame, firms)
    for firm in firms
        @show firm
        # All months in which this firm appears
        months = unique(df[df.permno .== firm, :yearmonth])
        n = length(months)
        info = Vector{Union{Missing, Float64}}(undef, n)
        for i in 1:n
            month = months[i]
            # Do not include the month if the share traded on fewer than 10 days
            if length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10
                try
                    # Daily return regressed on industry and market returns
                    res = lm(@formula(ret ~ reti + retm), df[(df.permno .== firm) .& (df.yearmonth .== month), :])
                    # Nonsynchronicity = 1 - R² of the market model
                    info[i] = 1 - r2(res)
                catch
                    info[i] = missing
                end
            else
                info[i] = missing
            end
        end
        df_monthly[df_monthly.permno .== firm, :info] .= info
    end
end
```
The try/catch is there because in some months there can be a collinearity problem, though with at least 10 observations that should rarely be an issue.
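Incidentally, depending on the GLM.jl version, `lm` accepts a `dropcollinear` keyword that pivots collinear columns out of the design matrix instead of throwing, which might make the try/catch unnecessary. I have not verified this on my data, so treat it as a toy sketch:

```julia
using DataFrames, GLM

# Toy check (made-up data): retm is constant, hence collinear with the
# intercept; with dropcollinear = true the fit should still go through
# instead of throwing, so the try/catch could be dropped.
sub = DataFrame(ret = randn(12), reti = randn(12), retm = fill(0.5, 12))
res = lm(@formula(ret ~ reti + retm), sub; dropcollinear = true)
1 - r2(res)
```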
I guess I could move this line outside of the function:

```julia
months = unique(df[df.permno .== firm, :yearmonth])
```
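Along those lines, maybe building all the per-firm month lists in a single grouped pass would already help, something like this (same column names as above, otherwise untested):

```julia
using DataFrames

# Sketch: one groupby pass instead of scanning the 60M-row df once per firm.
months_by_firm = Dict(
    key.permno => unique(sub.yearmonth)
    for (key, sub) in pairs(groupby(df, :permno))
)
```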
and I could skip this check and instead try to run the regression for any number of days in a month:

```julia
length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10
```
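One thing I noticed myself: the same firm-month filter is currently evaluated twice per iteration (once for the length check, once for the regression), so materializing the subset once should already save work. A sketch of the loop body:

```julia
# Sketch: compute the firm-month subset once and reuse it.
sub = df[(df.permno .== firm) .& (df.yearmonth .== month), :]
if nrow(sub) >= 10
    try
        info[i] = 1 - r2(lm(@formula(ret ~ reti + retm), sub))
    catch
        info[i] = missing
    end
else
    info[i] = missing
end
```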
Beyond that, I am not too comfortable with using DataFrames efficiently. Are there any suggestions?
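In case it helps to see the direction I am considering, here is an untested groupby/combine sketch (same column names as above; the leftjoin! at the end assumes df_monthly has one row per firm-month):

```julia
using DataFrames, GLM

# Untested sketch: compute 1 - R² per firm-month in one grouped pass,
# instead of re-filtering the full df for every firm and month.
function info_for_group(sub)
    nrow(sub) < 10 && return missing  # same 10-trading-day cutoff as above
    try
        return 1 - r2(lm(@formula(ret ~ reti + retm), sub))
    catch
        return missing  # e.g. collinearity within a month
    end
end

df_info = combine(groupby(df, [:permno, :yearmonth])) do sub
    (; info = info_for_group(sub))
end

# Then merge the results back, e.g.:
# leftjoin!(df_monthly, df_info, on = [:permno, :yearmonth])
```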
Thanks a lot!