Ways to speed up this code

Hey everybody,

I want to run this code on a relatively large dataset consisting of daily firm-level, industry-level and market-level data. The code takes a long time to run because I run a simple regression for every firm-month. The dataset consists of roughly 60 million rows.

```julia
using DataFrames, GLM

function CalcNonsynchronicityDaily!(df::DataFrame, df_monthly::DataFrame, firms)
    for firm in firms
        @show firm
        # Number of months for a firm
        months = unique(df[df.permno .== firm, :yearmonth])
        @show n = length(months)
        info = Array{Union{Missing, Float64}}(undef, n, 1)
        for i in 1:n
            month = months[i]
            # do not include the month if the share traded for fewer than 10 days
            if length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10
                try
                    res = lm(@formula(ret ~ reti + retm), df[(df.permno .== firm) .& (df.yearmonth .== month), :])
                    info[i] = 1 - r2(res)
                catch
                    info[i] = missing
                end
            else
                info[i] = missing
            end
        end
        df_monthly[df_monthly.permno .== firm, :info] .= info[:]
    end
end
```

The try/catch is there because in some cases there might be a problem with collinearity, but with at least 10 observations this really should not be an issue.

I guess I could move this line outside of the function:

months = unique(df[df.permno .== firm, :yearmonth])

and I could skip this check and instead try to run the regression for any number of days in a month:

length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10

I am not too comfortable otherwise with using DataFrames efficiently. Are there any suggestions?

Thanks a lot!


> I am not too comfortable otherwise with using DataFrames efficiently. Are there any suggestions?

I’m not sure if this is much help but I really like using JuliaDB in place of DataFrames for really large datasets. I routinely work with datasets that have tens of millions of rows and hundreds of columns in JuliaDB (along with Distributed.jl) and it has been really fast for me.


Have you profiled this to check where the bottlenecks are? It seems to me that the lm call would be expensive enough that you don’t need to worry about whether you’re traversing the DataFrame efficiently.
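Profiling is easy to do with the `Profile` standard library. A minimal sketch (the `slow_sum` function here is just a hypothetical stand-in for your regression loop, not code from your post):

```julia
using Profile

# Stand-in for the expensive inner loop you want to profile
function slow_sum(v)
    s = 0.0
    for x in v
        s += sqrt(abs(x))
    end
    return s
end

v = randn(10^6)
slow_sum(v)            # run once first so compilation time is not profiled
@profile slow_sum(v)   # collect samples while the function runs
Profile.print(maxdepth = 5)  # show where the samples landed
```

If most samples land inside `lm`, restructuring the DataFrame access won't buy you much; if they land in the boolean indexing, it will.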

Is there a reason why you need to run regressions on subsamples rather than using the whole data set with appropriate dummies?

I sliced the dataframe so that I pass, for example, 100 firms at a time, which means df[(df.permno .== firm) .& (df.yearmonth .== month), :] does not search through 60 million rows for every regression. I also took out this line

if length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10

Together this brought about a 20-30x speedup.
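For reference, the repeated boolean filtering can also be avoided entirely with a single groupby pass, which builds the firm-month index once. A sketch assuming the GLM setup from the original post (the function name `calc_info_grouped` and the returned results table are my own, not from the thread):

```julia
using DataFrames, GLM

function calc_info_grouped(df::DataFrame)
    results = DataFrame(permno = Int[], yearmonth = Int[],
                        info = Union{Missing, Float64}[])
    # groupby indexes the firm-months once, instead of scanning every
    # row with a boolean mask for each regression
    for sub in groupby(df, [:permno, :yearmonth])
        info = missing
        if nrow(sub) >= 10
            try
                res = lm(@formula(ret ~ reti + retm), sub)
                info = 1 - r2(res)
            catch
                # leave as missing if the regression fails (e.g. collinearity)
            end
        end
        push!(results, (sub.permno[1], sub.yearmonth[1], info))
    end
    return results
end
```

The result could then be joined back onto the monthly table with `leftjoin` on `:permno` and `:yearmonth` rather than assigning through a mask per firm.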

I’ll give it a spin at some point for sure, thank you for the suggestion!