Ways to speed up this code

Hey everybody,

I want to run the code below on a relatively large dataset consisting of daily firm-level, industry-level, and market-level data, roughly 60 million rows in total. It takes a long time to run because I fit a simple regression for every firm-month.


using DataFrames, GLM

function CalcNonsynchronicityDaily!(df::DataFrame, df_monthly::DataFrame, firms)
    for firm in firms
        @show firm
        # All months in which this firm has daily observations
        months = unique(df[df.permno .== firm, :yearmonth])
        @show n = length(months)
        info = Vector{Union{Missing, Float64}}(undef, n)
        for i in 1:n
            month = months[i]
            # skip the month if the stock traded on fewer than 10 days
            if length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10
                try
                    # regress the firm's daily return on the industry and market returns;
                    # 1 - R^2 is the non-synchronicity measure for this firm-month
                    res = lm(@formula(ret ~ reti + retm), df[(df.permno .== firm) .& (df.yearmonth .== month), :])
                    info[i] = 1 - r2(res)
                catch
                    info[i] = missing
                end
            else
                info[i] = missing
            end
        end
        # write the monthly measures back into the monthly table
        df_monthly[df_monthly.permno .== firm, :info] .= info
    end
end

The try/catch is there because in some cases there can be a collinearity problem, but with at least 10 observations this really should not be an issue.
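To illustrate what the try/catch is guarding against, here is a toy firm-month with a perfectly collinear regressor (made-up data; whether lm actually throws here depends on the GLM.jl version):

using DataFrames, GLM

# Hypothetical firm-month where retm is an exact multiple of reti
toy = DataFrame(ret = randn(15), reti = randn(15))
toy.retm = 2 .* toy.reti

val = try
    1 - r2(lm(@formula(ret ~ reti + retm), toy))
catch
    missing   # older GLM versions throw on a rank-deficient design; newer ones may drop a column instead
end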

I guess I could move this line outside of the function:

months = unique(df[df.permno .== firm, :yearmonth])
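Something like this, precomputing the months for every firm in a single pass before calling the function (months_by_firm is just an illustrative name, and I assume the same column names as above):

# Build firm => months once, instead of filtering 60 million rows per firm
gdf = groupby(df, :permno)
months_by_firm = Dict(key.permno => unique(sub.yearmonth) for (key, sub) in pairs(gdf))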

and I could skip this check and instead try to run the regression for any number of days in a month:

length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10

I am not too comfortable otherwise with using DataFrames efficiently. Are there any suggestions?

Thanks a lot!



I’m not sure if this is much help, but I really like using JuliaDB in place of DataFrames for really large datasets. I routinely work with datasets that have tens of millions of rows and hundreds of columns in JuliaDB (along with Distributed.jl), and it has been really fast for me.
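Just as a rough sketch of the setup (file name and worker count are made up, and double-check the keyword names against the JuliaDB docs):

using Distributed
addprocs(4)                          # made-up number of worker processes
@everywhere using JuliaDB

# Load the daily data as a table split into chunks across the workers
# ("daily_returns.csv" is a placeholder file name)
t = loadtable("daily_returns.csv"; chunks = nworkers())

# Example of the grouping API: number of daily observations per firm
obs_per_firm = groupby(length, t, :permno)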


Have you profiled this to check where the bottlenecks are? It seems to me that the lm call would be expensive enough that you don’t need to worry about whether you’re traversing the DataFrame efficiently.
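For example, something along these lines with the built-in Profile standard library (profiling just a handful of firms here so the run stays short):

using Profile

# Profile a small subset of firms first, then inspect the flat report
@profile CalcNonsynchronicityDaily!(df, df_monthly, firms[1:10])
Profile.print(format = :flat, sortedby = :count)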

Is there a reason why you need to run regressions on subsamples rather than using the whole data set with appropriate dummies?

I sliced the DataFrame so that I pass, for example, 100 firms at a time, so that df[(df.permno .== firm) .& (df.yearmonth .== month), :] does not search through 60 million rows for every regression. I also took out this check:

length(df[(df.permno .== firm) .& (df.yearmonth .== month), :ret]) >= 10

Together this brought about a 20-30x speedup.
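Roughly speaking, the slicing looks something like this (the chunk size and filtering details here are illustrative rather than the exact code I used):

# Pass the firms to the function in chunks of 100, filtering df down to
# just those firms first so each regression only searches a small slice
for chunk in Iterators.partition(firms, 100)
    firmset = Set(chunk)
    sub = df[in.(df.permno, Ref(firmset)), :]
    CalcNonsynchronicityDaily!(sub, df_monthly, chunk)
end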

I’ll give it a spin at some point for sure, thank you for the suggestion!