Improvement on appending data using for loop

FlatWhite17 · March 7, 2021, 4:42pm

Hi there,

I’m from a non-CS background and new to Julia. My current project involves a matching process: for each treated unit (D=1), I need to find units in the donor space (D=0) whose outcome distance has a variance over a pre-period (before the treated unit gets treated) smaller than a threshold, e.g. var(y_j - y_i) < 0.005.
My code is posted below. It works, but it is very slow. Basically, I did the above matching for one treated unit, which returns a DataFrame called “matched” with the treated unit and all matched donor units. Then I wrote the function below to run the same matching process on the other treated units and keep appending the results to the “matched” DataFrame. In total I have 3644 treated units. My CPU usage was only around 20%, though I’m not sure whether this process can use multithreading.
Any advice on how to improve upon it is appreciated, thanks!

jamblejoe · March 14, 2021, 10:46am

Hi @FlatWhite17 , welcome to the Julia discourse.
Some advice for future posts:

Don't post pictures of your code; copy it into a code environment ``` ``` instead. When asked to make something more performant, people may want to quickly run your code snippets on their own computer and do not want to copy code by hand.
Make it as easy as possible for other people to read, understand and execute your code snippets on their machines. To that end, reduce your snippet as much as possible and avoid variables you did not define in your post. Best of all is a minimal working example that people can quickly copy and paste into a Julia session.

You have already divided your function iter_append into parts marked by comments. Break the function down into these smaller parts and see which one consumes the most time/memory. You can do that quick and dirty with @time begin ... end. Once you have found the lines that consume the most time, you can benchmark them separately, ideally with the package BenchmarkTools and @btime or @benchmark. This helps you write minimal working examples as well. Alternatively, you can use a profiler, e.g. Profile.
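As a rough sketch of that workflow, the snippet below times two sections separately and then runs the sampling profiler. The functions step_filter and step_variance are hypothetical stand-ins for commented sections of a larger function, not code from the original post, and only Base and the Profile standard library are used (BenchmarkTools would need to be installed separately).

```julia
using Profile  # standard library; BenchmarkTools is a separate package and is not used here

# Hypothetical stand-ins for two commented sections of a larger function.
step_filter(x) = [v for v in x if v > 0.5]
step_variance(x) = sum(abs2, x .- sum(x) / length(x)) / (length(x) - 1)

data = rand(10^6)

# Quick and dirty: time each section separately.
# The first call includes compilation, so time a second call as well.
@time kept = step_filter(data)
@time kept = step_filter(data)

# @elapsed returns the elapsed seconds as a number, handy for comparing sections.
t_filter = @elapsed step_filter(data)
t_var = @elapsed step_variance(kept)
println("filter: ", t_filter, " s, variance: ", t_var, " s")

# Sampling profiler: shows where time is spent inside the calls.
Profile.clear()
@profile for _ in 1:20
    step_variance(step_filter(data))
end
Profile.print(maxdepth = 8)
```

Whichever section dominates the timings is the one worth reducing to a minimal working example.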

Henrique_Becker · March 14, 2021, 2:05pm

In the first line:

tr_all = filter!(row -> !ismissing(fac_treated[!, outcome]), fac_treated)

You take a row argument that you use for nothing. Are you sure this is what you want?
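To illustrate the point: ismissing applied to a whole column vector returns false, so the predicate above is true for every row and nothing gets filtered. Below is a minimal sketch of a row-wise version. The toy data and the column name y are made up for illustration, and it assumes that outcome is a variable holding the column name.

```julia
using DataFrames

# Hypothetical toy data; `outcome` is assumed to hold the column name.
fac_treated = DataFrame(ACCPT_ID = [1, 2, 3], y = [0.1, missing, 0.3])
outcome = :y

# The column vector itself is never `missing`, so this is true regardless of the row.
always_true = !ismissing(fac_treated[!, outcome])

# Row-wise check: look at the value in the current row.
tr_all = filter(row -> !ismissing(row[outcome]), fac_treated)

# Equivalent column => predicate form, which avoids building a row object per row.
tr_all2 = filter(outcome => !ismissing, fac_treated)
```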

dmolina · March 14, 2021, 4:07pm

Following @Henrique_Becker, I suggest passing the optional column argument to dropmissing:

tr_all = dropmissing(fac_treated, :outcome)

Your code does the operation for each row, so it is very expensive.
I suggest reviewing the DataFrames documentation. For instance, you can obtain the mean like this (be careful, I did not check it):

base_yr = combine(filter(:ACCPT_ID => ==(fac), tr_all), :T0 => mean)[1,1]
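Here is a small self-contained sketch of both suggestions on made-up data. The column y and all values are hypothetical; ACCPT_ID and T0 are taken from the line above. The last step, computing the mean for every unit at once with groupby, is an extra that may remove the need to filter once per unit inside a loop.

```julia
using DataFrames, Statistics

# Hypothetical toy data; `y` stands in for the outcome column.
tr_all = DataFrame(ACCPT_ID = [1, 1, 2, 2],
                   T0 = [2001, 2003, 2005, 2005],
                   y = [0.1, missing, 0.3, 0.4])

# Drop only the rows where `y` is missing.
clean = dropmissing(tr_all, :y)

# Mean of T0 for a single unit, as in the line above.
fac = 2
base_yr = combine(filter(:ACCPT_ID => ==(fac), clean), :T0 => mean)[1, 1]

# Mean of T0 for every unit in one pass.
means = combine(groupby(clean, :ACCPT_ID), :T0 => mean => :T0_mean)
```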

FlatWhite17 · March 14, 2021, 4:26pm

Thanks, @dmolina! That’s very helpful.

I followed your suggestion and also saved the DataFrame from each iteration in an array, then used vcat( ) to combine them at the end (instead of appending in each iteration). It is a lot faster now.
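A minimal sketch of that pattern, where match_one is a hypothetical stand-in for the matching step for one treated unit and the column names are invented:

```julia
using DataFrames

# Hypothetical stand-in for the matching step for one treated unit.
match_one(id) = DataFrame(treated_id = fill(id, 2), donor_id = [10id + 1, 10id + 2])

treated_ids = 1:5

# Collect one DataFrame per iteration...
pieces = DataFrame[]
for id in treated_ids
    push!(pieces, match_one(id))
end

# ...and concatenate once at the end.
matched = reduce(vcat, pieces)
```

Because each iteration only produces its own DataFrame, the iterations do not depend on a shared “matched” table.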
