Improvement on appending data using for loop

Hi there,

I’m from a non-CS background and new to Julia. My current project involves a matching process: for each treated unit (D=1), I need to find units in the donor space (D=0) whose outcome distance over a pre-period (before the treated unit gets treated) has a variance below a threshold (e.g. var(y_j - y_i) < 0.005).
My code is posted below. It works, but it is very slow. Basically, I did the above matching for one treated unit, which returns a DataFrame called “matched” containing the treated unit and all matched donor units. Then I wrote the function below to run the same matching on the other treated units and keep appending the results to the “matched” DataFrame. In total I have 3644 treated units. My CPU usage was only around 20%, though I’m not sure whether this process can use multithreading.
Any advice on how to improve upon it is appreciated, thanks!

Hi @FlatWhite17, welcome to the Julia Discourse.
Some advice for future posts:

  • Don't post pictures of your code; copy it into the code environment ``` ``` instead. When you ask how to make something faster, people may want to quickly run your snippets on their own computer, and they do not want to copy code by hand.
  • Make it as easy as possible for other people to read, understand and execute your code snippets on their machines. To that end, reduce your snippet as much as possible and avoid variables you did not define in your post. Best of all is a minimal working example that people can quickly copy and paste into a Julia session.

You have already divided your function iter_append into parts marked by comments. Break the function down into those smaller parts and see which one consumes the most time/memory. You can do that quick and dirty with @time begin ... end. Once you have found the lines that take the most time, you can benchmark them separately, ideally with the package BenchmarkTools and @btime or @benchmark. This helps you write minimal working examples as well. Alternatively, you can use a profiler, e.g. Profile.
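
To make the timing advice concrete, here is a rough sketch that uses only Base. The two step functions are hypothetical stand-ins for the commented sections of iter_append, not the actual code from the post. Note that the first call of each function includes compilation time, which is one reason @btime from BenchmarkTools gives more reliable numbers.

```julia
# Hypothetical stand-ins for the commented sections of iter_append.
step_filter(xs) = filter(x -> x > 0.5, xs)
step_variance(xs) = sum(abs2, xs .- sum(xs) / length(xs)) / (length(xs) - 1)

xs = rand(1_000_000)

# Quick and dirty: @time prints time and allocations and returns the value.
kept = @time step_filter(xs)
v = @time step_variance(kept)

# @elapsed returns the seconds as a number, handy for comparing parts.
t_filter = @elapsed step_filter(xs)
t_var = @elapsed step_variance(kept)
println("filter: ", t_filter, " s, variance: ", t_var, " s")
```

Whichever part dominates is the one worth turning into a minimal working example.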

In the first line:

tr_all = filter!(row -> !ismissing(fac_treated[!, outcome]), fac_treated)

You take a row argument that you use for nothing. Are you sure this is what you want?

Following @Henrique_Becker, I suggest using dropmissing with its optional column argument:

tr_all = dropmissing(fac_treated, :outcome)
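
For comparison, here is a small sketch of what the quoted filter! predicate does next to the row-wise and dropmissing versions. The toy data and the column name y are made up, and it assumes that outcome in the original code is a variable holding the column name (if the column is literally called outcome, :outcome is the right argument).

```julia
using DataFrames

# Toy stand-in for fac_treated; the values and the column name :y are made up.
fac_treated = DataFrame(ACCPT_ID = [1, 1, 2, 2], y = [0.1, missing, 0.3, 0.4])
outcome = :y  # assumed: `outcome` is a variable holding the column name

# The quoted predicate tests the whole column vector, which is never
# `missing` itself, so every row is kept.
kept_all = filter(row -> !ismissing(fac_treated[!, outcome]), fac_treated)

# Row-wise version: test the value in the current row.
by_row = filter(row -> !ismissing(row[outcome]), fac_treated)

# Column-wise version: only the one column is passed to the predicate.
by_col = filter(outcome => !ismissing, fac_treated)

# dropmissing restricted to that one column.
dropped = dropmissing(fac_treated, outcome)
```

The non-mutating filter is used here so the toy table survives for each comparison; filter! and dropmissing! are the in-place counterparts.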

Your code does the operation for each row, so it is very expensive.
I suggest you review the DataFrames documentation. For instance, you can obtain the mean like this (be careful, I did not check it):

base_yr = combine(filter(:ACCPT_ID => ==(fac), tr_all), :T0 => mean)[1,1]
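
A self-contained sketch of that line on toy data, assuming mean comes from the Statistics standard library. The table contents and the value of fac are invented for illustration. The last line is an optional variation that computes the mean for every unit in one pass, which may help if the filter currently runs inside a loop.

```julia
using DataFrames, Statistics

# Toy stand-in for tr_all; the values are made up.
tr_all = DataFrame(ACCPT_ID = [1, 1, 2], T0 = [2001, 2003, 2010])
fac = 1

# Filter to one unit, reduce T0 to its mean, and pull out the scalar with [1, 1].
base_yr = combine(filter(:ACCPT_ID => ==(fac), tr_all), :T0 => mean)[1, 1]

# Same number without building intermediate data frames.
base_yr_direct = mean(tr_all.T0[tr_all.ACCPT_ID .== fac])

# All units at once: one row per ACCPT_ID with its mean T0.
base_by_unit = combine(groupby(tr_all, :ACCPT_ID), :T0 => mean => :base_yr)
```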

Thanks, @dmolina! That’s very helpful.

I followed your suggestion and also saved the DataFrame from each iteration in an array, then used vcat() to combine them at the end (instead of appending in each iteration). It is a lot faster now.
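
For anyone who finds this later, the collect-then-concatenate pattern looks roughly like the sketch below. match_one is a hypothetical placeholder for the matching of a single treated unit, not the actual function from the post, and the column names are made up.

```julia
using DataFrames

# Hypothetical stand-in for the matching of one treated unit.
match_one(id) = DataFrame(treated_id = fill(id, 2), donor_id = [10id + 1, 10id + 2])

treated_ids = 1:5

# Collect one DataFrame per treated unit...
pieces = [match_one(id) for id in treated_ids]

# ...and concatenate once at the end.
matched = reduce(vcat, pieces)
```

vcat(pieces...) gives the same result; reduce(vcat, pieces) avoids splatting a long array of DataFrames.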