I have two dataframes, df_A, and df_B, each containing stock data (of different kinds) and each containing a column of stock symbols. However, df_A is longer than df_B, because it contains more stock symbols than df_B.
How do I delete the rows in df_A that contain stock symbols not found in df_B?
The problem I have is that the dataframes are very large and my system runs out of memory during an inner join. I was hoping to delete rows that are not necessary, and then proceed to a step where I join the two dataframes.
Are there repeated stock symbols in df_b? If so, the result of an innerjoin like this can be bigger than the original (which is consistent with your out of memory error). You can check this by adding a validate=(false, true) to the innerjoin. One way I have gotten around this is to create a DataFrame based on df_b that is just the unique stock symbols before the join:
subset!(df_A,
:stocksyms => s -> (!isnothing).(indexin(s, df_B.stocksyms)))
indexin returns nothing in places where its first argument has an element that isn’t in the second argument, so this filters the dataframe based on that.