Deleting rows from a dataframe containing stock symbols not found in another dataframe

I have two dataframes, df_A, and df_B, each containing stock data (of different kinds) and each containing a column of stock symbols. However, df_A is longer than df_B, because it contains more stock symbols than df_B.

How do I delete the rows in df_A that contain stock symbols not found in df_B?

Probably the easiest way is using innerjoin, e.g.:

df_a = DataFrame(ticker=[:a, :b, :c], ret=randn(3))

df_b = DataFrame(ticker=[:a, :b], price=[2, 3])

# 2×3 DataFrame
#  Row │ ticker  ret          price 
#      │ Symbol  Float64      Int64 
# ─────┼────────────────────────────
#    1 │ a       -0.0907065       2 
#    2 │ b        0.00463602      3

The problem I have is that the dataframes are very large and my system runs out of memory during an inner join. I was hoping to delete rows that are not necessary, and then proceed to a step where I join the two dataframes.

Are there repeated stock symbols in df_b? If so, the result of an innerjoin like this can be bigger than the original (which is consistent with your out of memory error). You can check this by adding a validate=(false, true) to the innerjoin. One way I have gotten around this is to create a DataFrame based on df_b that is just the unique stock symbols before the join:

df_c = unique(df_b[:, [:ticker]])

    validate=(false, true)
        :stocksyms => s -> (!isnothing).(indexin(s, df_B.stocksyms)))

indexin returns nothing in places where its first argument has an element that isn’t in the second argument, so this filters the dataframe based on that.

Or just make a set of the symbols separately and then filter based on that.
Something like (untested):

syms = Set(df_B.stocksyms)
subset!(df_A, :stocksyms => ss -> [s in syms for s in ss])