Deleting rows from a dataframe containing stock symbols not found in another dataframe

Nash · September 7, 2022, 12:34pm

I have two dataframes, df_A and df_B, each containing stock data (of different kinds) and each containing a column of stock symbols. However, df_A is longer than df_B because it contains more stock symbols than df_B.

How do I delete the rows in df_A that contain stock symbols not found in df_B?

junder873 · September 7, 2022, 12:42pm

Probably the easiest way is using innerjoin, e.g.:

using DataFrames

df_a = DataFrame(ticker=[:a, :b, :c], ret=randn(3))

df_b = DataFrame(ticker=[:a, :b], price=[2, 3])

innerjoin(
    df_a,
    df_b,
    on=:ticker
)
# 2×3 DataFrame
#  Row │ ticker  ret          price 
#      │ Symbol  Float64      Int64 
# ─────┼────────────────────────────
#    1 │ a       -0.0907065       2 
#    2 │ b        0.00463602      3

Nash · September 7, 2022, 12:46pm

The problem I have is that the dataframes are very large and my system runs out of memory during an inner join. I was hoping to delete rows that are not necessary, and then proceed to a step where I join the two dataframes.

junder873 · September 7, 2022, 12:58pm

Are there repeated stock symbols in df_b? If so, the result of an innerjoin like this can be bigger than the original, which would be consistent with your out-of-memory error. You can check this by passing validate=(false, true) to innerjoin, which throws an error if the keys in df_b are not unique. One way I have gotten around this is to create a DataFrame based on df_b that holds just the unique stock symbols before the join:

df_c = unique(df_b[:, [:ticker]])

innerjoin(
    df_a,
    df_c,
    on=:ticker,
    validate=(false, true)
)
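To make the row blow-up concrete, here is a minimal sketch on made-up toy data (the values are hypothetical and the column name ticker follows the example above). It assumes both frames contain a repeated symbol, which may or may not be the case for the real data:

```julia
using DataFrames

# Toy data (hypothetical): :a appears twice on each side.
df_a = DataFrame(ticker=[:a, :a, :b], ret=[0.1, 0.2, 0.3])
df_b = DataFrame(ticker=[:a, :a, :b], price=[2, 4, 3])

# Each :a row on the left matches both :a rows on the right,
# so the result has 2*2 + 1 = 5 rows, more than either input.
joined = innerjoin(df_a, df_b, on=:ticker)

# Joining against the unique tickers only can drop rows from df_a
# but never duplicate them.
df_c = unique(df_b[:, [:ticker]])
filtered = innerjoin(df_a, df_c, on=:ticker, validate=(false, true))

println(nrow(joined), " vs ", nrow(filtered))
```

Note that this still builds a new DataFrame rather than shrinking df_a in place.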

digital_carver · September 7, 2022, 5:26pm

subset!(df_A, 
        :stocksyms => s -> (!isnothing).(indexin(s, df_B.stocksyms)))

indexin returns nothing for each element of its first argument that isn’t in the second argument, so this keeps only the rows of df_A whose symbol also appears in df_B.
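A quick way to see what the mask looks like, using plain vectors with made-up symbols instead of the real columns:

```julia
# indexin is part of Base, so no packages are needed here.
a = ["AAPL", "MSFT", "XYZ"]   # hypothetical symbols, standing in for df_A.stocksyms
b = ["MSFT", "AAPL"]          # hypothetical symbols, standing in for df_B.stocksyms

idx = indexin(a, b)           # position of each element of a in b, or nothing
keep = (!isnothing).(idx)     # Boolean mask: true where the symbol was found

println(idx)
println(keep)
```

Here "XYZ" is the only symbol missing from b, so it is the only entry where idx holds nothing and keep is false.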

jules · September 7, 2022, 5:52pm

Or just make a set of the symbols separately and then filter based on that.
Something like (untested):

syms = Set(df_B.stocksyms)
subset!(df_A, :stocksyms => ss -> [s in syms for s in ss])
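For reference, a small self-contained sketch of the Set approach on toy data. The values are made up, the column name stocksyms follows the posts above, and ByRow(in(syms)) is just a row-wise way of writing the same membership test as the comprehension:

```julia
using DataFrames

# Toy data (hypothetical values).
df_A = DataFrame(stocksyms=["AAPL", "MSFT", "XYZ", "AAPL"],
                 ret=[0.1, 0.2, 0.3, 0.4])
df_B = DataFrame(stocksyms=["MSFT", "AAPL"], price=[300, 150])

# Build the lookup set once; membership tests are then hash lookups.
syms = Set(df_B.stocksyms)

# Keep only rows of df_A whose symbol is in the set, modifying df_A in place.
subset!(df_A, :stocksyms => ByRow(in(syms)))

println(df_A)
```

Because subset! works in place and the set only holds the distinct symbols, this should need far less extra memory than a join, though that has not been measured here.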
