This question is very close to this Stackoverflow question. The primary issue is merging two dataframes with a restriction based on date ranges. This is a common issue in financial data, in my case, stock returns.
The easiest way to do this is to simply join the dataframes and then filter:
df = join(df1, df2, on=:id, kind=:left)
df = df[df[:, :dateStart] .<= df[:, :date] .<= df[:, :dateEnd], :]
While this works fine for small datasets, stock return data tends to be large. In my case, I am trying to merge a dataframe with 20 million rows to a dataframe with 80 thousand rows, which takes up huge amounts of RAM. So, my initial question would be is there an easier way to do this in Julia?
I tried the Query.jl package, but it seems to join first and filter second similar to my initial code.
Another option, as suggested by the Stackoverflow answer, is to use SQLite. So, using SQLite.jl, I tried to run the code. It works but for some reason the filtering is not correct. I tried to replicate this with simpler data, but could not, so I donโt know why it doesnโt work. I uploaded the CSV data here: data. The code I am using is:
df1 = CSV.read("df1.csv")
df2 = CSV.read("df2.csv")
db = SQLite.DB();
SQLite.load!(df1, db, "df1")
SQLite.load!(df2, db, "df2")
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = SQLite.Query(db, sqlcode) |> DataFrame
Each observation in df1 is a firm (permno) with a date range, which is 5 business days long. The other dataframe, df2, would include the observations (retDate). Joining the two should result in 50 observations, but results in 605. When doing a similar code in python I get the correct 50 results:
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = pd.read_sql_query(sqlcode, conn)
Any advice on what I am doing wrong or what Julia is doing differently?