Merge dataframes where one value is between two others

This question is very close to this Stackoverflow question. The primary issue is merging two dataframes with a restriction based on date ranges. This is a common issue in financial data, in my case, stock returns.

The easiest way to do this is to simply join the dataframes and then filter:

df = join(df1, df2, on=:id, kind=:left)
df = df[df[:, :dateStart] .<= df[:, :date] .<= df[:, :dateEnd], :]

While this works fine for small datasets, stock return data tends to be large. In my case, I am trying to merge a dataframe with 20 million rows to a dataframe with 80 thousand rows, which takes up huge amounts of RAM. So, my initial question would be is there an easier way to do this in Julia?

I tried the Query.jl package, but it seems to join first and filter second similar to my initial code.

Another option, as suggested by the Stackoverflow answer, is to use SQLite. So, using SQLite.jl, I tried to run the code. It works but for some reason the filtering is not correct. I tried to replicate this with simpler data, but could not, so I donโ€™t know why it doesnโ€™t work. I uploaded the CSV data here: data. The code I am using is:

df1 = CSV.read("df1.csv")
df2 = CSV.read("df2.csv")
db = SQLite.DB();
SQLite.load!(df1, db, "df1")
SQLite.load!(df2, db, "df2")
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = SQLite.Query(db, sqlcode) |> DataFrame

Each observation in df1 is a firm (permno) with a date range, which is 5 business days long. The other dataframe, df2, would include the observations (retDate). Joining the two should result in 50 observations, but results in 605. When doing a similar code in python I get the correct 50 results:

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = pd.read_sql_query(sqlcode, conn)

Any advice on what I am doing wrong or what Julia is doing differently?

1 Like

Not sure why sqlite doesnโ€™t work properly. Are you sure doing it in Julia isnโ€™t good enough? Hereโ€™s a naive solution:

julia> function myjoin(row, df2)
           res = filter(df2) do r
               r.permno == row.permno && 
               r.retDate >= row.dateStart &&
               r.retDate <= row.dateEnd
           end
           dfrow = DataFrame(row)
           left = reduce(vcat, (dfrow for i in eachindex(res[!,1])))
           right = res[!, 2:end]
           hcat(left, right)
       end

julia> reduce(vcat, (myjoin(r, df2) for r in eachrow(df1)))
50ร—4 DataFrame
โ”‚ Row โ”‚ permno  โ”‚ dateStart  โ”‚ dateEnd    โ”‚ retDate    โ”‚
โ”‚     โ”‚ Float64 โ”‚ Dates.Date โ”‚ Dates.Date โ”‚ Dates.Date โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1   โ”‚ 10001.0 โ”‚ 2010-11-12 โ”‚ 2010-11-18 โ”‚ 2010-11-12 โ”‚
โ”‚ 2   โ”‚ 10001.0 โ”‚ 2010-11-12 โ”‚ 2010-11-18 โ”‚ 2010-11-15 โ”‚
โ”‚ 3   โ”‚ 10001.0 โ”‚ 2010-11-12 โ”‚ 2010-11-18 โ”‚ 2010-11-16 โ”‚
โ‹ฎ
โ”‚ 48  โ”‚ 10001.0 โ”‚ 2016-05-13 โ”‚ 2016-05-19 โ”‚ 2016-05-17 โ”‚
โ”‚ 49  โ”‚ 10001.0 โ”‚ 2016-05-13 โ”‚ 2016-05-19 โ”‚ 2016-05-18 โ”‚
โ”‚ 50  โ”‚ 10001.0 โ”‚ 2016-05-13 โ”‚ 2016-05-19 โ”‚ 2016-05-19 โ”‚

As the join column isnโ€™t indexed, if df2 is large then it would not be very performant. Perhaps try IndexedTables or create your own Dict lookup to speed things up.

Your query is not quite correct, with this where clause, left join is effectively inner join. It is possible that there is a bug in sqlite which process this case incorrectly, so it would be more correct to set inner join from the start.

As a reference - soon (hopefully) DataFrames.jl will provide fast indexing via groupby, see https://github.com/JuliaData/DataFrames.jl/issues/2095.

3 Likes

Thank you for this suggestion. I spent some time this past week since the suggested solution was relatively slow on really large datasets (the full dataset for df1 is ~80,000 rows, df2 is about ~20 million). I wanted to post my solution since others might have similar problems:

using IndexedTables: ndsparse, rows

function myJoin(df1::DataFrame, df2::DataFrame)
    t = ndsparse(
        (permno = df2[:, :permno], retDate = df2[:, :retDate]),
        (index = 1:size(df2, 1),),
    )
    ret = DataFrame(index1 = Int[], index2 = Int[])
    for i = 1:size(df1, 1)
        res = t[
            df1[i, :permno],
            collect(df1[i, :dateStart]:Day(1):df1[i, :dateEnd])
        ]
        for v in rows(res)
            push!(ret, (i, v.index))
        end
    end
    df1[!, :index1] = 1:size(df1, 1)
    df2[!, :index2] = 1:size(df2, 1)
    df1 = join(df1, ret, on=:index1, kind=:left)
    select!(df1, Not(:permno))
    df1 = join(df1, df2, on=:index2, kind=:left)
    select!(df1, Not([:index1, :index2]))
    return df1
end

Even on the large dataset this still runs in 12 or so seconds, which is easily fast enough for what I do.

Once again, thank you for your help.

4 Likes

Is there a more elegant way to merge by range?

I have a package that should do this by reading through N numbers of dataframes just once: https://github.com/caseykneale/LockandKeyLookups.jl

Not sure if itโ€™s registered, and I donโ€™t remember exactly how it works to be honestโ€ฆ Could use performance improvements. Think the idea is it creates an iterator where each iteration is a match between a โ€œlockโ€ stream with a number of chunks to match against via some arbitrary functionโ€ฆ Maybe I shouldnโ€™t be recommending this

1 Like

Interesting discussion. I took a somewhat different approach than the solutions presented here. Itโ€™s a bit closer to an actual merge by range, and Iโ€™ve found it to be quite fast (thanks to the many performance enhancements made in DataFrames over the past few years). Please excuse the slightly different naming convention.

using DataFrames, Dates

function mergecrspcompkernel(;crsp=error("crsp is required"), comp=error("comp is required"))
  ###Merge one permno at a time, linking each crsp row to a compid (comp row)
  scrsps = groupby(crsp, :permno)
  scomps = groupby(comp, :permno)
  comp.compid = 1:size(comp,1) |> collect
  crsp.compid = Vector{Union{Int, Missing}}(undef, size(crsp,1))

  #faster to work within the rows for a particular permno
  Threads.@threads for permno โˆˆ keys(scomps)
    scomp = scomps[permno]
    crsppermnokey = permno |> Tuple
    (!haskey(scrsps,crsppermnokey)) && continue
    #because the key originates from comp, cant use it directly on crsp- instead need its value
    scrsp = scrsps[crsppermnokey]

    #the idea here is to select all of the crsp rows between the effective date
    #and the end date of the comp row
    for r โˆˆ eachrow(scomp)
      targetrows = view(scrsp, (r.effdate .โ‰ค scrsp.date) .& (scrsp.date .โ‰ค r.enddate), :compid)
      if length(targetrows) > 0
        #under no circumstances should this be assigned, since date and permno should be unique
        (targetrows .=== missing) |> all || error(
          "multiple crsp rows found for a particular compid!!")
        targetrows .= r.compid
      end
    end
  end

  crsp = crsp[crsp.compid .!== missing, :]

  #now merge the linked rows
  rename!(comp, :permno=>:lpermno)
  panel = innerjoin(crsp, comp, on=:compid)

  #sanity checks
  @assert size(panel,1) == size(crsp,1)
  @assert (crsp.permno .== panel.permno) |> all
  @assert (crsp.date .== panel.date) |> all

  return panel
end