Merge dataframes where one value is between two others

junder873 · February 16, 2020, 11:32pm

This question is very close to this Stackoverflow question. The primary issue is merging two dataframes with a restriction based on date ranges. This is a common issue in financial data, in my case, stock returns.

The easiest way to do this is to simply join the dataframes and then filter:

df = join(df1, df2, on=:id, kind=:left)
df = df[df[:, :dateStart] .<= df[:, :date] .<= df[:, :dateEnd], :]

While this works fine for small datasets, stock return data tends to be large. In my case, I am trying to merge a dataframe with 20 million rows to a dataframe with 80 thousand rows, which takes up huge amounts of RAM. So, my initial question would be is there an easier way to do this in Julia?

I tried the Query.jl package, but it seems to join first and filter second similar to my initial code.

Another option, as suggested by the Stackoverflow answer, is to use SQLite. So, using SQLite.jl, I tried to run the code. It works but for some reason the filtering is not correct. I tried to replicate this with simpler data, but could not, so I don’t know why it doesn’t work. I uploaded the CSV data here: data. The code I am using is:

df1 = CSV.read("df1.csv")
df2 = CSV.read("df2.csv")
db = SQLite.DB();
SQLite.load!(df1, db, "df1")
SQLite.load!(df2, db, "df2")
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = SQLite.Query(db, sqlcode) |> DataFrame

Each observation in df1 is a firm (permno) with a date range, which is 5 business days long. The other dataframe, df2, would include the observations (retDate). Joining the two should result in 50 observations, but results in 605. When doing a similar code in python I get the correct 50 results:

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)
sqlcode = """
select df1.permno, df1.dateStart, df1.dateEnd, df2.retDate
from df1
left join df2 on df1.permno=df2.permno
where df1.dateStart <= df2.retDate and df1.dateEnd >= df2.retDate"""
dfOut = pd.read_sql_query(sqlcode, conn)

Any advice on what I am doing wrong or what Julia is doing differently?

tk3369 · February 17, 2020, 6:17am

Not sure why sqlite doesn’t work properly. Are you sure doing it in Julia isn’t good enough? Here’s a naive solution:

julia> function myjoin(row, df2)
           res = filter(df2) do r
               r.permno == row.permno && 
               r.retDate >= row.dateStart &&
               r.retDate <= row.dateEnd
           end
           dfrow = DataFrame(row)
           left = reduce(vcat, (dfrow for i in eachindex(res[!,1])))
           right = res[!, 2:end]
           hcat(left, right)
       end

julia> reduce(vcat, (myjoin(r, df2) for r in eachrow(df1)))
50×4 DataFrame
│ Row │ permno  │ dateStart  │ dateEnd    │ retDate    │
│     │ Float64 │ Dates.Date │ Dates.Date │ Dates.Date │
├─────┼─────────┼────────────┼────────────┼────────────┤
│ 1   │ 10001.0 │ 2010-11-12 │ 2010-11-18 │ 2010-11-12 │
│ 2   │ 10001.0 │ 2010-11-12 │ 2010-11-18 │ 2010-11-15 │
│ 3   │ 10001.0 │ 2010-11-12 │ 2010-11-18 │ 2010-11-16 │
⋮
│ 48  │ 10001.0 │ 2016-05-13 │ 2016-05-19 │ 2016-05-17 │
│ 49  │ 10001.0 │ 2016-05-13 │ 2016-05-19 │ 2016-05-18 │
│ 50  │ 10001.0 │ 2016-05-13 │ 2016-05-19 │ 2016-05-19 │

As the join column isn’t indexed, if df2 is large then it would not be very performant. Perhaps try IndexedTables or create your own Dict lookup to speed things up.

Skoffer · February 17, 2020, 7:27am

Your query is not quite correct, with this where clause, left join is effectively inner join. It is possible that there is a bug in sqlite which process this case incorrectly, so it would be more correct to set inner join from the start.

bkamins · February 17, 2020, 8:57am

As a reference - soon (hopefully) DataFrames.jl will provide fast indexing via groupby, see Speed up key lookup in GroupedDataFrame · Issue #2095 · JuliaData/DataFrames.jl · GitHub.

junder873 · February 27, 2020, 1:37am

Thank you for this suggestion. I spent some time this past week since the suggested solution was relatively slow on really large datasets (the full dataset for df1 is ~80,000 rows, df2 is about ~20 million). I wanted to post my solution since others might have similar problems:

using IndexedTables: ndsparse, rows

function myJoin(df1::DataFrame, df2::DataFrame)
    t = ndsparse(
        (permno = df2[:, :permno], retDate = df2[:, :retDate]),
        (index = 1:size(df2, 1),),
    )
    ret = DataFrame(index1 = Int[], index2 = Int[])
    for i = 1:size(df1, 1)
        res = t[
            df1[i, :permno],
            collect(df1[i, :dateStart]:Day(1):df1[i, :dateEnd])
        ]
        for v in rows(res)
            push!(ret, (i, v.index))
        end
    end
    df1[!, :index1] = 1:size(df1, 1)
    df2[!, :index2] = 1:size(df2, 1)
    df1 = join(df1, ret, on=:index1, kind=:left)
    select!(df1, Not(:permno))
    df1 = join(df1, df2, on=:index2, kind=:left)
    select!(df1, Not([:index1, :index2]))
    return df1
end

Even on the large dataset this still runs in 12 or so seconds, which is easily fast enough for what I do.

Once again, thank you for your help.

Yifan_Liu · October 4, 2020, 2:18am

Is there a more elegant way to merge by range?

anon92994695 · October 4, 2020, 2:47am

I have a package that should do this by reading through N numbers of dataframes just once: https://github.com/caseykneale/LockandKeyLookups.jl

Not sure if it’s registered, and I don’t remember exactly how it works to be honest… Could use performance improvements. Think the idea is it creates an iterator where each iteration is a match between a “lock” stream with a number of chunks to match against via some arbitrary function… Maybe I shouldn’t be recommending this

clinton · November 22, 2020, 4:24am

Interesting discussion. I took a somewhat different approach than the solutions presented here. It’s a bit closer to an actual merge by range, and I’ve found it to be quite fast (thanks to the many performance enhancements made in DataFrames over the past few years). Please excuse the slightly different naming convention.

using DataFrames, Dates

function mergecrspcompkernel(;crsp=error("crsp is required"), comp=error("comp is required"))
  ###Merge one permno at a time, linking each crsp row to a compid (comp row)
  scrsps = groupby(crsp, :permno)
  scomps = groupby(comp, :permno)
  comp.compid = 1:size(comp,1) |> collect
  crsp.compid = Vector{Union{Int, Missing}}(undef, size(crsp,1))

  #faster to work within the rows for a particular permno
  Threads.@threads for permno ∈ keys(scomps)
    scomp = scomps[permno]
    crsppermnokey = permno |> Tuple
    (!haskey(scrsps,crsppermnokey)) && continue
    #because the key originates from comp, cant use it directly on crsp- instead need its value
    scrsp = scrsps[crsppermnokey]

    #the idea here is to select all of the crsp rows between the effective date
    #and the end date of the comp row
    for r ∈ eachrow(scomp)
      targetrows = view(scrsp, (r.effdate .≤ scrsp.date) .& (scrsp.date .≤ r.enddate), :compid)
      if length(targetrows) > 0
        #under no circumstances should this be assigned, since date and permno should be unique
        (targetrows .=== missing) |> all || error(
          "multiple crsp rows found for a particular compid!!")
        targetrows .= r.compid
      end
    end
  end

  crsp = crsp[crsp.compid .!== missing, :]

  #now merge the linked rows
  rename!(comp, :permno=>:lpermno)
  panel = innerjoin(crsp, comp, on=:compid)

  #sanity checks
  @assert size(panel,1) == size(crsp,1)
  @assert (crsp.permno .== panel.permno) |> all
  @assert (crsp.date .== panel.date) |> all

  return panel
end

Topic		Replies	Views
Join Dataframes on different IDs within date range New to Julia query , dataframes	1	495	October 4, 2020
Asof join support in DataFrames.jl Data dataframes	9	919	October 13, 2022
Filtering one DataFrame using parameters of another, with different lengths New to Julia dataframes	5	364	March 8, 2023
Asof merge in Julia General Usage dataframes , merge	12	1583	March 23, 2022
Conditional left join 2 dataframes when none of the columns are common General Usage dataframes , flexijoins	26	1994	May 26, 2022

Merge dataframes where one value is between two others

Related topics