Conditional left join 2 dataframes when none of the columns are common

vtomar · May 21, 2022, 12:05am

df1 = DataFrame(id = [1,101,1001], b = [1,2,3])
df2 = DataFrame(id1 = [0, 95 ], id2 = [50, 150 ], c = [2,3])

I want to conditionally left join df1 and df2 such that id1 < id < id2 and all the rows in df1 appear in the final dataframe. I could not find a way to do this in dataframes. Neither could I figure out how to use FlexiJoins to achieve this. R had a package, I believe funneljoin which made it extremely easy to achieve joins like these.

aplavin · May 21, 2022, 7:36am

FlexiJoins.jl supports many table types, but not all. Specifically, I think it doesn’t work with DataFrames.jl.
Joining by the condition you need is pretty easy and efficient with FlexiJoins:

julia> using Tables, FlexiJoins, IntervalSets

julia> tbl1 = (id = [1,101,1001], b = [1,2,3]) |> rowtable
julia> tbl2 = (id1 = [0, 95 ], id2 = [50, 150 ], c = [2,3]) |> rowtable

julia> leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))

Here, I use rowtable (vector-of-namedtuples) as the input table type for simplicity.

bkamins · May 21, 2022, 8:53am

You can do it relatively easily:

julia> @rsubset(crossjoin(df1, df2), :id1 < :id < :id2)
2×5 DataFrame
 Row │ id     b      id1    id2    c
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     1      1      0     50      2
   2 │   101      2     95    150      3

(I am additionally using DataFramesMeta.jl for convenience)

The only problem is that it would not be efficient. However, if your data frames are small (in a sense that product of their numbers of rows is not huge) this should work.

vtomar · May 21, 2022, 11:34pm

Can u explain what r.id1…r.id2 means. The 2 dots seem unnecessarily confusing. I would have thought it would be something like this:

by_pred(:id1, < , :id) & by_pred(:id2, >, :id)

Edit: I think I get it now. this is from the IntervalSets package. Thanks a lot for your help.

vtomar · May 21, 2022, 11:37pm

my dataframe has millions of rows and have to do this operation multiple times. This might be too slow.

aplavin · May 22, 2022, 3:52am

Yes, the dots create an interval using the IntervalSets package. Combining predicates < and > as you suggest is potentially possible as well, but not efficiently implemented for now. It only works with the “nested loops” mode that iterates over each pair of rows and isn’t efficient for large tables:

julia> leftjoin((tbl1, tbl2), by_pred(:id, > , :id1) & by_pred(:id, <, :id2), mode=FlexiJoins.Mode.NestedLoop())

vtomar · May 22, 2022, 3:55am

Sorry I tried the intervalsets version and it failed. I got an error:

vtomar · May 22, 2022, 4:00am

the combining predicates < and > did not work either.

aplavin · May 22, 2022, 4:00am

Maybe, you are using an older version? Try updating, the latest is 0.1.3.

vtomar · May 22, 2022, 4:24am

That helped. But the output I get is a StructVector. I started with 2 dataframes. I converted them to rowtable to be able to use flexijoins and now the output is a StructVector./StructArray Is there a way to have the output be converted back to a dataframe?

aplavin · May 22, 2022, 4:41am

By design, FlexiJoins never merges objects from the first and the second datasets into a single one. This makes it always clear which column comes from which side for tables. For non-tabular datasets, merging can be impossible, if elements are some custom structs.

Any merging like that should be done afterwards, if needed. This is explained in docs-examples:

It’s possible that there is a more direct way to get a DataFrame, given that it also uses column-based storage. But I don’t use DataFrames and not familiar with them, so cannot help here.

vtomar · May 22, 2022, 4:45am

Thanks a lot for the quick responses. Very helpful. Learnt a lot.

rocco_sprmnt21 · May 22, 2022, 5:31pm

I have tried several ways of producing the desired result. Performance is different.
I would like someone (@aplavin ?) to explain to me where / how / why flexijoins’ leftjoin does better than the jf() function.

julia> using Tables, FlexiJoins, IntervalSets

julia> v11=rand(1:100, 10^3);

julia> v12=rand(1:2, 10^3);

julia> v21=rand(1:100, 10^4);

julia> v22=rand(101:200, 10^4);

julia> v23=rand(2:3,10^4);

julia> tbl1=rowtable((id=v11,b=v12));

julia> tbl2=rowtable((id1=v21,id2=v22,c=v23));

julia> using DataFrames,DataFramesMeta

julia> df1=DataFrame(id=v11,b=v12);

julia> df2=DataFrame(id1=v21,id2=v22,c=v23);


########  benchmarks##############

julia> using BenchmarkTools

julia> @btime  @rsubset(crossjoin(df1, df2), :id1 < :id < :id2);
  218.115 ms (590 allocations: 622.76 MiB)

julia> using StatsBase

julia> @btime begin
       j=pairwise((x,y)->ifelse( y.id1<x.id<y.id2, merge(x,y), nothing), tbl1,tbl2)
       [r for r in j if !isnothing(r)] |> rowtable
       end
670.530 ms (37 allocations: 949.31 MiB)

julia> @btime [(x,y) for (x,y) in Iterators.product(tbl1,tbl2) if y.id1<=x.id<=y.id2];
  218.760 ms (22 allocations: 366.20 MiB)

julia> @btime [merge(x,y) for (x,y) in Iterators.product(tbl1,tbl2) if y.id1<=x.id<=y.id2] |> DataFrame;
  285.113 ms (56 allocations: 562.11 MiB)

julia> function jf(tbl1,tbl2)
           df=NamedTuple[]
           for x in tbl1
               X=x.id
               for y in tbl2
                   if y.id1<=X<=y.id2
                   push!(df,merge(x,y))
                   end
               end
           end
           return df
       end
jf (generic function with 1 method)

julia> @btime jf(tbl1,tbl2);
  146.250 ms (5135583 allocations: 308.80 MiB)

julia> @btime lj=leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))
  66.516 ms (245064 allocations: 166.71 MiB)

if I walk the same steps without PUSH the time becomes very short.
So all the time seems to be due to push operations …

julia> function jft(tbl1,tbl2)
           df=Tuple[]
           for x in tbl1
               X=x.id
               for y in tbl2
                    if y.id1<=X<=y.id2
                       #push!(df,(x,y))
                    end
               end
           end
           return df
       end
jft (generic function with 1 method)

julia> @btime jft(tbl1,tbl2);
  28.141 ns (1 allocation: 48 bytes)

another way that seems more efficient would be to obtain the indexes of the rows of tbl2 corresponding to each row of tbl1, without having to allocate the complete rows.

julia> function jft(tbl1,tbl2)
           v=Vector{Vector{Int64}}()
           for (i,x) in enumerate(tbl1)
               X=x.id
               push!(v,[])
               for (j,y) in enumerate(tbl2)
                   if X ∈ y.id1..y.id2
                       push!(v[i],j)
                   end
               end
           end
           return view.([tbl2],v)
       end
jft (generic function with 1 method)

julia> @btime jft(tbl1,tbl2);
  65.533 ms (8419 allocations: 108.75 MiB)

aplavin · May 22, 2022, 6:57pm

First, these are infeasible timings for iterating over thounsands of elements:

They mean that the function almost gets optimized away by the compiler, because it determines that the loops don’t affect the result.

In your example, FlexiJoins’ join is faster than naive approaches, but not by much. The difference is relatively small because the join results contain almost all tbl1-tbl2 pairs, specifically about a half of such pairs.
A more typical case is when the number of rows in the result is comparable to the number of rows in one of the input tables. That is, about 0.1-0.01% of the total number of pairs given the table lengths of 10^3 - 10^4. If you set the inputs so that the number of results is much smaller than the number of pairs, FlexiJoins performance should be significantly better.

FlexiJoins is faster because it doesn’t loop over all pairs, filtering them. It uses sorting and binary search instead in this particular case.

rocco_sprmnt21 · May 23, 2022, 7:22am

Grazie per le spiegazioni!

Let’s see if I’ve learned anything.
In general, since the join involves a lot of searches on the tables, it is better to sort the tables once and
for all and do the many searches using the sorting.
In this case I imagine to sort tbl1 by increasing ID and tbl2 by increasing ID1 and decreasing ID2 (how do you do it?), obtaining the following situation.

At this point the problem becomes to look for the green intervals.
Are standard techniques known for doing this research or is it a specific case that should be treated as such?

bkamins · May 23, 2022, 7:35am

Yes, it is a standard thing to do such a join on a sorted table (or if it is not sorted to do sortperm in the algorithm).

aplavin · May 23, 2022, 5:12pm

Currently, FlexiJoins sorts only one dataset - that with the id column in this case. Then it goes through all the id1-id2 intervals as they are, and finds the range of sorted id values that fall into each interval. This uses binary search.

It is potentially possible to sort the other dataset as well and through both at once in sorted order. This would definitely be faster for a simple < or > predicate, not sure about interval containment. The current approach is easier to make general though, which makes it possible for FlexiJoins to implement many different join conditions and modes with a pretty small code size.

rocco_sprmnt21 · May 23, 2022, 6:46pm

I had thought of this strategy (divide et impera).
Group the two tables: one by ID and the other by (ID1, ID2).
Then do the flexijoin on the keys.

I have simulated this in the dataframe environment that I know a little better.

using DataFrames,DataFramesMeta
v11=rand(5:120, 10^3);
v12=rand(1:10, 10^3);
v21=rand(1:5:100, 10^4);
v22= map(e21->e21 + rand(1:3:11), v21)
v23=rand(2:10,10^4);

df1=unique(DataFrame(id=v11,b=v12));
df2=unique(DataFrame(id1=v21,id2=v22,c=v23));
gdf2= groupby(df2,[:id1,:id2],sort=true)

gdf1=groupby(df1,:id)

dfk1=DataFrame(k1=first.(keys(gdf1)))
dfk2=DataFrame(k2=Tuple.(keys(gdf2)))

jdf=subset(crossjoin(dfk1, dfk2), [:k1,:k2]=>ByRow((x,y)-> x ∈ y[1]..y[2]))

tr=transform(jdf, [:k1,:k2]=>ByRow((x,y)->(X=DataFrame(gdf1[(x,)]), Y=DataFrame(gdf[y])))=>AsTable)

select(tr, [:X,:Y]=>ByRow(crossjoin))

aplavin · May 23, 2022, 7:34pm

Yes, potentially, grouping can help when there are few unique values compared to the total number of elements. Not sure how large performance gains would be relative to an optimally implemented sorting approach. Also, there would be extra overhead when all values are unique - common for floating point measurements.

vtomar · May 24, 2022, 4:45pm

I tried using the code with left end closed interval and the right end open interval and could not get it to work. error thrown by flexijoins.jl

# this was changed from: 
julia> leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))

#changed to:
julia> leftjoin((tb1, tbl2), by_pred(:id, ∈, r -> Interval{:closed, :open}(r.id1, r.id2)))

Error I get is thrown by flexijoins.jl package:

Topic		Replies	Views
[ANN] FlexiJoins.jl: fresh take on joining datasets Package Announcements	32	3183	April 3, 2025
[ANN] DataFrameIntervals.jl — joins on intervals of time Package Announcements dataframes	40	2286	October 31, 2022
Asof join support in DataFrames.jl Data dataframes	9	919	October 13, 2022
Spatial join with dataframes General Usage dataframes , geo	27	2081	September 6, 2024
Filtering one DataFrame using parameters of another, with different lengths New to Julia dataframes	5	364	March 8, 2023

Conditional left join 2 dataframes when none of the columns are common

Related topics