Conditional left join 2 dataframes when none of the columns are common

df1 = DataFrame(id = [1,101,1001], b = [1,2,3])
df2 = DataFrame(id1 = [0, 95 ], id2 = [50, 150 ], c = [2,3])

I want to conditionally left join df1 and df2 such that id1 < id < id2 and all the rows in df1 appear in the final dataframe. I could not find a way to do this in dataframes. Neither could I figure out how to use FlexiJoins to achieve this. R had a package, I believe funneljoin which made it extremely easy to achieve joins like these.

FlexiJoins.jl supports many table types, but not all. Specifically, I think it doesn’t work with DataFrames.jl.
Joining by the condition you need is pretty easy and efficient with FlexiJoins:

julia> using Tables, FlexiJoins, IntervalSets

julia> tbl1 = (id = [1,101,1001], b = [1,2,3]) |> rowtable
julia> tbl2 = (id1 = [0, 95 ], id2 = [50, 150 ], c = [2,3]) |> rowtable

julia> leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))

Here, I use rowtable (vector-of-namedtuples) as the input table type for simplicity.

1 Like

You can do it relatively easily:

julia> @rsubset(crossjoin(df1, df2), :id1 < :id < :id2)
2Γ—5 DataFrame
 Row β”‚ id     b      id1    id2    c
     β”‚ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 β”‚     1      1      0     50      2
   2 β”‚   101      2     95    150      3

(I am additionally using DataFramesMeta.jl for convenience)

The only problem is that it would not be efficient. However, if your data frames are small (in a sense that product of their numbers of rows is not huge) this should work.

2 Likes

Can u explain what r.id1…r.id2 means. The 2 dots seem unnecessarily confusing. I would have thought it would be something like this:

by_pred(:id1, < , :id) & by_pred(:id2, >, :id)

Edit: I think I get it now. this is from the IntervalSets package. Thanks a lot for your help.

my dataframe has millions of rows and have to do this operation multiple times. This might be too slow.

Yes, the dots create an interval using the IntervalSets package. Combining predicates < and > as you suggest is potentially possible as well, but not efficiently implemented for now. It only works with the β€œnested loops” mode that iterates over each pair of rows and isn’t efficient for large tables:

julia> leftjoin((tbl1, tbl2), by_pred(:id, > , :id1) & by_pred(:id, <, :id2), mode=FlexiJoins.Mode.NestedLoop())

Sorry I tried the intervalsets version and it failed. I got an error:

the combining predicates < and > did not work either.

Maybe, you are using an older version? Try updating, the latest is 0.1.3.

That helped. But the output I get is a StructVector. I started with 2 dataframes. I converted them to rowtable to be able to use flexijoins and now the output is a StructVector./StructArray Is there a way to have the output be converted back to a dataframe?

By design, FlexiJoins never merges objects from the first and the second datasets into a single one. This makes it always clear which column comes from which side for tables. For non-tabular datasets, merging can be impossible, if elements are some custom structs.

Any merging like that should be done afterwards, if needed. This is explained in docs-examples:


It’s possible that there is a more direct way to get a DataFrame, given that it also uses column-based storage. But I don’t use DataFrames and not familiar with them, so cannot help here.

1 Like

Thanks a lot for the quick responses. Very helpful. Learnt a lot.

I have tried several ways of producing the desired result. Performance is different.
I would like someone (@aplavin ?) to explain to me where / how / why flexijoins’ leftjoin does better than the jf() function.

julia> using Tables, FlexiJoins, IntervalSets

julia> v11=rand(1:100, 10^3);

julia> v12=rand(1:2, 10^3);

julia> v21=rand(1:100, 10^4);

julia> v22=rand(101:200, 10^4);

julia> v23=rand(2:3,10^4);

julia> tbl1=rowtable((id=v11,b=v12));

julia> tbl2=rowtable((id1=v21,id2=v22,c=v23));

julia> using DataFrames,DataFramesMeta

julia> df1=DataFrame(id=v11,b=v12);

julia> df2=DataFrame(id1=v21,id2=v22,c=v23);


########  benchmarks##############

julia> using BenchmarkTools

julia> @btime  @rsubset(crossjoin(df1, df2), :id1 < :id < :id2);
  218.115 ms (590 allocations: 622.76 MiB)

julia> using StatsBase

julia> @btime begin
       j=pairwise((x,y)->ifelse( y.id1<x.id<y.id2, merge(x,y), nothing), tbl1,tbl2)
       [r for r in j if !isnothing(r)] |> rowtable
       end
670.530 ms (37 allocations: 949.31 MiB)

julia> @btime [(x,y) for (x,y) in Iterators.product(tbl1,tbl2) if y.id1<=x.id<=y.id2];
  218.760 ms (22 allocations: 366.20 MiB)

julia> @btime [merge(x,y) for (x,y) in Iterators.product(tbl1,tbl2) if y.id1<=x.id<=y.id2] |> DataFrame;
  285.113 ms (56 allocations: 562.11 MiB)

julia> function jf(tbl1,tbl2)
           df=NamedTuple[]
           for x in tbl1
               X=x.id
               for y in tbl2
                   if y.id1<=X<=y.id2
                   push!(df,merge(x,y))
                   end
               end
           end
           return df
       end
jf (generic function with 1 method)

julia> @btime jf(tbl1,tbl2);
  146.250 ms (5135583 allocations: 308.80 MiB)

julia> @btime lj=leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))
  66.516 ms (245064 allocations: 166.71 MiB)

if I walk the same steps without PUSH the time becomes very short.
So all the time seems to be due to push operations …

julia> function jft(tbl1,tbl2)
           df=Tuple[]
           for x in tbl1
               X=x.id
               for y in tbl2
                    if y.id1<=X<=y.id2
                       #push!(df,(x,y))
                    end
               end
           end
           return df
       end
jft (generic function with 1 method)

julia> @btime jft(tbl1,tbl2);
  28.141 ns (1 allocation: 48 bytes)

another way that seems more efficient would be to obtain the indexes of the rows of tbl2 corresponding to each row of tbl1, without having to allocate the complete rows.

julia> function jft(tbl1,tbl2)
           v=Vector{Vector{Int64}}()
           for (i,x) in enumerate(tbl1)
               X=x.id
               push!(v,[])
               for (j,y) in enumerate(tbl2)
                   if X ∈ y.id1..y.id2
                       push!(v[i],j)
                   end
               end
           end
           return view.([tbl2],v)
       end
jft (generic function with 1 method)

julia> @btime jft(tbl1,tbl2);
  65.533 ms (8419 allocations: 108.75 MiB)
1 Like

First, these are infeasible timings for iterating over thounsands of elements:

They mean that the function almost gets optimized away by the compiler, because it determines that the loops don’t affect the result.

In your example, FlexiJoins’ join is faster than naive approaches, but not by much. The difference is relatively small because the join results contain almost all tbl1-tbl2 pairs, specifically about a half of such pairs.
A more typical case is when the number of rows in the result is comparable to the number of rows in one of the input tables. That is, about 0.1-0.01% of the total number of pairs given the table lengths of 10^3 - 10^4. If you set the inputs so that the number of results is much smaller than the number of pairs, FlexiJoins performance should be significantly better.

FlexiJoins is faster because it doesn’t loop over all pairs, filtering them. It uses sorting and binary search instead in this particular case.

2 Likes

Grazie per le spiegazioni!

Let’s see if I’ve learned anything.
In general, since the join involves a lot of searches on the tables, it is better to sort the tables once and
for all and do the many searches using the sorting.
In this case I imagine to sort tbl1 by increasing ID and tbl2 by increasing ID1 and decreasing ID2 (how do you do it?), obtaining the following situation.

At this point the problem becomes to look for the green intervals.
Are standard techniques known for doing this research or is it a specific case that should be treated as such?

Yes, it is a standard thing to do such a join on a sorted table (or if it is not sorted to do sortperm in the algorithm).

Currently, FlexiJoins sorts only one dataset - that with the id column in this case. Then it goes through all the id1-id2 intervals as they are, and finds the range of sorted id values that fall into each interval. This uses binary search.

It is potentially possible to sort the other dataset as well and through both at once in sorted order. This would definitely be faster for a simple < or > predicate, not sure about interval containment. The current approach is easier to make general though, which makes it possible for FlexiJoins to implement many different join conditions and modes with a pretty small code size.

2 Likes

I had thought of this strategy (divide et impera).
Group the two tables: one by ID and the other by (ID1, ID2).
Then do the flexijoin on the keys.

I have simulated this in the dataframe environment that I know a little better.

using DataFrames,DataFramesMeta
v11=rand(5:120, 10^3);
v12=rand(1:10, 10^3);
v21=rand(1:5:100, 10^4);
v22= map(e21->e21 + rand(1:3:11), v21)
v23=rand(2:10,10^4);

df1=unique(DataFrame(id=v11,b=v12));
df2=unique(DataFrame(id1=v21,id2=v22,c=v23));
gdf2= groupby(df2,[:id1,:id2],sort=true)

gdf1=groupby(df1,:id)

dfk1=DataFrame(k1=first.(keys(gdf1)))
dfk2=DataFrame(k2=Tuple.(keys(gdf2)))

jdf=subset(crossjoin(dfk1, dfk2), [:k1,:k2]=>ByRow((x,y)-> x ∈ y[1]..y[2]))

tr=transform(jdf, [:k1,:k2]=>ByRow((x,y)->(X=DataFrame(gdf1[(x,)]), Y=DataFrame(gdf[y])))=>AsTable)

select(tr, [:X,:Y]=>ByRow(crossjoin))

1 Like

Yes, potentially, grouping can help when there are few unique values compared to the total number of elements. Not sure how large performance gains would be relative to an optimally implemented sorting approach. Also, there would be extra overhead when all values are unique - common for floating point measurements.

I tried using the code with left end closed interval and the right end open interval and could not get it to work. error thrown by flexijoins.jl

# this was changed from: 
julia> leftjoin((tbl1, tbl2), by_pred(:id, ∈, r -> r.id1..r.id2))

#changed to:
julia> leftjoin((tb1, tbl2), by_pred(:id, ∈, r -> Interval{:closed, :open}(r.id1, r.id2)))

Error I get is thrown by flexijoins.jl package: