Arbitrary table join conditions

derekmahar · August 12, 2020, 8:50pm

What Julia tabular data types and functions might I use in order to join two tables using a join condition that is more complex than simple equality between sets of columns in each table (an equi-join)? As far as I can tell, none of the join functions in DataFrames.jl, for example, can accept an arbitrary join predicate. Do any other Julia tabular types have join functions that accept a join predicate?

As a specific example, I’d like to translate the following SQL schema and query to Julia.

SQL Fiddle

PostgreSQL 9.6 Schema Setup:

CREATE TABLE exchange_rate(time INT, rate FLOAT);

INSERT INTO exchange_rate VALUES
(1, 1.0)
,(4, 1.1)
,(5, 1.5)
,(7, 2.0)
,(9, 1.8)
,(10, 1.7)
,(11, 1.9)
;

CREATE TABLE trade(time int);

INSERT INTO trade VALUES
(1)
,(2)
,(3)
,(4)
,(5)
,(6)
,(7)
,(8)
,(9)
,(10)
,(11)
,(12)
;

Query 1:

SELECT
  t.time,
  e.rate
FROM trade t
JOIN exchange_rate e
  ON e.time = (SELECT MAX(time) FROM exchange_rate WHERE time <= t.time)

Results:

time	rate
1	1
2	1
3	1
4	1.1
5	1.5
6	1.5
7	2
8	2
9	1.8
10	1.7
11	1.9
12	1.9

I described this scenario in more detail in issue Feature request: Complex join conditions that I raised for the Miller CSV processing tool. Instead of a nested sub-query like the one that I used in the example query above, the SQL query that I showed in the Miller issue uses a Common Table Expression and a window function.

(As an aside, this is an example of a Gaps and Islands problem because there are gaps between consecutive time sequences in the exchange_rate table while the times in the trade table are continuous, or gap-free.)

Tamas_Papp · August 13, 2020, 7:20am

Joining is tricky to optimize already for large tables. What you are proposing may be implemented efficiently for a limited DSL, but I don’t see how it could be made fast for generic Julia code.

You can either

implement this directly with loops,
when applicable, calculate interim results and join on those.

derekmahar · August 14, 2020, 1:50pm

Does there not exist a join method that accepts a predicate primarily because it can’t be executed efficiently for all or most cases?

Joins in SQL are not efficient for all cases, either, but it still allows arbitrarily complex join conditions. The example query that I showed would be efficient for a large exchange_rate table only if an index existed on column time or if it were sorted on time and the subquery found the maximum using a binary search. Without these preconditions, the query would be terribly slow for large numbers of exchange rates.

Can a Julia DataFrame have an index on one or more columns? Can it be sorted by columns so that searches or aggregate functions like maximum are more efficient?

Do there exist any other Julia packages of tabular types that accomodate fast search or aggregate function computation?

bkamins · August 14, 2020, 1:56pm

You might want to check out SplitApplyCombine.jl which accepts the following:

  innerjoin(lkey, rkey, f, comparison, left, right)

  Performs a relational-style join operation between iterables left and right, returning a collection of elements f(l, r) for which
  comparison(lkey(l), rkey(r)) is true where l ∈ left, r ∈ right.

johnmyleswhite · August 14, 2020, 1:57pm

The world of interactive analytics has broadly not focused on anything but equi-joins. It’s a very big gap in functionality, but it’s the same gap you’ll find in the popular R or Python libraries. It would be awesome to see Julia innovate in this space.

[Edit: Looks like I was missing the innerjoin feature described above.]

derekmahar · August 14, 2020, 2:02pm

This surprises me because SQL has had “non-equi-joins” for years. Even before SQL introduced the JOIN...ON clause, one could specify join conditions in the WHERE clause.

derekmahar · August 14, 2020, 2:12pm

Does SplitApplyCombine.jl work directly with DataFrames or would the idea be to pre-process the data using those functions and then insert it into a DataFrame?

bkamins · August 14, 2020, 2:19pm

For now you have to preprocess in SplitApplyCombine.jl and then get indices to filter DataFrame on. Hopefully soon we will change the joing backend in DataFrames.jl - see https://github.com/JuliaData/DataFrames.jl/issues/2340. Actually the major motivator was speed (and now we want to improve SplitApplyCombine.jl speed more), but this would be a nice side benefit.

derekmahar · August 14, 2020, 2:46pm

This is an interesting issue that I’ll follow more closely. I’m still new to Julia and am feeling my way around the available data packages. I’ve learned a lot about Dataframes from your various tutorials on YouTube.

I’ve come to Julia from Java and SQL and over the past several months have explored interesting, but obscure tools like Miller and csvq, and the highly popular SQLite. On this journey, I somehow completely overlooked Python and pandas and R, probably because I took a detour into Haskell and have a strong bias towards statically typed programming languages.

derekmahar · August 16, 2020, 12:47am

Here’s my deliberately verbose solution using DataFrames.jl and SplitApplyCombine.jl that produces the same result as my original SQL query:

julia> using DataFrames

julia> using SplitApplyCombine: innerjoin

julia> exchange_rates = DataFrame(time = [1,4,5,7,9,10,11], rate = [1.0,1.1,1.5,2.0,1.8,1.7,1.9])
7×2 DataFrame
│ Row │ time  │ rate    │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1.0     │
│ 2   │ 4     │ 1.1     │
│ 3   │ 5     │ 1.5     │
│ 4   │ 7     │ 2.0     │
│ 5   │ 9     │ 1.8     │
│ 6   │ 10    │ 1.7     │
│ 7   │ 11    │ 1.9     │

julia> trades = DataFrame(time = 1:12)
12×1 DataFrame
│ Row │ time  │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │
│ 4   │ 4     │
│ 5   │ 5     │
│ 6   │ 6     │
│ 7   │ 7     │
│ 8   │ 8     │
│ 9   │ 9     │
│ 10  │ 10    │
│ 11  │ 11    │
│ 12  │ 12    │

julia> trade_time(trade::DataFrameRow)::Int64 = trade.time
trade_time (generic function with 1 method)

julia> exchange_rate_time(exchange_rate::DataFrameRow)::Int64 = exchange_rate.time
exchange_rate_time (generic function with 1 method)
julia> trade_and_exchange_rate(trade::DataFrameRow, exchange_rate::DataFrameRow)::NamedTuple{(:trade, :rate),Tuple{Int64,Float64}} = (trade = trade.time, rate = exchange_rate.rate)
trade_and_exchange_rate (generic function with 1 method)
julia> exchange_rate_time_search(exchange_rate_times::Array{Int64,1}, trade_time::Int64)::Int64 = maximum(filter(time -> time <= trade_time, exchange_rate_times))
exchange_rate_time_search (generic function with 1 method)
julia> trade_exchange_rate_join_condition(trade_time::Int64, exchange_rate_time::Int64)::Bool = exchange_rate_time == exchange_rate_time_search(exchange_rates.time, trade_time)
trade_exchange_rate_join_condition (generic function with 1 method)

julia> trades_and_exchange_rates = DataFrame(
         innerjoin(
           trade_time,
           exchange_rate_time,
           trade_and_exchange_rate,
           trade_exchange_rate_join_condition,
           eachrow(trades),
           eachrow(exchange_rates)
         )
       )
12×2 DataFrame
│ Row │ trade │ rate    │
│     │ Int64 │ Float64 │
├─────┼───────┼─────────┤
│ 1   │ 1     │ 1.0     │
│ 2   │ 2     │ 1.0     │
│ 3   │ 3     │ 1.0     │
│ 4   │ 4     │ 1.1     │
│ 5   │ 5     │ 1.5     │
│ 6   │ 6     │ 1.5     │
│ 7   │ 7     │ 2.0     │
│ 8   │ 8     │ 2.0     │
│ 9   │ 9     │ 1.8     │
│ 10  │ 10    │ 1.7     │
│ 11  │ 11    │ 1.9     │
│ 12  │ 12    │ 1.9     │

Just the source code:

using DataFrames
using SplitApplyCombine: innerjoin

exchange_rates = DataFrame(time = [1,4,5,7,9,10,11], rate = [1.0,1.1,1.5,2.0,1.8,1.7,1.9])
trades = DataFrame(time = 1:12)

trade_time(trade::DataFrameRow)::Int64 = trade.time

exchange_rate_time(exchange_rate::DataFrameRow)::Int64 = exchange_rate.time

trade_and_exchange_rate(trade::DataFrameRow, exchange_rate::DataFrameRow)::NamedTuple{(:trade, :rate),Tuple{Int64,Float64}} = (trade = trade.time, rate = exchange_rate.rate)

exchange_rate_time_search(exchange_rate_times::Array{Int64,1}, trade_time::Int64)::Int64 = maximum(filter(time -> time <= trade_time, exchange_rate_times))

trade_exchange_rate_join_condition(trade_time::Int64, exchange_rate_time::Int64)::Bool = exchange_rate_time == exchange_rate_time_search(exchange_rates.time, trade_time)

trades_and_exchange_rates = DataFrame(
  innerjoin(
    trade_time,
    exchange_rate_time,
    trade_and_exchange_rate,
    trade_exchange_rate_join_condition,
    eachrow(trades),
    eachrow(exchange_rates)
  )
)