Very Slow `DBInterface.execute` on large duckdb relative to R

Hi all,

I’ve made a clone of @grantmcdermott 's nice workshop on duckdb and polars - adding a chapter on julia and duckdb. He had pointed out some very slow julia timings in his slides, and actually created an issue on DBInterface.jl. We don’t have a MWE but we have a WE - this example only makes sense on the big (8.6Gb) dataset we are using. There are straightforward instructions for fast data download contained within the workshop website here. But first the result.

R

from here:

tic = Sys.time()
dat1 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
toc = Sys.time()
toc - tic

Time difference of 2.233316 secs

julia

from here:

time1 = @elapsed dat1 = DBInterface.execute(
    con,
    """
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
        passenger_count,
        AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    """
    )

6.27071575

This is all the more surprising as the julia example does not even convert the result to a dataframe yet, while the dbGetQuery call in R does that as well. I find it really puzzling as I was expecting to obtain close to identical timing (given that the actual work is done by duckdb and neither R nor julia.)

Any insights on this one?

1 Like

This is indeed strange. Taking a look into DuckDB.jl, DBInterface.execute for DuckDB.jl is a very simple function that leads to a ccall, and I can’t even imagine it causing a performance bottleneck.

Maybe the issue has something to do with the version of DuckDB used? It seems that DuckDB.jl uses v1.1.0 while duckdb-r uses v1.1.1.

There has been a recent update of DuckDB to v1.1, and some reports of performance issues, so it may be related.

p.s. In my humble opinion, I don’t think DBInterface.jl is to blame at all - it’s just an interface for devs to conform to, and it’s up to the each individual implementer of DBInterface how it works internally. Plus, it seems that DBInterface.execute in this case entirely depends on the implementation in DuckDB.jl.

1 Like

Ok thanks for looking at this. Yeah I also don’t think it’s the interface at fault but I had no idea where else to start asking about this.

1 Like

duckdb will default to the number of threads in Julia process (1) whereas R will use the number of cores on your machine. Try to make them equal beforehand with “SET theads to 1”. Also you can call DuckDB.execute() and skip DBInterface if you want as well.

2 Likes

bingo! thanks a lot. identical timings with identical number of threads. great.

1 Like