### What happens?
Interesting coincidence: earlier today I created a query that aggregates the join between a hive-partitioned Parquet dataset and a PostgreSQL table, and I found it very fast, as I expected.
Having heard about the 1.1.0 release, I quickly tried it, only to find that performance was substantially slower. After adjusting the query, performance was regained and ended up better than on 1.0.0.
This is running `duckdb` 1.1.0 vs 1.0.0 CLI on Rocky Linux 9 with 8x CPU, 32 GB RAM.
### To Reproduce
PostgreSQL is running on the same host as the `duckdb` CLI and holds a reference table with ~100 rows.
The Parquet dataset is hosted on a MinIO server on the LAN, partitioned by `year` and `filename`; the partition we're using holds 2.8 GB in 99 Parquet files.
I cannot share the exact query due to privacy issues, so I've renamed most identifiers.
Here it is:
```
-- PostgreSQL has a reference.products table we need
-- -------------------------------------------------
-- - It has under 100 rows.
ATTACH '' AS psql (TYPE postgres, READ_ONLY);
-- Sales data is on hive-partitioned Parquet dataset on MinIO
-- ----------------------------------------------------------
-- - Partitions are /year={yyyy}/filename={filename}/{data_file}.parquet.
-- - The partition we're querying holds 2.8 GB in 99 Parquet files.
CREATE OR REPLACE SECRET my_secret (
TYPE S3,
KEY_ID '...',
SECRET '...',
SCOPE 's3://bucket-name',
ENDPOINT 'my-minio.fully.qualified.name',
URL_STYLE 'path'
);
-- Here we go
-- ----------
-- Aggregate 2022 daily sales metrics for 'simple'-typed products.
WITH sales AS (
SELECT *
FROM parquet_scan(
's3://bucket-name/key/**/*.parquet',
hive_partitioning=true
)
),
products AS (
SELECT id, type
FROM psql.reference.products
),
c_d_map AS (
SELECT *
FROM (VALUES
('A', 1),
('B', 1),
('C', -1),
('D', -1)
) AS t(c_d, value)
)
SELECT
s.client_id,
date_trunc('day', s.ts) AS day,
SUM(s.qty) AS qty_sum,
SUM(s.g_a) AS g_a_sum,
SUM(s.i_v) AS i_v_sum,
SUM(s.v_v) AS v_v_sum,
SUM(s.n_v) AS n_v_sum,
SUM(s.s) AS s_sum,
SUM(s.rebate) AS rebate_sum,
SUM(COALESCE(m.value, 0)) AS cred_deb_vsum,
COUNT(DISTINCT s.card) AS count_distinct_card,
SUM(s.v_e_r) AS v_e_r_sum,
SUM(s.c) AS c_sum,
SUM(s.c_r) AS c_r_sum
FROM
sales s
LEFT JOIN products p ON (s.product_id = p.id)
JOIN c_d_map m ON (s.c_d = m.c_d)
WHERE
s.year = 2022
AND
p.type = 'simple'
GROUP BY ALL
ORDER BY 1, 2
;
```
Result:
* 7 301 457 rows.
* `1.0.0` takes ~17 seconds.
* `1.1.0` takes ~40 seconds.
Converting the `JOIN c_d_map` into a `LEFT JOIN` changes things for the better (the exact change is shown right after the table and its notes). Here's a summary of my observations:
| variant | duration (s) | max RAM (GB) | CPU behaviour |
|---------|--------------|--------------|---------------|
| `1.0.0` JOIN | ~17 | ~6.2 | ~300% most of the time, 600-700% near the end |
| `1.1.0` JOIN | ~40 | ~7.5 | ~100% most of the time, 50% for a while, 600-700% near the end |
| `1.0.0` LEFT JOIN | ~17 | ~6.2 | ~300% most of the time, 600-700% near the end |
| `1.1.0` LEFT JOIN | ~14 | ~7.9 | ~300% most of the time, 600-700% near the end |
* duration is real time obtained from `time duckdb < query.sql`.
* max RAM and CPU behaviour were observed via `top`, so they are approximations.
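For clarity, the only difference between the `JOIN` and `LEFT JOIN` variants is the join on `c_d_map`; everything else in the query is exactly as above, and the result is the same because every `s.c_d` value has a matching row in `c_d_map`:
```
-- Faster variant on 1.1.0: the inner join on the credit/debit map
-- becomes an outer join; the rest of the query is unchanged.
FROM
    sales s
    LEFT JOIN products p ON (s.product_id = p.id)
    LEFT JOIN c_d_map m ON (s.c_d = m.c_d)  -- was: JOIN c_d_map m ON (s.c_d = m.c_d)
```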
More:
* I ran `EXPLAIN` on both versions and, from what I can tell, the plans are the same (see the profiling sketch right after this list).
* Since the semantics are identical (every joined value always finds a match), I had gone with a `JOIN` rather than a `LEFT JOIN`, which is how I bumped into this.
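If per-operator timings would help, I can re-run both versions with profiling enabled. A sketch of what I'd prefix to `query.sql` (I haven't run it this way yet; the output path is just an example):
```
-- Sketch only (not part of the original runs): enable DuckDB's built-in
-- profiler so each run writes per-operator timings to a file.
PRAGMA enable_profiling = 'json';
PRAGMA profiling_output = '/tmp/duckdb_profile.json';  -- example path
-- ... followed by the full query from above ...
```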
Wondering:
* Could it be due to the new CTE materialization mechanisms? (A possible way to check is sketched below.)
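If it helps narrow this down, I can re-run with an explicit materialization hint on the `sales` CTE and compare timings on both versions. A sketch of what I have in mind follows; I'm assuming the `MATERIALIZED` / `NOT MATERIALIZED` hints are accepted here, which I haven't verified on 1.1.0:
```
-- Sketch, not yet run: force inlining of the large CTE on 1.1.0 (or force
-- materialization on 1.0.0 with MATERIALIZED) and compare against the
-- timings above. Assumes the hint syntax is supported; unverified.
WITH sales AS NOT MATERIALIZED (
    SELECT *
    FROM parquet_scan(
        's3://bucket-name/key/**/*.parquet',
        hive_partitioning=true
    )
),
-- ... the remaining CTEs and the final SELECT are unchanged from above ...
```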
Parting words:
* Thanks for sharing such a wonderful and powerful tool.
* If I can help prevent others from running into what could be considered a performance regression, I'll be happy to assist.
### OS:
Rocky Linux 9 x86_64
### DuckDB Version:
1.1.0
### DuckDB Client:
CLI
### Hardware:
VM with 32 GB RAM and 8x CPU
### Full Name:
Tiago Montes
### Affiliation:
Freelance Consultant
### What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
### Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
### Did you include all code required to reproduce the issue?
- [X] Yes, I have
### Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [X] Yes, I have