I am having an issue with memory allocation using the DuckDB.jl package. I am querying subsets of a large table (in a database stored in a file) and writing the results to Parquet files. Given a table with columns `id1`, `id2`, `first_date`, and `last_date`, the following eventually runs out of memory:
```julia
con = DBInterface.connect(DuckDB.DB, "database.db")
DBInterface.execute(con, "PRAGMA threads=8;")
for curdate in query_dates
    DBInterface.execute(con,
        """
        COPY
            (SELECT
                id1,
                id2
            FROM my_large_table
            WHERE (CAST('$curdate' AS DATE) + INTERVAL 6 DAY) >= first_date
                AND '$curdate' <= last_date)
        TO 'data/$(curdate).parquet' (COMPRESSION ZSTD);
        """)
end
```
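For what it's worth, the only related knob I have found so far is DuckDB's `memory_limit` pragma, which should cap the buffer manager. A sketch of the same loop with an explicit cap (the `4GB` value is an arbitrary assumption on my part, and I have not verified that it changes the behavior from DuckDB.jl):

```julia
con = DBInterface.connect(DuckDB.DB, "database.db")
DBInterface.execute(con, "PRAGMA threads=8;")
# Assumption: cap DuckDB's buffer manager; '4GB' is an arbitrary example value.
DBInterface.execute(con, "PRAGMA memory_limit='4GB';")
for curdate in query_dates
    DBInterface.execute(con,
        """
        COPY
            (SELECT id1, id2
             FROM my_large_table
             WHERE (CAST('$curdate' AS DATE) + INTERVAL 6 DAY) >= first_date
                 AND '$curdate' <= last_date)
        TO 'data/$(curdate).parquet' (COMPRESSION ZSTD);
        """)
    GC.gc()  # nudge Julia's collector, in case result objects are pinning memory
end
```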
An equivalent query using Python works (it fills the cache but frees memory when needed):
```python
# %%
import duckdb
from tqdm import tqdm

con = duckdb.connect('database.db', read_only=True)

def make_querystring(curdate):
    return f"""
    COPY
        (SELECT
            id1,
            id2
        FROM my_large_table
        WHERE (CAST('{curdate}' AS DATE) + INTERVAL 6 DAY) >= first_date
            AND '{curdate}' <= last_date)
    TO 'data/{curdate}.parquet' (COMPRESSION ZSTD);
    """

for cur_date in tqdm(date_list):
    cur_q: str = make_querystring(cur_date)
    con.sql(cur_q)
```
Any ideas how I can force DuckDB.jl to free memory? (Closing the result or the database after every query did not work in my tests.)
Thanks!