Write a table into SQLite and DuckDB in efficient manner

I’m not sure how important having contiguous or low-latency access to elements in the vector would be. Probably some AbstractVectors would not qualify. The code lives in duckdb/data_frame_scan.jl at master · duckdb/duckdb · GitHub, so people more familiar with the data ecosystem than I can determine what would be needed for broader Tables.jl support.

A cursory look suggests converting that file from dataframes to arbitrary tables (through columntable) is pretty easy. However, I cannot simply run DuckDB.jl tests successfully, all of these fail for me:

  • ]dev DuckDB + ]test DuckDB
  • clone duckdb, go to juliapkg dir, do ]test
  • clone duckdb, go to juliapkg dir, do ./test.sh

Also, their contribution rules aren’t clear: do you know if they even accept outside PRs?

Opened a feature request at DuckDB here - Feature Request: juliapkg - Tables.jl integration · Discussion #6407 · duckdb/duckdb · GitHub

1 Like

Tried once again, and now could run DuckDB tests successfully.
Converting it from dataframes to builtin tables was laborious but straightforward: integrate with julia tables by aplavin · Pull Request #7984 · duckdb/duckdb · GitHub.
This PR makes DuckDB.jl both more general (support any table), and much lighter (# deps is halved). Hope it will be accepted.

3 Likes

If you need to append data to an existing table, you can also use the appender api to write data rather than executemany queries. In my test, that improved the performance significantly to be better than sqlite.

SQLite (2nd run):

str = join(repeat('?', length(table_data)), ',')
write_query = DBInterface.prepare(con_sqlite, "INSERT INTO data VALUES($str)")
@time DBInterface.executemany(write_query, table_data) 
# second run: 0.009210 seconds (159.03 k allocations: 7.922 MiB)

Duckdb appender:

using DataFrames
db = DuckDB.DB("test1.duckdb")
appender = DuckDB.Appender(db, "data")
table_frame = rowtable(table_data) |> DataFrames.DataFrame;
@time begin
for i in eachrow(table_frame)
    for j in i
        DuckDB.append(appender, j)
    end
    DuckDB.end_row(appender)
end
DuckDB.flush(appender)
# second run: 0.005218 seconds (117.96 k allocations: 3.021 MiB)
3 Likes