How can I efficiently populate a DB table from very long tables (say, millions of rows)? I'm trying to compare SQLite vs DuckDB in this scenario, but I think I'm doing it very inefficiently:
using SQLite, DuckDB, Tables
con_sqlite = DBInterface.connect(SQLite.DB, "test.sqlite3")
con_duckdb = DBInterface.connect(DuckDB.DB, "test1.duckdb")
len = 10_000
table_data = (a = collect(1:len), b = rand(1:100, len))
# create a table
create_query = map((k, v) -> "$k INT NOT NULL,", keys(table_data), table_data) |>   # or use $(SQLite.sqlitetype(eltype(v))) for the column type
    x -> join(x, "\r\n")[1:end-1] |>   # drop the trailing comma
    x -> "CREATE TABLE data(\r\n$x\r\n);"
DBInterface.execute(con_sqlite, create_query)
DBInterface.execute(con_duckdb, create_query)
# insert rows via a prepared statement with "?" placeholders
str = join(repeat('?', length(table_data)), ',')  # "?,?"
write_query = DBInterface.prepare(con_sqlite, "INSERT INTO data VALUES($str)")
@time DBInterface.executemany(write_query, table_data)
# first run: 0.211553 seconds (440.17 k allocations: 24.565 MiB, 95.13% compilation time)
# second run: 0.010278 seconds (89.52 k allocations: 6.404 MiB
write_query = DBInterface.prepare(con_duckdb, "INSERT INTO data VALUES($str)")
@time DBInterface.executemany(write_query, table_data)
# first run: 3.263211 seconds (1.10 M allocations: 80.216 MiB, 0.40% gc time, 6.34% compilation time)
# second run: 8.388220 seconds (797.40 k allocations: 64.408 MiB, 0.23% gc time)
# read the table back as a columntable
@time table_rd = DBInterface.execute(con_sqlite, "SELECT * FROM data") |> columntable
# first run: 0.874020 seconds (3.02 M allocations: 156.725 MiB, 8.00% gc time, 96.61% compilation time)
# second run: 0.011222 seconds (78.53 k allocations: 3.363 MiB)
@time table_rd = DBInterface.execute(con_duckdb, "SELECT * FROM data") |> columntable
# first run: 0.553115 seconds (1.93 M allocations: 98.399 MiB, 4.10% gc time, 99.20% compilation time)
# second run: 0.005292 seconds (58.69 k allocations: 1.397 MiB)
DBInterface.close!(con_sqlite)
DBInterface.close!(con_duckdb)
Inserting a million rows, even within one transaction (which I think executemany uses), still takes a lot of time. In this example I tried to insert just 10,000 rows and DuckDB took 3 to 8 seconds (surprisingly, the second run takes longer even on a fresh DB file).
Here’s a much faster way to load data into DuckDB - it requires the data to be in a DataFrames.jl DataFrame.
using DataFrames
df = DataFrame(table_data)
# doesn't load the data, but makes the dataframe visible as a view in DuckDB
DuckDB.register_data_frame(con_duckdb, df, "table_data")
# create a table from that view and store in the DB
DBInterface.execute(con_duckdb,
"CREATE TABLE data AS SELECT * from table_data"
)
SQLite.jl seems generally quite fast in my experience (though it could be made even faster with better type stability). Maybe DuckDB.jl just isn’t optimized much for interop with Julia tables yet…
In particular, writing is very slow in your benchmark - could putting the executemany call in an explicit transaction help?
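A minimal sketch of what I mean, assuming plain BEGIN/COMMIT statements issued through DBInterface.execute work for both backends (if executemany already wraps everything in a single transaction, as suspected above, this may not change much):
# wrap the prepared-statement inserts in an explicit transaction (sketch)
DBInterface.execute(con_duckdb, "BEGIN TRANSACTION")
write_query = DBInterface.prepare(con_duckdb, "INSERT INTO data VALUES($str)")
DBInterface.executemany(write_query, table_data)
DBInterface.execute(con_duckdb, "COMMIT")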
I’ve been eyeing DuckDB to replace my usage of SQLite, so I’m also looking forward to advice on this workflow!
To my understanding, what @chris-b1 proposes is a fast way (and probably the preferred one if you already use DataFrames.jl). Registering a data frame in DuckDB does not copy the data.
Would be great to have a general and fast way to insert rows into DuckDB - like executemany, which already works for many databases. What makes it orders of magnitude slower for DuckDB?
Yes, DuckDB provides its own load! function - but it only supports DataFrames, so every table has to be converted to that type first. Where is the Julia/Tables.jl composability? (:
Btw, its argument order differs from SQLite.load! - yeah, this function isn’t in DBInterface so they don’t really have to follow anything, but consistency would be nice.
At least reading DuckDB tables seems faster than SQLite (see 1st post) - as could be expected.
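For concreteness, this is roughly how the two calls compare. The SQLite.load! form is from its docs; the DuckDB.load! line is commented out because the exact argument order there is my assumption based on this thread, so check the version you have installed (the table names are just for illustration):
# SQLite.jl: table source first, then the database and the target table name
SQLite.load!(table_data, con_sqlite, "data_loaded")
# DuckDB.jl: presumably connection first, then a DataFrame and the table name (assumed)
# DuckDB.load!(con_duckdb, DataFrame(table_data), "data_loaded")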
Whelp, I guess the inconsistent argument order is actually my fault, since I PR’d that function. It could still be changed - I don’t think many people know it exists… But to be fair, it is SQLite.jl being inconsistent with the ! convention in the first place, because the thing that is being modified is the database, not the inserted table.
@quinnj - indeed, it would be good to make it consistent, especially as I expect DuckDB.jl to become a very important component of the JuliaData ecosystem (DuckDB is constantly getting loads of excellent features). Maybe it would even make sense to define the signature in DBInterface, to make sure every database that supports it does so consistently?
For now I’ve just been evaluating DuckDB.jl for potential use, not using it myself. It looks very tightly tied to DataFrames, unlike other database packages. Even taking DataFrames as a dependency seems odd for a database package (it’s quite a heavy dep), and even more so exposing/requiring those types in the interface.
I guess I’ll just continue using the mature sqlite and SQLite.jl, leaving some performance on the table in the rare case of large databases, and hoping that someone more familiar with duckdb/DuckDB.jl makes that package better integrated with Julia types/tables.
I remember seeing an issue in the DuckDB repo about adopting the Tables.jl interface, but I couldn’t find it anymore. It would definitely be nice to have, maybe alongside integration with Arrow.jl, which is already available for Python and R.
Column vs row storage makes a difference here. If I’m not mistaken, other column stores like ClickHouse have similar limitations.
One reason could be that the fast data insert path of registering an external table and copying from it relies on a particular memory layout. Tables.jl doesn’t expose enough functionality to figure that out easily, but DataFrames should. In the meantime, I noticed DuckDB.jl does expose a low-level appender API which should work with arbitrary data.
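In case it helps, here’s a minimal sketch of that appender API as I understand it from the DuckDB.jl docs (DuckDB.Appender, DuckDB.append, DuckDB.end_row, DuckDB.close - worth verifying against your installed version), reusing con_duckdb, table_data and len from the first post:
# row-wise bulk insert through the appender; the target table must already exist
appender = DuckDB.Appender(con_duckdb, "data")
for i in 1:len
    DuckDB.append(appender, table_data.a[i])
    DuckDB.append(appender, table_data.b[i])
    DuckDB.end_row(appender)   # finish the current row
end
DuckDB.close(appender)         # flush remaining rows and release the appender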
Fair point! Seems like the executemany function cannot take advantage of the whole dataset being available immediately, and runs INSERTs one by one.
Surprised to hear that. Many (most? all?) column-based table types in Julia provide an efficient zero-copy Tables.columntable(tbl) conversion that returns a NamedTuple of vectors.
Is there anything else needed for an efficient load!?
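A tiny illustration (for a plain DataFrame the columns come back as ordinary Vectors, with no copy made):
using DataFrames, Tables
df = DataFrame(a = collect(1:3), b = rand(3))
cols = Tables.columntable(df)    # NamedTuple of the underlying column vectors
cols isa NamedTuple              # true
cols.a isa Vector{Int64}         # true
cols.a === df.a                  # true - no copy was made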
Interesting. Is that guaranteed to return contiguous Vectors of not-too-exotic types or just any old AbstractVector? If the former, I guess someone would “just” have to generalize the DuckDB table integration to handle this.