How can I efficiently populate a DB table from very long tables (say, millions of rows)? I'm trying to compare SQLite vs DuckDB in this scenario, but I think I'm doing it very inefficiently:
using SQLite, DuckDB, Tables
con_sqlite = DBInterface.connect(SQLite.DB, "test.sqlite3")
con_duckdb = DBInterface.connect(DuckDB.DB, "test1.duckdb")
len = 10_000
table_data = (a = collect(1:len), b = rand(1:100, len))
# create a table
create_query = map((k, v) -> "$k INT NOT NULL,", keys(table_data), table_data) |>   # or use $(SQLite.sqlitetype(eltype(v))) for the column type
    x -> join(x, "\r\n")[1:end-1] |>   # drop the trailing comma
    x -> "CREATE TABLE data(\r\n$x\r\n);"
DBInterface.execute(con_sqlite, create_query)
DBInterface.execute(con_duckdb, create_query)
# insert rows via a prepared statement with "?" placeholders
str = join(repeat('?', length(table_data)), ',')  # "?,?"
write_query = DBInterface.prepare(con_sqlite, "INSERT INTO data VALUES($str)")
@time DBInterface.executemany(write_query, table_data)
# first run: 0.211553 seconds (440.17 k allocations: 24.565 MiB, 95.13% compilation time)
# second run: 0.010278 seconds (89.52 k allocations: 6.404 MiB
write_query = DBInterface.prepare(con_duckdb, "INSERT INTO data VALUES($str)")
@time DBInterface.executemany(write_query, table_data)
# first run: 3.263211 seconds (1.10 M allocations: 80.216 MiB, 0.40% gc time, 6.34% compilation time)
# second run: 8.388220 seconds (797.40 k allocations: 64.408 MiB, 0.23% gc time)
# read the table back as a columntable
@time table_rd = DBInterface.execute(con_sqlite, "SELECT * FROM data") |> columntable
# first run: 0.874020 seconds (3.02 M allocations: 156.725 MiB, 8.00% gc time, 96.61% compilation time)
# second run: 0.011222 seconds (78.53 k allocations: 3.363 MiB)
@time table_rd = DBInterface.execute(con_duckdb, "SELECT * FROM data") |> columntable
# first run: 0.553115 seconds (1.93 M allocations: 98.399 MiB, 4.10% gc time, 99.20% compilation time)
# second run: 0.005292 seconds (58.69 k allocations: 1.397 MiB)
DBInterface.close!(con_sqlite)
DBInterface.close!(con_duckdb)
Inserting a million rows, even within one transaction (which I think executemany uses), still takes a lot of time. In this example I tried to insert just 10,000 rows and DuckDB took 3 to 8 seconds (surprisingly, the second run takes longer even on a fresh DB file).
Here’s a much faster way to load data into DuckDB - it requires the data to be in a DataFrames.jl DataFrame.
using DataFrames
df = DataFrame(table_data)
# doesn't load the data, but makes the dataframe visible as a view in DuckDB
DuckDB.register_data_frame(con_duckdb, df, "table_data")
# create a table from that view and store in the DB
DBInterface.execute(con_duckdb,
"CREATE TABLE data AS SELECT * from table_data"
)
SQLite.jl seems generally quite fast in my experience (though it could be made even faster with better type stability). Maybe DuckDB.jl just isn’t optimized much for interop with Julia tables yet…
In particular, writing is very slow in your benchmark - could putting the executemany call in an explicit transaction help?
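A minimal sketch of what I mean, assuming plain BEGIN/COMMIT statements issued through DBInterface.execute work for both backends (if executemany already wraps everything in a single transaction, as suspected above, this may not change much):
# wrap the prepared-statement inserts in an explicit transaction (sketch)
DBInterface.execute(con_duckdb, "BEGIN TRANSACTION")
write_query = DBInterface.prepare(con_duckdb, "INSERT INTO data VALUES($str)")
DBInterface.executemany(write_query, table_data)
DBInterface.execute(con_duckdb, "COMMIT")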
I’ve been eyeing DuckDB to replace my usage of SQLite, so I’m also looking forward to advice on this workflow!
To my understanding, what @chris-b1 proposes is a fast way (and probably the preferred one if you already use DataFrames.jl). Registering a data frame in DuckDB does not copy the data.
Would be great to have a general and fast way to insert rows into DuckDB - like executemany, which already works for many databases. What makes it orders of magnitude slower for DuckDB?
Yes, DuckDB provides its own load! function - but it only supports DataFrames, so every table has to be converted to that type first. Where is the Julia/Tables.jl composability? (:
Btw, its argument order differs from SQLite.load! - yeah, this function isn’t in DBInterface so they don’t really have to follow anything, but consistency would be nice.
At least reading DuckDB tables seems faster than SQLite (see 1st post) - as could be expected.
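For concreteness, this is roughly how the two calls compare. The SQLite.load! form is from its docs; the DuckDB.load! line is commented out because the exact argument order there is my assumption based on this thread, so check the version you have installed (the table names are just for illustration):
# SQLite.jl: table source first, then the database and the target table name
SQLite.load!(table_data, con_sqlite, "data_loaded")
# DuckDB.jl: presumably connection first, then a DataFrame and the table name (assumed)
# DuckDB.load!(con_duckdb, DataFrame(table_data), "data_loaded")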
Whelp, I guess the inconsistent argument order is actually my fault, since I PR’d that function. It could still be changed - I don’t think many people know it exists… But to be fair, it is SQLite.jl being inconsistent with the ! convention in the first place, because the thing that is being modified is the database, not the inserted table.
@quinnj - indeed, it would be good to make it consistent, especially as I expect DuckDB.jl to become a very important component of the JuliaData ecosystem (DuckDB is constantly getting loads of excellent features). Maybe it would even make sense to define the signature in DBInterface, to make sure every database that supports it does so consistently?
For now I’ve just been evaluating DuckDB.jl for potential use, not using it myself. It looks very tightly tied to DataFrames, unlike other database packages. Even taking DataFrames as a dependency seems odd for a database package (it’s quite a heavy dep), and even more so exposing/requiring those types in the interface.
I guess I’ll just continue using the mature sqlite and SQLite.jl, leaving some performance on the table in the rare case of large databases, and hoping that someone more familiar with duckdb/DuckDB.jl makes that package better integrated with Julia types/tables.
I remember seeing an issue in the DuckDB repo about adopting the Tables.jl interface, but I couldn’t find it anymore. It would definitely be nice to have, maybe alongside integration with Arrow.jl, which is already available for Python and R.
Column vs row storage makes a difference here. If I’m not mistaken, other column stores like ClickHouse have similar limitations.
One reason could be that the fast data insert path of registering an external table and copying from it relies on a particular memory layout. Tables.jl doesn’t expose enough functionality to figure that out easily, but DataFrames should. In the meantime, I noticed DuckDB.jl does expose a low-level appender API which should work with arbitrary data.
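In case it helps, here’s a minimal sketch of that appender API as I understand it from the DuckDB.jl docs (DuckDB.Appender, DuckDB.append, DuckDB.end_row, DuckDB.close - worth verifying against your installed version), reusing con_duckdb, table_data and len from the first post:
# row-wise bulk insert through the appender; the target table must already exist
appender = DuckDB.Appender(con_duckdb, "data")
for i in 1:len
    DuckDB.append(appender, table_data.a[i])
    DuckDB.append(appender, table_data.b[i])
    DuckDB.end_row(appender)   # finish the current row
end
DuckDB.close(appender)         # flush remaining rows and release the appender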
Fair point! Seems like the executemany function cannot take advantage of the whole dataset being available immediately, and runs INSERTs one by one.
Surprised to hear that. Many (most? all?) column-based table types in Julia provide an efficient zero-copy Tables.columntable(tbl) conversion that returns a NamedTuple of vectors.
Is there anything else needed for an efficient load!?
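A tiny illustration (for a plain DataFrame the columns come back as ordinary Vectors, with no copy made):
using DataFrames, Tables
df = DataFrame(a = collect(1:3), b = rand(3))
cols = Tables.columntable(df)    # NamedTuple of the underlying column vectors
cols isa NamedTuple              # true
cols.a isa Vector{Int64}         # true
cols.a === df.a                  # true - no copy was made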
Interesting. Is that guaranteed to return contiguous Vectors of not-too-exotic types or just any old AbstractVector? If the former, I guess someone would “just” have to generalize the DuckDB table integration to handle this.