Strategy for debugging heisenbug involving CSV and DuckDB?

I’ve been trying to come to grips with some very weird behaviour when using CSV.jl and DuckDB.jl in the same script. Basically I want to parse some large CSV files and store their data in a DuckDB database file using the Appender API.

However, the behaviour I’m seeing is that the appending sometimes fails with an error and nothing ends up in the .db file. But the behaviour is sensitive to extremely weird aspects of the script. I managed to reduce it to the full script shown all the way at the bottom (including reducing the schema and data used), but here’s the main part:

function go(db_file, csv_file)

    db = DuckDB.DB(db_file)
    DBInterface.execute(db, "create table if not exists jobs (year int not null)")

    appender = DuckDB.Appender(db, "jobs")

    # CSV.read call needs to be after creating the Appender for the bug to show up...
    println(csv_file)
    # Note: df result is not used at all!
    df = CSV.read(csv_file, DataFrame, delim="\t")
    #println(df)

    row = (2026,)
    for (idx, value) in enumerate(row)
        res = DuckDB.append(appender, value)
        check_for_error(appender, "append $idx", res)
    end

    #DuckDB.end_row(appender)
    res = DuckDB.duckdb_appender_end_row(appender.handle)
    check_for_error(appender, "end_row", res)

    res = DuckDB.duckdb_appender_close(appender.handle)
    check_for_error(appender, "close", res)
end

go(db_file, csv_file)

If I run this it shows an error from the appender API and the .db file contains no rows:

melis@blackbox 08:16:~/concepts/snellius-job-stats$ j --project -t1 bug_db_import_csv_file.jl test.db doh2.csv 
doh2.csv
Appender error (close): Appender: Attempting to flush data to a closed connection

melis@blackbox 08:16:~/concepts/snellius-job-stats$ duckdb test.db .dump
BEGIN TRANSACTION;
CREATE TABLE jobs("year" INTEGER NOT NULL);;
COMMIT;

But note that in this testcase script the data read from the CSV file into df is not used at all to fill the database (row is hardcoded here, while in my original production code df was iterated over). However, the CSV.read() call is needed for the DuckDB appender error to show up. Moving the CSV.read() call to happen before DuckDB.DB() or just before DuckDB.Appender() makes the error go away (and then data ends up in the database file).

I have been able to reduce the testcase and data from the original production case. But this one has been extremely elusive to pin down, with seemingly unrelated code having an influence on the occurrence of the error, such as if the df value is printed, or using CSV.read("doh2.csv", ... instead of CSV.read(csv_file, ... (where csv_file == "doh2.csv", yes, really).

This feels like some kind of memory corruption or even GC bug to me. I could use some help which step(s) would make sense to further figure this out. I’m not sure if the issue is in DuckDB, CSV, or even Julia itself. I can build Julia from source with some debugging flags enabled, but would need a bit of advice on what to use, and any help is appreciated.

Some more details:

The CSV file only contains 258 lines, and is around 115kB, but was derived from around 38k lines and is the smallest file I could create that shows the issue.

Julia info:

Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
  GC: Built with stock GC
Threads: 8 default, 1 interactive, 8 GC (on 8 virtual cores)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_NUM_THREADS = auto

Status `~/concepts/snellius-job-stats/Project.toml`
  [68d1b5e8] AdaptiveKDTrees v0.1.0
  [c7e460c6] ArgParse v1.2.0
  [69666777] Arrow v2.8.1
  [6e4b80f9] BenchmarkTools v1.8.0
  [336ed68f] CSV v0.10.16
  [159f3aea] Cairo v1.1.1
  [13f3f980] CairoMakie v0.15.10
  [324d7699] CategoricalArrays v1.1.1
  [8be319e6] Chain v1.0.0
  [35d6a980] ColorSchemes v3.31.0
  [861a8166] Combinatorics v1.1.0
  [a93c6f00] DataFrames v1.8.2
  [d2f5444f] DuckDB v1.5.2
  [cf6cc811] FunSQL v0.15.0
  [e9467ef8] GLMakie v0.13.10
  [1ecd5474] GraphMakie v0.6.3
  [86223c79] Graphs v1.14.0
⌅ [cd3eb016] HTTP v1.11.0
  [dc4a2547] J3 v0.2.5 `~/concepts/j3`
  [682c06a0] JSON v1.6.0
  [442fdcdd] Measures v0.3.3
⌅ [fa8bd995] MetaGraphsNext v0.7.5
  [39abe10b] MySQL v1.5.1
  [c020b1a1] NaturalSort v1.0.0
  [b8a86587] NearestNeighbors v0.4.27
  [46757867] NetworkLayout v0.4.10
  [9b87118b] PackageCompiler v2.3.0
  [626c502c] Parquet v0.8.6
  [995b91a9] PlotUtils v1.4.4
  [91a5bcdd] Plots v1.41.6
  [08abe8d2] PrettyTables v3.3.2
  [c4c386cf] Rsvg v1.1.0
  [e6c19c8d] SGtSNEpi v0.4.0
  [0aa819cd] SQLite v1.8.0
  [8fd88ec8] SankeyPlots v0.3.0
  [10745b16] Statistics v1.11.1
  [2913bbd2] StatsBase v0.34.11
  [f3b207a7] StatsPlots v0.15.8
  [b8865327] UnicodePlots v3.8.3
  [ade2ca70] Dates v1.11.0
  [de0858da] Printf v1.11.0
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

Full script:

using CSV, DataFrames, DuckDB

if length(ARGS) < 2 || (length(ARGS) == 1 && ARGS[1] == "-h")
    println("$(PROGRAM_FILE) file.db file.csv[.gz]")
    println()
    exit(-1)
end

db_file = ARGS[1]
csv_file = ARGS[2]

function check_for_error(appender, msg, res)
    if res == DuckDB.DuckDBSuccess
        return
    end

    # XXX not defined
    #error_data = DuckDB.duckdb_appender_error_data(appender.handle)
    #error_ptr = error_data.internal_pointer
    #DuckDB.duckdb_destroy_error_data(error_data)

    # XXX is deprecated
    error_ptr = DuckDB.duckdb_appender_error(appender.handle)
    if error_ptr == C_NULL
        error_message = "unknown error"
    else
        # XXX Sigh, https://github.com/duckdb/duckdb/pull/7427
        error_message = unsafe_string(error_ptr)
    end
    println("Appender error ($msg): $error_message")
end

function go(db_file, csv_file)

    db = DuckDB.DB(db_file)
    DBInterface.execute(db, "create table if not exists jobs (year int not null)")

    appender = DuckDB.Appender(db, "jobs")

    # CSV.read call needs to be after creating the Appender for the bug to show up...
    println(csv_file)
    df = CSV.read(csv_file, DataFrame, delim="\t")
    #println(df)

    row = (2026,)
    for (idx, value) in enumerate(row)
        res = DuckDB.append(appender, value)
        check_for_error(appender, "append $idx", res)
    end

    #DuckDB.end_row(appender)
    res = DuckDB.duckdb_appender_end_row(appender.handle)
    check_for_error(appender, "end_row", res)

    #res = DuckDB.duckdb_appender_flush(appender.handle)
    #check_for_error(appender, "flush end of $fname", res)

    #DuckDB.flush(appender)
    #res = DuckDB.duckdb_appender_flush(appender.handle)
    #check_for_error(appender, "flush", res)

    #DuckDB.close(appender)
    res = DuckDB.duckdb_appender_close(appender.handle)
    check_for_error(appender, "close", res)
end

go(db_file, csv_file)

Did some more testing: adding GC.enable(false) at the top makes the appender error go away, even in my original production code.

I can reproduce with your code and some ginned up data, and I think I have a fix. Got a write of a non-zero file.

using CSV, DataFrames, DuckDB
db_file = "donald.db"
csv_file = "duck.tsv"

function check_for_error(appender, msg, res)
    if res == DuckDB.DuckDBSuccess
        return
    end

    # XXX not defined
    #error_data = DuckDB.duckdb_appender_error_data(appender.handle)
    #error_ptr = error_data.internal_pointer
    #DuckDB.duckdb_destroy_error_data(error_data)

    # XXX is deprecated
    error_ptr = DuckDB.duckdb_appender_error(appender.handle)
    if error_ptr == C_NULL
        error_message = "unknown error"
    else
        # XXX Sigh, https://github.com/duckdb/duckdb/pull/7427
        error_message = unsafe_string(error_ptr)
    end
    println("Appender error ($msg): $error_message")
end

function go(db_file, csv_file)

    db = DuckDB.DB(db_file)
    DBInterface.execute(db, "create table if not exists jobs (year int not null)")

    appender = DuckDB.Appender(db, "jobs")

    # CSV.read call needs to be after creating the Appender for the bug to show up...
    println(csv_file)
    df = CSV.read(csv_file, DataFrame, delim="\t")
    #println(df)

    row = (2026,)
    for (idx, value) in enumerate(row)
        res = DuckDB.append(appender, value)
        check_for_error(appender, "append $idx", res)
    end

    #DuckDB.end_row(appender)
    res = DuckDB.duckdb_appender_end_row(appender.handle)
    check_for_error(appender, "end_row", res)

    #res = DuckDB.duckdb_appender_flush(appender.handle)
    #check_for_error(appender, "flush end of $fname", res)

    #DuckDB.flush(appender)
    #res = DuckDB.duckdb_appender_flush(appender.handle)
    #check_for_error(appender, "flush", res)

    #DuckDB.close(appender)
    res = DuckDB.duckdb_appender_close(appender.handle)
    check_for_error(appender, "close", res)
end

go(db_file, csv_file) # same error

function go2(db_file, csv_file)
    db = DuckDB.DB(db_file)
    try
        DBInterface.execute(db, "create table if not exists jobs (year int not null)")
        df = CSV.read(csv_file, DataFrame; delim='\t')
        appender = DuckDB.Appender(db, "jobs")
        GC.@preserve db begin                 # pin db across the appender's lifetime
            DuckDB.append(appender, Int32(2026))
            DuckDB.end_row(appender)
            DuckDB.close(appender)            # flush + destroy via the wrapper
        end
    finally
        DuckDB.close(db)                      # checkpoint; don't wait on the finalizer
    end
end
go2(db_file, csv_file) # creates file

Well, it’s good to have a workaround, but it seems the Appender code is not doing the right thing by not keeping a reference to the DB (plus some other things that need fixing). I submitted Fixes to Julia Appendar code (untested) · Pull Request #22969 · duckdb/duckdb, but need to figure out how to run the Julia tests as part of the DuckDB build