How to quickly bulk insert into postgres

This is sort of a shot in the dark; I don't have any particular reason to believe there's a better solution to this, but I thought I'd give it a try.

I’m trying to insert data into tables on a postgres database. I’m using the excellent LibPQ.jl doing

Data.stream!(df, LibPQ.Statement, cnxn, str)

where str is an insert statement and df is the DataFrame I want to upload.

Everything works perfectly fine except that it is horrifyingly slow. It's taking something like 20 minutes to upload just 5e4 rows. I know that it's not a connection speed issue, since I get plenty of bandwidth on queries. It also seems highly unlikely that this is a Julia or LibPQ performance issue (other than that I might be using a sub-optimal function for doing this).

Does anyone know of some approach that’s not incredibly slow? I feel like this can’t possibly be the way people populate postgres databases…


The fastest way is to use the COPY command (see the PostgreSQL 9.6 documentation on COPY).

Also, what kind of INSERT statement do you use? Inserting row by row is extremely slow; you should use something like INSERT INTO foo(var1, var2) VALUES (1, 2), (2, 4), (3, 6), and so on, where each VALUES tuple creates a new row.
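
For illustration, here is a rough sketch of building such a multi-row INSERT from a DataFrame. The table foo and its columns are placeholders, and there is no quoting or escaping here, so this only suits trusted numeric data:

using DataFrames

df = DataFrame(var1 = [1, 2, 3], var2 = [2, 4, 6])

# Turn each row into a "(v1, v2)" tuple and join them into one statement
rows = [string("(", join(row, ", "), ")") for row in eachrow(df)]
sql  = "INSERT INTO foo (var1, var2) VALUES " * join(rows, ", ") * ";"
# sql == "INSERT INTO foo (var1, var2) VALUES (1, 2), (2, 4), (3, 6);"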

Just a guess, but it may be due to the need to use a transaction when doing a bulk insert. I.e. if you first start a transaction, then do your INSERT statement streaming, then COMMIT, usually the database can optimize the bulk insert quite a bit. We had to take this approach in SQLite.jl for performance.
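
A minimal sketch of that wrapping, assuming the same cnxn, df, and str as in the original post:

execute(cnxn, "BEGIN;")
Data.stream!(df, LibPQ.Statement, cnxn, str)
execute(cnxn, "COMMIT;")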


All sound advice, but at this point I'm wondering if I can do any of that with LibPQ. I suspect it's so slow mostly because it's inserting row by row.

It almost seems as if the only way to do it would be for LibPQ to create an intermediate buffer containing a CSV so that a Postgres COPY can then be called on it, which… just seems insane.

Here’s a very minimal working example, with available technology, that uses COPY instead of INSERT INTO in order to append a DataFrame to an existing PostgreSQL table, assuming all the columns and types are already lined up:

using LibPQ, CSV, DataFrames

function append(conn::LibPQ.Connection, table::String, df::DataFrame)
    # Serialize the DataFrame into a CSV string
    buf = IOBuffer()
    CSV.write(buf, df, header=false)
    data = String(take!(buf))

    # Send the COPY statement and check that the server entered COPY_IN mode
    result = execute(conn, "COPY $table FROM STDIN (FORMAT csv, HEADER FALSE)")
    status(result) == LibPQ.libpq_c.PGRES_COPY_IN || error("COPY did not enter COPY_IN mode")

    # Send the data
    LibPQ.libpq_c.PQputCopyData(conn.conn, data, Int32(sizeof(data))) == 1 || error("PQputCopyData failed")

    # Signal end-of-data
    LibPQ.libpq_c.PQputCopyEnd(conn.conn, C_NULL) == 1 || error("PQputCopyEnd failed")
end

In the example the whole DataFrame is converted into one big String before sending it to the database, but it could easily be done incrementally by issuing one PQputCopyData() call per chunk of data before calling PQputCopyEnd().
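
For example, a rough sketch of the incremental variant, with the same assumptions as the function above and an arbitrary chunk size of 10,000 rows (this replaces the single PQputCopyData call, after the COPY statement has been sent):

for rng in Iterators.partition(1:nrow(df), 10_000)
    buf = IOBuffer()
    CSV.write(buf, df[rng, :], header=false)
    piece = String(take!(buf))
    LibPQ.libpq_c.PQputCopyData(conn.conn, piece, Int32(sizeof(piece))) == 1 || error("PQputCopyData failed")
end
LibPQ.libpq_c.PQputCopyEnd(conn.conn, C_NULL) == 1 || error("PQputCopyEnd failed")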


Awesome, thanks for working on an example. This is very much the kind of thing I had in mind when I made my previous comment. It seems really unfortunate that we’d have to resort to something like this but at this point it seems like it may well be a limitation of libpq. I’m going to open an issue at LibPQ.jl to see if they are interested in adding something like this to the package.

Issue opened.


Is this still in the documentation? (https://invenia.github.io/LibPQ.jl/stable/)

For bulk insert you should use LibPQ.load!(table, conn, query), where table is anything that implements the Tables.jl interface (which includes DataFrame).

Though as far as I remember it only uses positional arguments instead of named arguments (i.e. the order of the columns rather than their names), so you have to take that into account when writing the query string.
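
For reference, a minimal sketch of that usage; the table mytable, its columns, and the connection string are placeholders, and the $1, $2 parameters are positional, so they must follow the column order of the DataFrame:

using LibPQ, DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
conn = LibPQ.Connection("dbname=postgres")  # assumed connection string

execute(conn, "BEGIN;")
LibPQ.load!(df, conn, "INSERT INTO mytable (a, b) VALUES (\$1, \$2);")
execute(conn, "COMMIT;")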


Will the DataFrame approach be preferable in any way over the copy-in approach from the docs, copyin = LibPQ.CopyIn("COPY libpqjl_test FROM STDIN (FORMAT CSV);", row_strings)? I am currently struggling with LibPQ.CopyIn. I followed the way the docs show to build that iterator, and I am getting UTF-8 errors on some of the CSV strings. Should I just pass in the DataFrame instead?

ERROR: CharacterNotInRepertoire: ERROR:  invalid byte sequence for encoding "UTF8": 0xa0

Both approaches are valid. load! is a way to perform repeated INSERTs with a list of argument sets, and since it's a normal INSERT (just repeated for each argument set) you can use all the functionality available (for example Postgres UPSERT).

COPY is a Postgres extension to copy data FROM files TO tables and vice versa. COPY is very fast, but the file has to have the exact structure of the table (especially if it’s binary or CSV without headers).
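
For illustration, both directions can be issued through LibPQ as plain statements; the table name and file paths are placeholders, the paths must be readable or writable by the server's OS user, and server-side file COPY typically requires superuser privileges:

execute(conn, "COPY mytable FROM '/tmp/mytable.csv' WITH (FORMAT csv, HEADER);")
execute(conn, "COPY mytable TO '/tmp/mytable_out.csv' WITH (FORMAT csv, HEADER);")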

In this case it's not using a file directly, though (I honestly never tried using it without a file). Either way, it's probably a general encoding error not related to LibPQ:

https://stackoverflow.com/questions/4867272/invalid-byte-sequence-for-encoding-utf8

And to answer which is better: if your data is already in a file and has the same schema as the table, use COPY (it's faster and more scalable; I use it to copy millions of lines and it takes a few seconds). If the data is in memory or you need a more customized insertion, use load!.


Thanks!
I set the encoding with psql=# copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251';
following this answer https://stackoverflow.com/a/23794054/410975, and then it worked. I had tried iconv, but it looks like I did not choose the right options. Now everything is loading and copying, etc. :)


I am having the following issues:

  1. I can't use COPY from a file at the moment due to permissions (this will probably change soon).
  2. When using COPY from STDIN, the table is unchanged and nothing happens. I am using
data = readlines(filepath)
copyin = CopyIn("COPY test FROM STDIN (FORMAT CSV, HEADER);", data)
execute(conn, copyin)
  3. When using load! it takes forever!
vals = mapreduce(x -> "\$$x",
                 (x, y) -> "$x, $y",
                 1:ncols)
statement = "insert into $name values ($vals);"
execute(conn, "begin;")
load!(data, conn, statement)
execute(conn, "commit;")

COPY from a server-side file is read by the PostgreSQL server process (usually running as the postgres OS user), so I suspect the file must be readable by that user and located in a directory it can access.

Did you commit the transaction after running the copy command? From your code above it does not look like it.
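
A minimal sketch of what that might look like, reusing conn, the test table, and filepath from your post above. It also keeps the trailing newline on each row (readlines strips it by default, and without it COPY sees one giant line, which might be part of why nothing happened):

data = readlines(filepath; keep=true)   # keep=true preserves the trailing '\n' on each row
copyin = LibPQ.CopyIn("COPY test FROM STDIN (FORMAT CSV, HEADER);", data)
execute(conn, "BEGIN;")
execute(conn, copyin)
execute(conn, "COMMIT;")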

I got superuser access to the new PostgreSQL server, so I was able to use the COPY command passing a file and it worked great.

For the COPY with STDIN, is it missing something to commit the transaction? I believe I copied those lines from the documentation.