How to quickly bulk insert into postgres

This is sort of a shot in the dark; I don't have any particular reason to believe there's a better solution to this, but I thought I'd give it a try.

I’m trying to insert data into tables on a postgres database. I’m using the excellent LibPQ.jl doing

Data.stream!(df, LibPQ.Statement, cnxn, str)

where str is an insert statement and df is the DataFrame I want to upload.

Everything works perfectly fine except that it is horrifyingly slow. It's taking something like 20 minutes to upload just 5e4 rows. I know that it's not a connection speed issue, since I get plenty of bandwidth on queries. It also seems highly unlikely that this is a Julia or LibPQ performance issue (other than that I might be using a sub-optimal function for doing this).

Does anyone know of some approach that’s not incredibly slow? I feel like this can’t possibly be the way people populate postgres databases…


The fastest way is to use the COPY command (see the PostgreSQL 9.6 documentation on COPY).

Also, what kind of INSERT statement do you use? Inserting row by row is extremely slow; you should use something like INSERT INTO foo(var1, var2) VALUES (1, 2), (2, 4), (3, 6), and so on, where each VALUES tuple creates a new row.
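
For illustration, here is a rough sketch of building such a multi-row INSERT from a DataFrame. The table foo and its columns are placeholders, and there is no quoting or escaping here, so this only suits trusted numeric data:

using DataFrames

df = DataFrame(var1 = [1, 2, 3], var2 = [2, 4, 6])

# Turn each row into a "(v1, v2)" tuple and join them into one statement
rows = [string("(", join(row, ", "), ")") for row in eachrow(df)]
sql  = "INSERT INTO foo (var1, var2) VALUES " * join(rows, ", ") * ";"
# sql == "INSERT INTO foo (var1, var2) VALUES (1, 2), (2, 4), (3, 6);"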

Just a guess, but it may be due to the need to use a transaction when doing a bulk insert. I.e. if you first start a transaction, then do your INSERT statement streaming, then COMMIT, usually the database can optimize the bulk insert quite a bit. We had to take this approach in SQLite.jl for performance.
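
A minimal sketch of that wrapping, assuming the same cnxn, df, and str as in the original post:

execute(cnxn, "BEGIN;")
Data.stream!(df, LibPQ.Statement, cnxn, str)
execute(cnxn, "COMMIT;")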


All sound advice, but at this point I'm wondering if I can do any of that with LibPQ. I suspect it's so slow mostly because it's inserting row by row.

It almost seems as if the only way to do it would be for LibPQ to create an intermediate buffer containing a CSV so that a Postgres COPY can then be called on it, which… just seems insane.

Here’s a very minimal working example, with available technology, that uses COPY instead of INSERT INTO in order to append a DataFrame to an existing PostgreSQL table, assuming all the columns and types are already lined up:

using LibPQ, CSV, DataFrames

function append(conn::LibPQ.Connection, table::String, df::DataFrame)
    # Serialize the DataFrame into a CSV string
    buf = IOBuffer()
    CSV.write(buf, df, header=false)
    data = String(take!(buf))

    # Send the COPY statement and check that the server entered COPY_IN mode
    result = execute(conn, "COPY $table FROM STDIN (FORMAT csv, HEADER FALSE)")
    status(result) == LibPQ.libpq_c.PGRES_COPY_IN || error("COPY did not enter COPY_IN mode")

    # Send the data
    LibPQ.libpq_c.PQputCopyData(conn.conn, data, Int32(sizeof(data))) == 1 || error("PQputCopyData failed")

    # Signal end-of-data
    LibPQ.libpq_c.PQputCopyEnd(conn.conn, C_NULL) == 1 || error("PQputCopyEnd failed")
end

In the example the whole DataFrame is converted into one big String before sending it to the database, but it could easily be done incrementally by issuing one PQputCopyData() call per chunk of data before calling PQputCopyEnd().
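
For example, a rough sketch of the incremental variant, with the same assumptions as the function above and an arbitrary chunk size of 10,000 rows (this replaces the single PQputCopyData call, after the COPY statement has been sent):

for rng in Iterators.partition(1:nrow(df), 10_000)
    buf = IOBuffer()
    CSV.write(buf, df[rng, :], header=false)
    piece = String(take!(buf))
    LibPQ.libpq_c.PQputCopyData(conn.conn, piece, Int32(sizeof(piece))) == 1 || error("PQputCopyData failed")
end
LibPQ.libpq_c.PQputCopyEnd(conn.conn, C_NULL) == 1 || error("PQputCopyEnd failed")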


Awesome, thanks for working on an example. This is very much the kind of thing I had in mind when I made my previous comment. It seems really unfortunate that we’d have to resort to something like this but at this point it seems like it may well be a limitation of libpq. I’m going to open an issue at LibPQ.jl to see if they are interested in adding something like this to the package.

Issue opened.


Is this still in the documentation? (https://invenia.github.io/LibPQ.jl/stable/)

For bulk insert you should use LibPQ.load!(table, conn, query), where table is anything that implements the Tables.jl interface (which includes DataFrame).

Though as far as I remember it only uses positional arguments instead of named arguments (i.e. the order of the columns rather than their names), so you have to take that into account when writing the query string.
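
For reference, a minimal sketch of that usage; the table mytable, its columns, and the connection string are placeholders, and the $1, $2 parameters are positional, so they must follow the column order of the DataFrame:

using LibPQ, DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
conn = LibPQ.Connection("dbname=postgres")  # assumed connection string

execute(conn, "BEGIN;")
LibPQ.load!(df, conn, "INSERT INTO mytable (a, b) VALUES (\$1, \$2);")
execute(conn, "COMMIT;")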


Will the DataFrame approach be preferable in any way over the copy-in approach from the docs, copyin = LibPQ.CopyIn("COPY libpqjl_test FROM STDIN (FORMAT CSV);", row_strings)? I am currently struggling with LibPQ.CopyIn. I followed the way the docs show to build that iterator, and I am getting UTF-8 errors on some of the CSV strings. Should I just pass in the DataFrame instead?

ERROR: CharacterNotInRepertoire: ERROR:  invalid byte sequence for encoding "UTF8": 0xa0

Both approaches are valid. load! is a way to perform repeated INSERTs with a list of argument sets, and since it's a normal INSERT (just repeated for each argument set) you can use all the functionality available (for example Postgres UPSERT).

COPY is a Postgres extension to copy data FROM files TO tables and vice versa. COPY is very fast, but the file has to have the exact structure of the table (especially if it’s binary or CSV without headers).
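
For illustration, both directions can be issued through LibPQ as plain statements; the table name and file paths are placeholders, the paths must be readable or writable by the server's OS user, and server-side file COPY typically requires superuser privileges:

execute(conn, "COPY mytable FROM '/tmp/mytable.csv' WITH (FORMAT csv, HEADER);")
execute(conn, "COPY mytable TO '/tmp/mytable_out.csv' WITH (FORMAT csv, HEADER);")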

In this case it's not using a file directly, though (I honestly never tried using it without a file). Either way, it's probably a general encoding error not related to LibPQ:

https://stackoverflow.com/questions/4867272/invalid-byte-sequence-for-encoding-utf8

And to answer which is better: if your data is already in a file and has the same schema as the table, use COPY (it's faster and more scalable; I use it to copy millions of lines and it takes a few seconds). If the data is in memory or you need a more customized insertion, use load!.


Thanks!
I set the encoding with psql=# copy tmp from '/path/to/file.csv' with delimiter ',' csv header encoding 'windows-1251';
following this answer https://stackoverflow.com/a/23794054/410975, and then it worked. I had tried iconv, but it looks like I did not choose the right options. Now everything is loading and copying, etc. :)


I am having the following issues:

  1. I can't use COPY from a file at the moment due to permissions (this will probably change soon).
  2. When using COPY from STDIN, the table is unchanged and nothing happens. I am using
data = readlines(filepath)
copyin = CopyIn("COPY test FROM STDIN (FORMAT CSV, HEADER);", data)
execute(conn, copyin)
  3. When using load! it takes forever!
vals = mapreduce(x -> "\$$x",
                 (x, y) -> "$x, $y",
                 1:ncols)
statement = "insert into $name values ($vals);"
execute(conn, "begin;")
load!(data, conn, statement)
execute(conn, "commit;")

COPY from a server-side file is read by the PostgreSQL server process (usually running as the postgres OS user), so I suspect the file must be readable by that user and located in a directory it can access.

Did you commit the transaction after running the copy command? From your code above it does not look like it.
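
A minimal sketch of what that might look like, reusing conn, the test table, and filepath from your post above. It also keeps the trailing newline on each row (readlines strips it by default, and without it COPY sees one giant line, which might be part of why nothing happened):

data = readlines(filepath; keep=true)   # keep=true preserves the trailing '\n' on each row
copyin = LibPQ.CopyIn("COPY test FROM STDIN (FORMAT CSV, HEADER);", data)
execute(conn, "BEGIN;")
execute(conn, copyin)
execute(conn, "COMMIT;")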

I got superuser access to the new PostgreSQL server, so I was able to use the COPY command passing a file and it worked great.

For the COPY with STDIN, is it missing something to commit the transaction? I believe I copied those lines from the documentation.