Generating "INSERT INTO" SQL statement

I have either a LibPQ.Result or a DataFrame as input, and as output I need to generate an SQL statement of the form:

INSERT INTO table VALUES
(v1,v2,v3),
(v4,v5,v6);

I cannot use LibPQ.load! because I don’t want to execute the statement, only generate it. The column types are limited in number, but still quite diverse:

uuid
string/varchars/text
integer
double
json
timestamp
interval
boolean

There will be missing values here & there, because some columns are nullable.

What would be the best/most efficient way to achieve this?

Thank you.

This is the function I came up with, but I am not certain there isn’t a better, more efficient way:

function generateInsertSql(conn::LibPQ.Connection, schemaName::String, tableName::String, resultSqlFullTableName::String)::String
    try
        executeResult = execute(conn, "SELECT * FROM $schemaName.$tableName")
        if !isempty(executeResult)
            frame = DataFrame(executeResult)
            numOfRows = size(frame, 1)
            iob = IOBuffer()
            write(iob, "\nINSERT INTO $resultSqlFullTableName VALUES \n")
            for n in 1:numOfRows
                write(iob, "(")
                bFirst = true
                for x in propertynames(frame)
                    if bFirst
                        bFirst = false
                    else
                        write(iob, ",")
                    end
                    v = frame[n, x]  # index the cell once instead of frame[!, x][n] per check
                    if ismissing(v)
                        write(iob, "NULL")
                    elseif v isa Int32 || v isa Float64 || v isa Bool
                        write(iob, string(v))
                    elseif v isa String
                        write(iob, "'$(replace(v, "'" => "''"))'")  # escape single quotes
                    elseif v isa DateTime
                        write(iob, "'$(Dates.format(v, dateformat"yyyy-mm-dd HH:MM:SS.sss"))'")
                    elseif v isa Date
                        write(iob, "'$(Dates.format(v, dateformat"yyyy-mm-dd"))'")
                    elseif v isa Time
                        write(iob, "'$(Dates.format(v, dateformat"HH:MM:SS.sss"))'")
                    elseif v isa Dates.CompoundPeriod
                        write(iob, compoundPeriodToString(v))
                    else
                        println(typeof(v))
                        write(iob, string(v))  # fall back to string() so write() cannot fail on odd types
                    end
                end
                write(iob, n == numOfRows ? ");\n" : "),\n")
            end
            return String(take!(iob))
        end
    catch exc
        println(exc)
        println("Didn't generate INSERT SQL for table $tableName.")
    end
    return ""
end

Could this be made better, perhaps using join or CSV.write, or something like that?
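For comparison, here is a sketch of a join-based variant. `toSqlLiteral` is a hypothetical helper (one method per value type); only a few representative types are covered here, so the real table would need methods for the remaining types as well:

```julia
using DataFrames, Dates

# Hypothetical helper: map one Julia value to a SQL literal (subset of types only).
toSqlLiteral(::Missing) = "NULL"
toSqlLiteral(v::Union{Integer,AbstractFloat,Bool}) = string(v)
toSqlLiteral(v::AbstractString) = "'" * replace(v, "'" => "''") * "'"   # escape single quotes
toSqlLiteral(v::DateTime) = "'" * Dates.format(v, dateformat"yyyy-mm-dd HH:MM:SS.sss") * "'"

# Build the whole statement with join instead of manual comma bookkeeping.
function insertSql(frame::DataFrame, fullTableName::AbstractString)
    rows = (join(map(toSqlLiteral, collect(row)), ",") for row in eachrow(frame))
    return "INSERT INTO $fullTableName VALUES \n(" * join(rows, "),\n(") * ");\n"
end
```

For example, `insertSql(DataFrame(a=[1, missing], b=["x'y", "z"]), "schema.table")` yields a two-row statement containing `(1,'x''y')` and `(NULL,'z')`.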

Your if isa... elseif isa... block begs for multiple dispatch


It’s basically union splitting here, so it is fine.

Fine in terms of what? Are you saying multiple dispatch wouldn’t work in this case?

It will work, but you wouldn’t get any benefit from it: the types are only known at runtime, so dynamic dispatch would be slower than the current implementation. But it’ll look better and the design will be more flexible, that’s true.

Thanks. When I made that suggestion I was concerned about cleanliness and maintainability. I wasn’t aware of the performance implications.


Can you share a minimal reproducible example with the input and output? I would be happy to share my implementation with those.

I will provide one small example.
Let’s have input table public.use_cases, here is the structure:

CREATE TABLE public.use_cases
(
    id uuid NOT NULL DEFAULT uuid_generate_v4(),
    name character varying(150) COLLATE pg_catalog."default" NOT NULL,
    description text COLLATE pg_catalog."default",
    input text COLLATE pg_catalog."default",
    output text COLLATE pg_catalog."default",
    business_value text COLLATE pg_catalog."default",
    business_kpi text COLLATE pg_catalog."default",
    accuracy_impact integer,
    workspace_id uuid NOT NULL,
    dataset_id uuid,
    is_favorite boolean NOT NULL DEFAULT false,
    default_f_experiment uuid,
    default_ad_experiment uuid,
    created_at timestamp without time zone NOT NULL DEFAULT timezone('utc'::text, now()),
    created_by uuid NOT NULL,
    updated_at timestamp without time zone,
    updated_by uuid,
    business_objective text COLLATE pg_catalog."default",
    CONSTRAINT use_cases_pkey PRIMARY KEY (id)
)

In reality the table also has 6 foreign keys, but those are irrelevant for our example, so I don’t list them above.

You can fill it using SQL:

INSERT INTO public.use_cases VALUES
('ac7e086a-a7d5-4474-8ab4-e6b8c94994c7', 'Edited name', 'Edited description', 'Edited input', 'Edited output', 'test', 'test', 6, '3541ee59-d48e-4aaa-8f38-79554136462d', NULL, true, NULL, NULL, '2021-04-08 13:46:11.549539', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', '2021-04-08 13:46:11.954205', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', 'test'),
('d1197838-17c7-43c9-80f3-91d745ac32a8', 'xyz', NULL, NULL, NULL, NULL, NULL, NULL, '71ecb06c-2555-4f2b-9b99-d1dc9c0876ca', '07b18f7a-35b6-423d-b252-daf26d6e2e1c', false, NULL, NULL, '2021-03-17 13:17:18.743356', 'f31e650c-23b7-4886-a0a1-894c5eee9443', NULL, NULL, NULL),
('d726f6de-bd40-4c96-95c9-7bc080dab46d', '2 - with default exps', 'something else again', 'much data', '', '', '', NULL, '4e8fcfb2-4518-4445-afed-180941a0fd68', NULL, false, NULL, NULL, '2021-03-31 12:24:04.757653', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '2021-04-01 13:19:31.307656', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '');

Then you establish a LibPQ connection conn and call this function as:

generateInsertSql(conn, "public", "use_cases", "schema.table")

And the result should look roughly like this:

INSERT INTO schema.table VALUES
('ac7e086a-a7d5-4474-8ab4-e6b8c94994c7', 'Edited name', 'Edited description', 'Edited input', 'Edited output', 'test', 'test', 6, '3541ee59-d48e-4aaa-8f38-79554136462d', NULL, true, NULL, NULL, '2021-04-08 13:46:11.549539', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', '2021-04-08 13:46:11.954205', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', 'test'),
('d1197838-17c7-43c9-80f3-91d745ac32a8', 'xyz', NULL, NULL, NULL, NULL, NULL, NULL, '71ecb06c-2555-4f2b-9b99-d1dc9c0876ca', '07b18f7a-35b6-423d-b252-daf26d6e2e1c', false, NULL, NULL, '2021-03-17 13:17:18.743356', 'f31e650c-23b7-4886-a0a1-894c5eee9443', NULL, NULL, NULL),
('d726f6de-bd40-4c96-95c9-7bc080dab46d', '2 - with default exps', 'something else again', 'much data', '', '', '', NULL, '4e8fcfb2-4518-4445-afed-180941a0fd68', NULL, false, NULL, NULL, '2021-03-31 12:24:04.757653', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '2021-04-01 13:19:31.307656', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '');

You are probably aware of this, but it is much better to use prepared statements instead of generating these INSERT queries. First, prepared statements are optimized; second, they are more secure. The current approach is vulnerable to SQL injection attacks.
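In LibPQ that could look roughly like the sketch below; `conn` is assumed to be an open connection, `frame` a three-column DataFrame, and the target table is assumed to exist already:

```julia
using LibPQ, DataFrames

# Sketch: insert via a prepared statement; values travel as query parameters,
# so no quoting/escaping is needed and injection is not possible.
stmt = LibPQ.prepare(conn, "INSERT INTO schema.table VALUES (\$1, \$2, \$3)")
for row in eachrow(frame)
    execute(stmt, collect(row))  # one parameterized execution per row
end
```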

We should always remember Bobby Tables


I have some old code that does that, but it didn’t work that well due to having to work around all the sanitization issues (e.g., text that included special characters such as '). I still fail to understand the reason or use case for generating the SQL statement and not executing it, which would allow you to use prepared statements.

I made a test for it using Octo.jl

see the logs of the GitHub Actions run.


I still fail to understand the reason or use case
for generating the SQL statement and not executing it

IMHO: it can be the core of a flexible homemade database tool/script.

  • executing later in time ( backup → restore )
    • just check the postgres pg_dump output.
  • executing in a different ( unconnected? ) SQL database - for security reasons.
    • in this case the target database administrator can validate/test the SQL insert script.
  • executing the output on multiple, different SQL architectures
    • so the “create table” part is not hardcoded … partitioning or special compression parameters can be added later.
  • a flexible data-migration tool …
    • Arrow → DataFrame → plain SQL inserts → any unsupported SQL database
  • etc.

my “hacked” pg_dump output … but this does not solve the DataFrame problem.

# pg_dump -t use_cases --inserts --data-only --rows-per-insert=100 | egrep -v '^(SET|SELECT|--)' | grep .
INSERT INTO public.use_cases VALUES
	('ac7e086a-a7d5-4474-8ab4-e6b8c94994c7', 'Edited name', 'Edited description', 'Edited input', 'Edited output', 'test', 'test', 6, '3541ee59-d48e-4aaa-8f38-79554136462d', NULL, true, NULL, NULL, '2021-04-08 13:46:11.549539', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', '2021-04-08 13:46:11.954205', 'b306ec38-ee19-4af5-9e3a-a36d2c22fe85', 'test'),
	('d1197838-17c7-43c9-80f3-91d745ac32a8', 'xyz', NULL, NULL, NULL, NULL, NULL, NULL, '71ecb06c-2555-4f2b-9b99-d1dc9c0876ca', '07b18f7a-35b6-423d-b252-daf26d6e2e1c', false, NULL, NULL, '2021-03-17 13:17:18.743356', 'f31e650c-23b7-4886-a0a1-894c5eee9443', NULL, NULL, NULL),
	('d726f6de-bd40-4c96-95c9-7bc080dab46d', '2 - with default exps', 'something else again', 'much data', '', '', '', NULL, '4e8fcfb2-4518-4445-afed-180941a0fd68', NULL, false, NULL, NULL, '2021-03-31 12:24:04.757653', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '2021-04-01 13:19:31.307656', '182b080b-8e42-4b85-82f5-e7e87bf8df0f', '');

for inspiration - the pg_dump : PostgreSQL: Documentation: 13: pg_dump syntax

  • prefer COPY over INSERT - for speed reasons
    “Dump data as INSERT commands (rather than COPY ). This will make restoration very slow; it is mainly useful for making dumps that can be loaded into non-PostgreSQL databases. Any error during reloading will cause only rows that are part of the problematic INSERT to be lost, rather than the entire table contents. Note that the restore might fail altogether if you have rearranged column order. The --column-inserts option is safe against column order changes, though even slower.”
  • the INSERT output can be customized; IMHO: something similar would be useful
    • --rows-per-insert=1000
    • --inserts vs --column-inserts

I think pgAdmin already does a lot of this (Python-based, if I’m not mistaken) as part of the GUI experience. I would welcome SQL-generating code, for example to create a table based on a Julia object the way R’s RPostgreSQL does, and that could be used to add new features. It would also be nice for the PostGIS domain, to create the tables and structures for polygons and so on in a user-friendly manner (e.g., like R’s sf does). For inserting records, I would try to go for preparing the statements.

Yes, the reason is similar to what ImreSamu wrote.
We want to export a certain section/portion of our data.
We first copy the required subset to a temporary schema (it is about 10 tables with foreign-key relationships between them), then create SQL from that, including all the data there. The whole schema is then dropped, and the SQL text is encrypted & downloaded.
Such a file can then be uploaded and imported into a different server instance, under a different user, etc.
I could not use pg_dump there, because I only have a LibPQ connection to the database; I cannot invoke the pg_dump process there.

IMHO:

  • This should be the same connection ( user; pw; port; host ), so if you can install the postgresql-client package, it should work … for medium-to-big databases (> 10 GB) the default COPY is better ( and battle-tested ) … but at some extremely paranoid sites ( ~banking ) it is not so easy to install any extra package/tool.
  • in python there is a wrapper package for pg_dump: dump · PyPI
    “Wrapper around psql and pg_dump to make it easier to backup/restore a PostgreSQL database”
    A Julia wrapper is still missing.

on the other hand,

  • a Julia-based text-SQL target ( “insert into”, copy ) would be useful. :+1:

Maybe you can just store everything in CSV (or any other table-storage format) and use prepared statements to load the data into the new tables?
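For the CSV leg of that, a minimal sketch with CSV.jl (by default, missing is written as an empty field and an empty field is read back as missing):

```julia
using CSV, DataFrames

# Round-trip a table with missing values through CSV.
df = DataFrame(id = ["a", "b"], n = [1, missing])
path = joinpath(mktempdir(), "export.csv")
CSV.write(path, df)                                   # missing -> empty field
df2 = CSV.read(path, DataFrame; missingstring = "")   # empty field -> missing
```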

  1. Maybe I misunderstand, but I cannot find anything anywhere about a Julia package named “postgresql-client”.

  2. Using COPY: if I use SQL like this one (an example generated by pg_dump using default options)

COPY some_ids (id, other_id) FROM stdin;
4e2aafd7-c7cc-46de-84ec-dbd8eb8b498d	99890a8b-2579-4030-bf54-0805e7f8e03a
\.

and I execute it, I get this error:

ERROR:  syntax error at or near "4e2"
LINE 2: 4e2aafd7-c7cc-46de-84ec-dbd8eb8b498d 99890a8b-2579-4030-bf54...
        ^
SQL state: 42601
Character: 205

The code as I have it here actually works, but it may not be the most efficient.

sorry; this is a Linux package.

  • sudo apt-get install postgresql-client

this is based on the “PostgreSQL Client Applications”

I don’t know what the Windows package name is; probably you need to run the .exe version, pg_dump.exe

Using COPY: if I use SQL as this one (example generated by pg_dump using default options)
and I execute it, I am getting error:

this is a “PostgreSQL ONLY” command → PostgreSQL: Documentation: 13: COPY

  • so : “There is no COPY statement in the SQL standard.”
  • important: it is a tab-separated file! ( “tab character” )
    • be careful when copy/pasting! some editors replace tabs with 4 spaces!
pg_dump -t use_cases > use_cases_backup.sql

# restore - with piping
cat use_cases_backup.sql | psql 
# restore - as a simple SQL file 
psql -f  use_cases_backup.sql

my backup file contains 51 “tab characters”

grep -o -P '\t' use_cases_backup.sql | wc -l
51

If this works via the psql client ( as a basic debugging tool )
but not with LibPQ … then you need to debug why not.
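As a side note on the COPY error above: LibPQ can run COPY … FROM STDIN through its CopyIn type, so the tab-separated data does not have to be embedded in plain SQL. A sketch, assuming an open connection `conn` and an existing some_ids table:

```julia
using LibPQ

# Sketch: stream tab-separated rows to the server via LibPQ.CopyIn.
row_strings = ["4e2aafd7-c7cc-46de-84ec-dbd8eb8b498d\t99890a8b-2579-4030-bf54-0805e7f8e03a\n"]
copyin = LibPQ.CopyIn("COPY some_ids (id, other_id) FROM STDIN;", row_strings)
execute(conn, copyin)
```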