How to serialize data to, and deserialize from, Postgres?

I would like to store some Julia data in a PostgreSQL bytea column and then query the database and convert the result back to its original form. I can store the data, but can’t seem to convert the query response correctly. Below is an example of saving a vector of integers (although later I would like to store complex dicts and named tuples).

using LibPQ, Serialization

io = IOBuffer()
Serialization.serialize(io, [1, 2, 3])
# conn is a connection to the database
LibPQ.load!(
    (data = [io],),  # note the trailing comma: a one-column NamedTuple, i.e. a Tables.jl table
    conn,
    "INSERT INTO mytable (data) VALUES (\$1);",
)

That works fine. And I can retrieve it, but can’t get the vector back.

using DataFrames

result = LibPQ.execute(
    conn,
    "SELECT * FROM mytable;"
)
df = DataFrames.DataFrame(result)
data = df[1, "data"]
io = IOBuffer(data)
seek(io, 0)

# the following does not work as I expect
test = Serialization.deserialize(io)

In the last line, test is of type UnionAll; that is, it is a type, not a vector of integers. I’ve tried numerous variations on the above, and nothing works correctly. data is of type Vector{UInt8} and contains, e.g., [0x49, 0x4f, 0x42, …]. Trying deserialize(data) directly causes an error.

Any ideas on how to retrieve the original vector?

Have you checked that io.data on the deserialization side actually contains what you wrote on the serialization side?

Maybe serialization depends on the Julia version, but when I try your code, my io.data on the serialization side does not start with [0x49, 0x4f, ...] but with [0x37, 0x4a, 0x4c, ...].
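For reference, here is the minimal round trip with no database involved; this is the baseline that should work:

using Serialization

io = IOBuffer()
serialize(io, [1, 2, 3])
bytes = take!(io)        # Vector{UInt8}, starting 0x37, 0x4a, 0x4c (the "7JL" header)
deserialize(IOBuffer(bytes)) == [1, 2, 3]  # true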

You are trying to serialize an array [1,2,3] to io (which is ok), and then put it into PostgreSQL (possibly not advised, but ignoring that), but you wrap io in another array. Did you perhaps mean:

    (data = io),

or even likelier:

    (data = io.data),

Note:

julia> typeof(io.data)
Vector{UInt8} (alias for Array{UInt8, 1})

julia> [io] == io
false

julia> typeof([io]) == typeof(io)
false

julia> typeof(io.data) == typeof(io)
false

A single Int64 takes 8 bytes in memory (and, I suppose, when serialized); that’s more than the text string “[1,2,3]”. So if you really need to store such an array, and many values are 1 or 2 decimal digits, it may help to use a simple, human-readable text format (your call whether to skip the spaces). Another option is JSON (or some other format, e.g. Arrow or HDF5, which allow compression, unlike, it seems, Julia’s default serialization). JSON is at least a standard, so your database would be less Julia-specific; it’s text-based, and if you have lots of data, PostgreSQL’s JSONB binary variant of it could help you.
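As a minimal sketch of that JSON route, assuming JSON3.jl (any JSON package would do):

using JSON3

v = [1, 2, 3]
s = JSON3.write(v)                  # "[1,2,3]", 7 bytes vs. 3 × 8 bytes of raw Int64s
v2 = JSON3.read(s, Vector{Int64})   # parse back into a concrete Julia type
@assert v2 == v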

One thing to be aware of is that 1 is Int64 on 64-bit machines only; thus [1,2,3] will be Vector{Int64} there, but would be Vector{Int32} on a 32-bit machine.

What you could do is use Int64[1,2,3] to force that type regardless of platform, or alternatively e.g. Int32[1,2,3] if that’s enough for you; maybe even Int16 is ok?
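For example:

julia> typeof(Int64[1, 2, 3])
Vector{Int64} (alias for Array{Int64, 1})

julia> typeof(Int32[1, 2, 3])
Vector{Int32} (alias for Array{Int32, 1})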

Note that whatever you do, it will be safe to store (it’s just a question of what you get back and whether it’s what you want):

https://docs.julialang.org/en/v1/stdlib/Serialization/

Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original, […]

An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a Serializer and use it as the first argument to serialize instead. See also Serialization.writeheader.

The data format can change in minor (1.x) Julia releases, but files written by prior 1.x versions will remain readable. The main exception to this is when the definition of a type in an external package changes. [stuff on functions you can likely ignore. Also, on anonymous functions:] Anonymous functions require special care: because their names are automatically generated, minor code changes can cause them to be renamed. Serializing anonymous functions should be avoided in files intended for long-term storage.

In some cases, the word size (32- or 64-bit) of the reading and writing machines must match. In rarer cases the OS or architecture must also match, for example when using packages that contain platform-dependent code.

I was a bit confused why word size must match in some cases, I think it might be only if you omit the header:
https://docs.julialang.org/en/v1/stdlib/Serialization/#Serialization.writeheader

Write an identifying header to the specified serializer. The header consists of 8 bytes as follows:

Offset  Description
0       tag byte (0x37)
1-2     signature bytes "JL"
3       protocol version
4       bits 0-1: endianness: 0 = little, 1 = big
4       bits 2-3: platform: 0 = 32-bit, 1 = 64-bit

Since that info is stored in the header, I would think you’re ok. If you skip the header, you skip both pieces of information. If you really need to conserve space, you could skip that tiny 8-byte header.
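If you did want to skip it, a sketch following what the docs describe (construct a Serializer and pass it as the first argument):

using Serialization

io = IOBuffer()
serialize(Serializer(io), [1, 2, 3])   # no 8-byte "7JL…" header is written
bytes = take!(io)

# plain deserialize should still work, since the header is optional on read
deserialize(IOBuffer(bytes))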

That’s the tag byte and then the upper-case signature JL, so it seems ok for you to store that and get it back.

The [0x49, 0x4f, 0x42, …] you reported is wrong, though: that’s ASCII for “IOB”, I guess the prefix of IOBuffer. I suspect what got stored is the printed representation of this:

IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=36, maxsize=Inf, ptr=37, mark=-1)

and maybe no data at all…
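Indeed, those first bytes decode to ASCII text:

julia> String(UInt8[0x49, 0x4f, 0x42])
"IOB"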

Thanks @nandoconde. I was just giving an example of the data, and yes, I did check. I can serialize to an IOBuffer and then deserialize from the same buffer to get back the original data. The problem occurs when I save the buffer to Postgres and then read it back.

Thanks @Palli. Good information, but I’m not sure you addressed my problem directly; or if you did, I’m not seeing the solution. Regarding the vector of integers, I’m just using that as an example to serialize and then deserialize, to get things working properly. Also, given a buffer io, the problem is not solved if I store io.data in Postgres rather than io, or if I store take!(io), or seek(io, 0). I use [io] (a vector) in the LibPQ.load! query because that is the required format for the load command. Also, I can serialize a vector (or dict, etc.) and then deserialize it to get the data back; the problem occurs only when I store the serialized data in Postgres, then query it out and deserialize. Also, using BSON.jl rather than Serialization does not fix the problem.

I see now a Vector is always expected by PostgreSQL, and here then a Vector of Vectors (of bytes, i.e. UInt8, not of IOBuffers), so it seems in that case you should do:

LibPQ.load!(
    (data = [take!(io)],),  # not the invasive (data = [io.data],), though that may be equivalent

i.e. io.data is the data,

julia> length(io.data)
36

as opposed to:

julia> length([io])
1

since that’s giving you 1 (only) element of:

julia> [io]
1-element Vector{IOBuffer}:
 IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=36, maxsize=Inf, ptr=37, mark=-1)

julia> [io][1]
IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=36, maxsize=Inf, ptr=37, mark=-1)

The reason that load! expects a Vector is not that you have a Vector of bytes, but that it supports mass INSERTs: it loops through and executes the statement once for each entry in the Vector, then commits.
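A sketch of that mass-insert usage, with two hypothetical rows (bytes1 and bytes2 are each a Vector{UInt8}, e.g. from take!):

LibPQ.load!(
    (data = [bytes1, bytes2],),  # one INSERT execution per element
    conn,
    "INSERT INTO mytable (data) VALUES (\$1);",
)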

So in your case it expects (I think, or at least conceptually wants) this:

julia> typeof([io.data])
Vector{Vector{UInt8}} (alias for Array{Array{UInt8, 1}, 1})

not the same as:
julia> typeof(io.data)
Vector{UInt8} (alias for Array{UInt8, 1})

nor:
julia> typeof([io])
Vector{IOBuffer} (alias for Array{Base.GenericIOBuffer{Array{UInt8, 1}}, 1})

That is a Vector of IOBuffers (which you plausibly might have expected to work), and I was wondering why it worked at all. Why did LibPQ allow it? Should it be disallowed (by LibPQ), or simply made (by LibPQ) to work too, i.e. do the same as the first form?

Thanks @Palli. If you are suggesting that I use [io.data], [[io]], or [[io.data]] in the LibPQ.load! query, none of these solves the problem.

Also, I can see that what is taken out of Postgres is not the same as what is put in. If the original data is Dict(:a => [1,2,3]), then the io buffer going in is

IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=72, maxsize=Inf, ptr=73, mark=-1)

and io.data is

UInt8[0x37, 0x4a, 0x4c, 0x14, 0x04, 0x00, 0x00, 0x00, 0x35, 0x10, 0x01, 0x04, 0x44, 0x69, 0x63, 0x74, 0x1f, 0x4e, 0x9e, 0x44, 0x02, 0x00, 0x00, 0x00, 0x00, 0x01, 0x10, 0x50, 0x1f, 0x4e, 0x9b, 0x44, 0x02, 0x00, 0x00, 0x00, 0x00, 0x08, 0xe0, 0x01, 0x00, 0x00, 0x00, 0x66, 0x15, 0x00, 0x08, 0xe2, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]

On the other hand, if I send [io.data] to Postgres in the load command, a buffer of the data returned from Postgres is

IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=176, maxsize=Inf, ptr=1, mark=-1)

and the buffer data is

UInt8[0x7b, 0x35, 0x35, 0x2c, 0x37, 0x34, 0x2c, 0x37, 0x36, 0x2c, 0x32, 0x30, 0x2c, 0x34, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x35, 0x33, 0x2c, 0x31, 0x36, 0x2c, 0x31, 0x2c, 0x34, 0x2c, 0x36, 0x38, 0x2c, 0x31, 0x30, 0x35, 0x2c, 0x39, 0x39, 0x2c, 0x31, 0x31, 0x36, 0x2c, 0x33, 0x31, 0x2c, 0x37, 0x38, 0x2c, 0x31, 0x35, 0x38, 0x2c, 0x36, 0x38, 0x2c, 0x32, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x31, 0x2c, 0x31, 0x36, 0x2c, 0x38, 0x30, 0x2c, 0x33, 0x31, 0x2c, 0x37, 0x38, 0x2c, 0x31, 0x35, 0x35, 0x2c, 0x36, 0x38, 0x2c, 0x32, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x38, 0x2c, 0x32, 0x32, 0x34, 0x2c, 0x31, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x31, 0x30, 0x32, 0x2c, 0x32, 0x31, 0x2c, 0x30, 0x2c, 0x38, 0x2c, 0x32, 0x32, 0x36, 0x2c, 0x31, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x32, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x33, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x2c, 0x30, 0x7d]

Lastly, I get something slightly different if I send [io] to Postgres, rather than [io.data]. In that case, a buffer of the returned data is

IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=122, maxsize=Inf, ptr=1, mark=-1)

I don’t know why the data returned is not the same as the data put into Postgres.

I’m not; rather, I suggest data = [take!(io)], which gives you the following ((data = [io.data]) is possibly also ok):

julia> take!(io)
36-element Vector{UInt8}:
 0x37
[header]
 0x15
 0x00
 0x08
 0xe2
[that was likely encoding Vector{Int64} and/or the length, then the three values of the array. And I would start with getting this to work, not a Dict or anything more complex.]

You want to get the same thing back: same length and same type. The writable=false vs. true for the IOBuffer seems ok, but not the different size=176.

I don’t think the problem is deserialize but what you hand to it. It might be because of PostgreSQL, but I note you put the result into a DataFrame first. You could try checking whether that is the problem: just put the data in a DataFrame and try to get it back out.
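Something like this minimal check, leaving PostgreSQL out entirely:

using DataFrames, Serialization

io = IOBuffer()
serialize(io, [1, 2, 3])
bytes = take!(io)

df = DataFrame(data = [bytes])           # put the raw bytes in a DataFrame
out = df[1, "data"]
deserialize(IOBuffer(out)) == [1, 2, 3]  # if true, the DataFrame is not the problem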

You are hopefully inserting the data correctly (i.e. LibPQ converts the binary to the text hex format), but you might be getting it back as encoded text, which wouldn’t be too hard to work around:

The SQL standard defines a different binary string type, called BLOB or BINARY LARGE OBJECT. The input format is different from bytea, but the provided functions and operators are mostly the same. […]

8.4.1. bytea Hex Format
The “hex” format encodes binary data as 2 hexadecimal digits per byte, most significant nibble first. The entire string is preceded by the sequence \x (to distinguish it from the escape format). In some contexts, the initial backslash may need to be escaped by doubling it (see Section 4.1.2.1). For input, the hexadecimal digits can be either upper or lower case, and whitespace is permitted between digit pairs (but not within a digit pair nor in the starting \x sequence). The hex format is compatible with a wide range of external applications and protocols, and it tends to be faster to convert than the escape format, so its use is preferred.

Example:

SET bytea_output = 'hex';

SELECT '\xDEADBEEF'::bytea;
   bytea
------------
 \xdeadbeef

I recall now that BLOB support was a major hassle (for the Informix database, and PHP, way before Julia). bytea_output has another option besides hex (which is the default), but neither of them is binary. I’m not sure working with BLOBs, getting data in and out, is standardized; at least bytea is PostgreSQL-specific. You would expect the data to expand by 100% for that hex format, plus the \x prefix, but in your case it is seemingly a little more.

Ideally this decoding of the (hex) data would be transparent, but I’m just not sure it’s attempted. What does the DataFrame tell you? Does it show a non-binary String?
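If it does show hex text, decoding is simple: strip the leading \x and use Base’s hex2bytes, e.g.:

julia> hex2bytes("deadbeef")
4-element Vector{UInt8}:
 0xde
 0xad
 0xbe
 0xef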

I did find something in LibPQ.jl and maybe you need to call parse or pqparse:
https://github.com/search?q=repo%3Aiamed2%2FLibPQ.jl%20PQunescapebytea&type=code

LibPQ.jl uses LibPQ_jll, which wraps the (official) C library, I believe.

I also see, from just this year (among other recent changes), for libpqxx, the official C++ language binding for PostgreSQL:

2023-01-12: At last! Faster than C.

First things first: get this to work. But if someone is reading this, would it be better for LibPQ_jll to wrap the C++ library instead? Maybe not so much for this issue, just for speed…

Thanks @Palli, I tried take!(io), but same problem. Does anyone else have suggestions for how to put a serialized Julia object in a bytea column of Postgres and then query and deserialize it? There must be some trick, perhaps encoding the Postgres input or decoding the Postgres output in some way. Has anyone stored a png or other image data (or another binary object) in a Postgres bytea column and then decoded it into a Julia object? I know it’s usually not best practice to store a Julia object this way, but it would be handy to have the possibility in some cases.

hex may be the default or not (see the link), but setting bytea_output to escape would be faster. Hex is slightly easier to decode, though, so you could try that first. Either way the value will be plain text and will need decoding to actual bytes, and it seems LibPQ doesn’t do it for you.
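Putting it together, a hedged end-to-end sketch, assuming the bytea value comes back as hex text of the form "\x…" (if it already arrives as a Vector{UInt8}, the decoding step is skipped):

using LibPQ, DataFrames, Serialization

result = LibPQ.execute(conn, "SELECT data FROM mytable;")
df = DataFrames.DataFrame(result)
raw = df[1, "data"]

# if raw is a String like "\xdead…", drop the 2-character "\x" prefix and decode
bytes = raw isa Vector{UInt8} ? raw : hex2bytes(String(raw)[3:end])
Serialization.deserialize(IOBuffer(bytes))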

PostgreSQL 16 was just released, in case you want to use its JSON or JSONB type; I see the latest version includes “more SQL/JSON syntax” and many other improvements.
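A sketch of that route, assuming JSON3.jl and a hypothetical table mytable_json with a data JSONB column:

using LibPQ, JSON3, DataFrames

LibPQ.execute(
    conn,
    "INSERT INTO mytable_json (data) VALUES (\$1::jsonb);",
    [JSON3.write(Dict(:a => [1, 2, 3]))],
)

# reading back: the JSONB value arrives as a String; parse it with JSON3.read
df = DataFrames.DataFrame(LibPQ.execute(conn, "SELECT data FROM mytable_json;"))
obj = JSON3.read(df[1, "data"])  # fields accessible as e.g. obj.a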