(De-)Serialize N-dimensional arrays in julia

BambOoxX · May 27, 2021, 12:31pm

Hello ! Would anyone know if there is currently a julia serialization package that supports Matrices, or more generally n-dimensional Arrays ?
I am currently using MsgPack.jl, but the specification does not support this natively. In the same manner, is there one that supports complex numbers ?
There again, I tried something beyond the scope of the specification (see this PR https://github.com/JuliaIO/MsgPack.jl/pull/45), but this seems more like cheating than really solving the issue at the source.

stevengj · May 27, 2021, 1:33pm

What are you trying to serialize to? A file, or …?

BambOoxX · May 27, 2021, 2:20pm

At the moment, I am trying to serialize standard objects such as scalar strings or numbers (except complex ones ^^) and vectors contained in structs.
The most complex object I can serialize at the moment is a set of nested vectors, so a Vector{Vector{<something>}}, which seems to be handled quite well.

The highest level object in my program is a mutable struct containing immutable structs which is used as a container and is updated each time a new instruction is passed to this program.

Ideally I want to be able to pass a complete field of this struct to the serializer.

Peter_Adelman · May 27, 2021, 2:34pm

If you are serializing them for storage, why not use JLD2.jl?

BambOoxX · May 27, 2021, 2:47pm

I never said I am trying to store them

I am actually using this for some IPC between julia and another program.

foobar_lv2 · May 27, 2021, 3:56pm

If you need n-dimensional arrays of numbers only:

Consider not explicitly (de-)serializing at all: import Mmap. In other words, let your operating system kernel deal with the problem.
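As a minimal sketch of what that could look like, assuming a file path both programs agree on (here a hypothetical temporary file) and a made-up header of two Int64 dimensions followed by the raw Float64 data:

```julia
using Mmap

# Hypothetical scratch file standing in for a path both processes agree on.
path = tempname()

# Writer side: store the dimensions as a small header, then the raw data.
A = reshape(collect(1.0:6.0), 2, 3)
open(path, "w") do io
    write(io, Int64(size(A, 1)), Int64(size(A, 2)))
    write(io, A)
end

# Reader side: read the header, then map the rest of the file as a matrix.
# Mmap.mmap starts mapping at the current position of the stream.
B = open(path, "r") do io
    m = read(io, Int64)
    n = read(io, Int64)
    Mmap.mmap(io, Matrix{Float64}, (m, n))
end
```

The header layout is an assumption made for this example only; any reader in another language has to follow the same layout and Julia's column-major element order.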

BambOoxX · May 27, 2021, 4:05pm

I also thought about memory-mapping to perform this,
I saw in Mmap help that

In practice, consider encoding binary data using standard formats like HDF5 (which can be used with memory-mapping).

so HDF5 seems to be a good way to do this (both advised and available in a lot of languages).
However, I have never done any memory-mapping, even less between different processes.
If you have any advice on this, I am all ears !

foobar_lv2 · May 27, 2021, 4:36pm

The concept of “standard format for binary data” is somewhat outdated: Today every relevant CPU uses identical little-endian encoding. Especially if you don’t need your binary data for long-term storage or design protocols for long-term use, you can just go with mmap. (future you who needs to read these files in 30 years will hate you for this)

The more interesting and difficult problem that formats like hdf5 solve is that you sometimes need containers, i.e. cannot afford to have a single file per array. Also, hdf5 is self-describing (some ascii names, array dimensions and element types, etc); if you get into the situation where there is just some binary file, and nobody alive remembers what this is supposed to contain (only that it is supposedly important), then you are in for a bad time.

If you want to use this for IPC, consider that there are afaiu no ready-made locks/atomics in base/stdlib that are happy to live inside mmaped files (i.e. for having cross-process locks / atomic updates).

BambOoxX · May 27, 2021, 6:50pm

To give you a bit of background regarding my situation :

I am an engineer mostly using my computer for linear algebra computations, so it is likely that anything I do will return matrices in some form of container
My end goal is to communicate between julia and another program (an existing GUI that I cannot modify extensively)
I tried to
- embed libjulia in the calling program → far too complicated to be of any relevance
- use serialization over pipes to transfer data → fairly complicated but doable for someone not trained in general computing but AFAIK lacks some math objects like matrices or complex numbers
- write/read to a file on disk, but it is far less clean than working in memory of course

Regarding your previous remarks, in the end, I do not care much how I can achieve this communication, as long as it is relatively understandable, and maintainable. I do not mind implementing my fair share of stuff (as I already did for MsgPack), but making it my main occupation or starting this kind of thing from scratch is simply out of the scope.
I think that HDF5 could be useful because implementations to read it are broadly available, and that I need to store a lot of different data so mapping only one array is irrelevant here.

Concerning your remark that there are no ready-made locks/atomics that are happy to live inside mmaped files:

I am sorry, but you will have to translate that for me ^^, if you don’t mind :).

ToucheSir · May 27, 2021, 8:04pm

HDF5 isn’t really designed as a wire format. Generally one would write some data to disk using hdf5 and read it back later, but it seems like you need more “real time” communication. Apache Arrow via Arrow.jl would be the ideal approach, but AFAICT the latter hasn’t implemented N-D array support just yet.

foobar_lv2 · May 27, 2021, 8:28pm

Regarding my comment on atomics and locks: Suppose program A and program B both access some array M at the same time; and A writes somewhere and B reads (or writes) the same position. This is called a data race. Data races need to be handled, on pain of unpredictable results. This is the same situation as when you have a multi-threaded program.

Now, many tools for handling data races in multi-threaded julia are unavailable when you are in the setting where multiple processes share a mapped array. Some of these (e.g. julia locks) are unavailable for good reasons (how would some C# code understand how various julia synchronization primitives are working? Also, at some point you may need to involve the operating system, in order to wait/notify on locks).

Other synchronization primitives are simply missing in the standard library, namely atomics on array elements. If you need them, you will need to copy-paste-modify the relevant Base/Threads code.
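For reference, this is roughly what the existing scalar atomics look like inside a single multi-threaded julia process. It is only a sketch of the kind of tool meant above; it does not work across processes and does not apply to elements of a mapped array:

```julia
using Base.Threads

# A single shared counter that several threads update concurrently.
counter = Atomic{Int}(0)

@threads for i in 1:1000
    # atomic_add! performs the read-modify-write as one indivisible step,
    # so no increment is lost even when threads collide.
    atomic_add!(counter, 1)
end

total = counter[]
```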

If your setting is that both programs don’t run concurrently, then the above point is moot.

If you need to share many arrays of binary data, and both processes are running concurrently, then a possibility is to do what SharedArrays is doing: have some connection, whether pipe or socket, have anonymous mappings and then use shm_open to grab them on the other end. That depends on whether your other program has sane access to such operating system goodies (and I guess on what OS you’re running – I am assuming linux).

What language is your GUI tool written in?

BambOoxX · May 27, 2021, 8:54pm

I understood that data races are very bad indeed ! Currently my mechanism is using a Sockets.PipeServer for the input and one for the output, so the files in this case could very well be readable on one side and writeable on the other. Your point is very informative though, I won’t refuse that !

The GUI is supported by .NET so it is mostly made of VB, but my current tests are using C# (for reasons beyond the current scope)

jzr · May 27, 2021, 8:58pm

Arrow tensor might do what you want.

https://arrow.apache.org/docs/format/Other.html

BambOoxX · May 27, 2021, 9:15pm

@jzr @ToucheSir Arrow could be interesting too, but as with MsgPack, tensors are not available in julia at the moment.

For Arrow, it is the julia implementation that lacks them; for MsgPack it’s the specification. So the first one should become available sooner, though that is not certain.

EDIT : Looking at the JuliaData/Arrow.jl repo, I stumbled upon this https://github.com/JuliaData/Arrow.jl/issues/125.

stevengj · May 27, 2021, 10:48pm

You can always just call vec(A) to flatten the array into a 1d array, and also pass the dimensions size(A) as a separate part of the message, then call reshape(B, sizes...) on the other side to reshape back into an N-dimensional array.
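A minimal sketch of that round trip, with a named tuple standing in for whatever message structure the serializer actually carries (the field names dims and data are made up for this example):

```julia
# Sender: flatten the array and keep its dimensions alongside the data.
A = [1.0 2.0 3.0; 4.0 5.0 6.0]
message = (dims = collect(size(A)), data = vec(A))

# Receiver: rebuild the N-dimensional array from the two parts.
B = reshape(message.data, message.dims...)
```

Note that vec(A) lists the elements in julia's column-major order, so the program on the other side has to agree on that ordering.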

BambOoxX · May 28, 2021, 4:59am

Actually, the current implementation I use on C# performs something like this.
A matrix A appears to be serialized as [size(A,1),size(A,2),vec(A)] which is rather logical.
However, this seems to be unsafe in terms of data handling, because, if I want to de-serialize this in julia, how can I tell the difference between a list containing two integers and a vector, and a matrix ?

Also, such serialization protocols often offer so-called extension types, which can be used to add capabilities to the implementations. But for such low-level objects as matrices it seems just like reinventing the wheel…

gustaphe · May 28, 2021, 5:45am

No matter how you serialize, you will need to somehow communicate how to deserialize. How about serialization(A) = [ndims(A), size(A)..., vec(A)...]?

edit: Unfortunately vec(::Number) doesn’t exist, so you would need to make serialization(A::Number) = [0, A] or something.
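Spelled out as a sketch, with a hypothetical inverse function deserialization that is not part of the suggestion above:

```julia
# Flat encoding: [ndims, size..., values...]. All entries are promoted to a
# common element type, so for a Float64 array the header is stored as floats.
serialization(A::AbstractArray) = [ndims(A), size(A)..., vec(A)...]
serialization(x::Number) = [0, x]  # scalars: zero dimensions, then the value

# Hypothetical inverse: read the header back and rebuild the array.
function deserialization(v::AbstractVector)
    n = Int(v[1])                 # number of dimensions
    n == 0 && return v[2]         # scalar case
    dims = Int.(v[2:n+1])         # the n entries after the first are the sizes
    return reshape(v[n+2:end], dims...)
end

A = [1.0 2.0 3.0; 4.0 5.0 6.0]
encoded = serialization(A)
decoded = deserialization(encoded)
```

Because the header shares the element type of the data, this only round-trips cleanly when the sizes are exactly representable in that type, which is one reason a separate dimensions field may be preferable.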

BambOoxX · May 28, 2021, 6:47am

Though I technically agree with you on the possibility to perform the serialization this way, my goal is not to write my own protocol, but rather to use an existing, broadly available protocol so that I do not need to start from scratch, hence my first question.

Creating a protocol would surely have its own academic interest, but I just can’t afford to do so.
This will be a means to an end, I can’t spend more time on it than on the actual computing I have to develop.

ericphanson · May 28, 2021, 10:36am

what about a representation more like:

struct FlatArray{T <: Tuple, V <: AbstractVector}
    size::T
    values::V
end

? Since one can already configure custom (de)-serialization of structs with MsgPack.
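A sketch of how such a wrapper could convert to and from ordinary arrays. The helper names to_flat and from_flat are hypothetical, and the MsgPack configuration itself is left out:

```julia
struct FlatArray{T <: Tuple, V <: AbstractVector}
    size::T
    values::V
end

# Hypothetical helpers converting between arrays and the flat representation.
to_flat(A::AbstractArray) = FlatArray(size(A), vec(A))
from_flat(f::FlatArray) = reshape(f.values, f.size)

A = [1 2; 3 4; 5 6]
f = to_flat(A)
B = from_flat(f)
```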

BambOoxX · May 28, 2021, 5:08pm

If I were using this for julia–julia communications, this would probably be a good scenario too. But again, I have to communicate with an external program, so everything added comes with twice the cost.

Also, suppose I have a struct that is convenient for my application that I want to serialize which is typed as

struct TestStruct
    field1::Vector{Vector{Float64}}
    field2::Vector{Vector{ComplexF64}}
    field3::Vector{Matrix{ComplexF64}}
end

field1 could be (de-)serialized with the default behavior, but what about the others ? I don’t see how I can do this without modifying the current MsgPack.jl quite a bit.
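For the complex part alone, one possible workaround (an assumption on my side, not something MsgPack.jl does by default) is to send each complex number as two consecutive real values and pair them up again on the other side:

```julia
z = ComplexF64[1 + 2im, 3 - 4im]

# Sender: view each complex number as two consecutive Float64 values (re, im).
flat = collect(reinterpret(Float64, z))

# Receiver: pair the values up again into complex numbers.
back = collect(reinterpret(ComplexF64, flat))
```

The receiving program still has to know that the vector holds interleaved real and imaginary parts, so this only moves the problem into a convention between both sides.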
