Human-readable externalization for multi-dimensional arrays

Is there an externalization format that is

  • more or less human-readable in a pinch (e.g. JSON),

  • but supports multi-dimensional arrays (e.g. HDF5),

  • and has a package for Julia?

1 Like

Or, alternatively, a neat little utility that tries to reconstitute vectors of vectors of vectors, etc., into an Array of the appropriate dimensionality.
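For instance, a minimal sketch of such a utility (a hypothetical helper, not a package; assumes the nesting is rectangular):

# Stack vectors of vectors (of vectors, …) into an Array;
# the outermost nesting becomes the last dimension.
nested_to_array(x::Number) = x
function nested_to_array(x::AbstractVector)
    slices = nested_to_array.(x)
    cat(slices...; dims = ndims(first(slices)) + 1)
end

# nested_to_array([[1, 2], [3, 4]]) == [1 3; 2 4]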

1 Like

I once had some researchers who wanted MRI data in a format they could read using Excel (sigh). I used “x”, “y”, “z”, “value” columns to do it.

1 Like

Sorry, but I don’t quite see how this is relevant to my question.

If you want human-readable, you can just do this in CSV format.

Sparse? Then CSV is quite natural.

Perhaps you misunderstood my question; I need a general Array{T,N}.

CSV can be mapped to a single Array{T,2}, which is not adequate for saving a collection of heterogeneous items.

JSON is human-readable, but it mostly copes with vectors only; I would need to invent a metadata mapping for reshaping those.

BSON.jl handles most things fine, but is a binary format.

I think the suggestion was to store a sparse multi-dimensional array as a list of (index, value) pairs, writing out the full index, e.g.

1 2 4 6 -1.5
3 1 7 2 -10.0

represents a sparse 4D array with floating-point values.
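A minimal sketch of a writer for this layout (a hypothetical helper; a reader would additionally need the array dimensions stored somewhere):

# Write each nonzero entry of A as "i1 i2 … iN value", one entry per line.
function write_coo(io::IO, A::AbstractArray)
    for I in CartesianIndices(A)
        iszero(A[I]) && continue
        println(io, join(Tuple(I), ' '), ' ', A[I])
    end
end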

2 Likes

Maybe it’s just me, but

“more or less human-readable”

and

“supports multi-dimensional arrays”

are almost mutually exclusive, aren’t they? Why do you want it to be human-readable? Presumably, you won’t be able to parse the multi-dimensional information anyway.

Nay, I think that this is just a historical accident: JSON filled the human-readable niche, but comes from a language without support for multidimensional arrays. There is no a priori reason they could not be supported.

E.g. consider a JSON syntax variant where [A; B; C] stacks objects along an extra (last) dimension, making a matrix from vectors, etc. This could easily extend JSON. People can parse multidimensional arrays just fine (e.g. matrices in Julia). But I don’t quite want to introduce a new standard.
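In Julia terms, the hypothetical [A; B; C] would act like concatenation along a new trailing dimension (note that in actual Julia syntax, [A; B] means vcat, not this):

# stacking two length-2 vectors along a new (second) dimension
cat([1, 2], [3, 4]; dims = 2)   # 2×2 Matrix with columns [1, 2] and [3, 4]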

Human-readable formats are great for some applications.

Unfortunately, nothing I have is sparse. That alone would not be an obstacle (this is just an efficiency issue), but I would have to effectively write a new library for this, and mentally reconstituting these objects would stretch the concept of human-readable a bit.

To make things concrete, I would be happy if I could save and read back

(a = 1, b = ones(2), c = (d = ones(3, 4), e = ones(2, 3, 4)))

in a human-readable format. JSON is almost there.
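To illustrate (assuming I have JSON.jl’s behavior right: multidimensional arrays are written as nested vectors, and the shape is lost when parsing back):

using JSON  # JSON.jl

nt = (a = 1, b = ones(2), c = (d = ones(3, 4), e = ones(2, 3, 4)))
s = JSON.json(nt)     # nt.c.d is serialized as nested vectors
back = JSON.parse(s)  # back["c"]["d"] is a Vector{Any}; the 3×4 shape is gone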

Writing out I1, I2, …, In, value tuples as CSV doesn’t require your data to be sparse; it is just very inefficient if your data is dense, since then you could leave the indices implicit and save a lot of space. But if you want something human-readable, then presumably your data isn’t that huge, so that may be fine.

I don’t have an answer for you, but I know that when Octave stores an N-dimensional array in ASCII format, it first gives you the number of dimensions N, then the next N numbers are the sizes of those dimensions, and finally the values. E.g.

x = reshape(1:24, 3,4,2);
save x.dat x

Then the data file it saves looks like

# name: x
# type: matrix
# ndims: 3
 3 4 2
 1 2 3 ... 24

Maybe you could store the data like that, and recreate it? Maybe in JSON you could do something like

{
    "name": "x",
    "type": "Array{Int64, 3}",
    "dims": [3, 4, 2],
    "data": [1, 2, 3, ..., 24]
}

That’s easily human readable, and should give you all the information you need to recreate the array manually.
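A sketch of round-tripping that schema with JSON.jl (save_array and load_array are hypothetical helpers; restoring the element type from the “type” string is left out):

using JSON  # JSON.jl

save_array(io::IO, name::AbstractString, A::AbstractArray) =
    JSON.print(io, Dict("name" => name,
                        "type" => string(typeof(A)),
                        "dims" => collect(size(A)),
                        "data" => vec(A)))    # column-major flattening

function load_array(io::IO)
    d = JSON.parse(read(io, String))
    reshape(d["data"], Tuple(d["dims"])...)   # element type stays Any here
end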

2 Likes

Yes, this is what I intended, but I had an experiment to run before I could take the time to explain it.

This is actually how multidimensional MRI data is typically stored in JSON files (except that the “dims” field is assumed to be known from other metadata). It definitely works, but it assumes that those intended to digest the information will take the time to figure out how to translate “dims” + “data” into an actual array (which is only obvious if you are code-savvy).

If you do find a good solution please share because I obviously haven’t found anything that is completely satisfying.

5 Likes

Thanks. I feel that there should be a verb for “wrestling with the temptation to invent a new format”.

2 Likes

HDF5 is a godsend! Why would anyone want to use anything else?!
Instead of trying to invent something more human readable, we should clearly focus on inventing better humans.

HDF5 is a rather baroque standard, only a fraction of which is used in practice (say hello to the 400-page user guide). The only “compliant” implementation is the one from the HDF Group; most APIs just access it via their C library. It has limited UTF-8 support and carries a high risk of data corruption if the process is interrupted. You can find write-ups like Cyrille Rossant - Moving away from HDF5.

Most people love HDF5 until they get burnt, then look for something else, and find only experimental efforts.

2 Likes

The Zarr format in Python supports a JSON compressor, which was originally made for storing objects but can be used for normal data as well:

import numcodecs
import zarr
import numpy as np
x = np.arange(400).reshape((10,10,4))
zarr.array(x, store="json_array.zarr", chunks=(5,5,2), compressor=numcodecs.JSON())

A chunk of the array would now look like this:

cat json_array.zarr/0.0.0
[[[0,1],[4,5],[8,9],[12,13],[16,17]],[[40,41],[44,45],[48,49],[52,53],[56,57]],[[80,81],[84,85],[88,89],[92,93],[96,97]],[[120,121],[124,125],[128,129],[132,133],[136,137]],[[160,161],[164,165],[168,169],[172,173],[176,177]],"<i8",[5,5,2]]

Currently this compressor is not yet supported in Zarr.jl, but feel free to open an issue if there is general interest; it should not be too hard to implement.
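Judging from the chunk above, numcodecs.JSON appends the dtype string and the chunk shape after the data, so a hand-rolled Julia reader for a single chunk might look roughly like this (a sketch, not Zarr.jl API):

using JSON  # JSON.jl

function read_json_chunk(path::AbstractString)
    v = JSON.parsefile(path)
    dtype, shape = v[end - 1], v[end]  # e.g. "<i8" and [5, 5, 2]
    v[1:end - 2], dtype, shape         # nested lists with the chunk's values
end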

1 Like

My comment should of course be taken with a grain of salt. HDF5 needs improvement in many areas and is definitely a bit clunky. And yes, there is only one implementation, but are there, e.g., multiple implementations of Julia?

I know of no other format offering this amount of flexibility, performance, and robustness at the same time; I consider HDF5 a format construction kit plus IO library rather than a format by itself. I read Cyrille’s article some years ago and disagree with many of his points. Plus, recent HDF5 versions have improved drastically in many aspects, especially metadata performance and compression. Complaining about bad performance while suggesting filesystem-based data organization is questionable at least. A lack of robustness has never been an issue for me. HDF5 is the only widely used format/IO-library/thingy I know of that I could use for robust parallel IO on ~200,000 cores.

Back to your original question:
there’s typically a good reason for storing binary data in a binary format, which is why there are only a few frequently used ASCII formats. My best guess would be to try some of the ASCII formats supported by Matlab (either standard or sparse), if interoperability is your primary concern.
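For the 2-D case, the DelimitedFiles standard library already writes text that Matlab’s load -ascii should be able to read:

using DelimitedFiles  # Julia stdlib; covers only the 2-D case

A = rand(3, 4)
writedlm("A.txt", A)  # tab-delimited numeric text, e.g. for Matlab's `load -ascii`
B = readdlm("A.txt")  # reads back as a Matrix{Float64}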

I am not sure how this comparison is relevant, since Julia does not advertise itself as a long-term data storage solution.

My ideal data format would have the following (potentially conflicting) properties:

  1. I can make some sense of it with a text editor. This would not involve eyeballing arrays with 10^6 elements, just getting a sense of what’s in a file, what the variables are called, etc. I am fine with the occasional binary blob embedded in there, or externalized into the filesystem.

  2. A competent programmer should be able to write, in the programming language of her choice, a rudimentary reader that supports 95% of the features out there in 2–5 days, and 100% in 10 days.

I don’t have a particular gripe with HDF5; I am just concerned that if something breaks (which happens very commonly with software), I am not in a position to fix it.