I am looking at collecting data from an instrument. The data comes in one to sixteen channels; for this example let's use 4. It is collected with a specified frame size, say 100, which gives a 100x4 matrix for each frame of data. I would like to store the data after each frame is collected, so 10 frames of data would produce 10 of these 100x4 matrices saved to the file.
When the data is read, I would like it to be read back as one matrix of size 1000x4, or just one channel of length 1000. The actual data size may be much bigger.
In addition, I would like to store a structure for each channel giving identifying information.
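To make that layout concrete, here is a sketch in plain Julia; the ChannelInfo struct and its fields are placeholders for whatever identifying information each channel carries, not a proposed API:
julia> struct ChannelInfo            # hypothetical per-channel metadata
           name::String
           units::String
           sample_rate_hz::Float64
       end
julia> frames = [rand(Float32, 100, 4) for _ in 1:10];   # 10 frames of 100x4
julia> full = reduce(vcat, frames);                      # read back as one 1000x4 matrix
julia> channel1 = full[:, 1];                            # or a single channel of length 1000
julia> size(full), length(channel1)
((1000, 4), 1000)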
This is being done on a Raspberry Pi 4 with 8 GB of RAM.
My preference is for a pure Julia implementation. I suspect the HDF5 data format may be best, but perhaps Avro, Arrow, or Parquet should be considered. It seems that in general JLD2.jl is preferred over HDF5.jl; however, HDF5Logger.jl looks like it might most closely meet my needs, so I think it is the first I will try. Any advice or comments will be appreciated.
If you store scan data as (#chan) x (#samples), i.e. 4x100, arrays, you could just append each chunk of data to the file and it will produce a file with layout 4xN. The header could be an array of structures with the channel information and parameters.
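A minimal sketch of that append-and-read-back pattern with plain binary I/O (the filename and use of Float32 are assumptions for illustration; the channel-header structures would be stored separately):
# Append one (#chan) x (#samples) chunk per frame; Julia writes
# column-major, so the file grows as a 4xN stream of samples
julia> open("scan.bin", "a") do io
           write(io, rand(Float32, 4, 100))
       end
1600
# Read the whole file back as a single 4xN matrix
julia> N = filesize("scan.bin") ÷ (4 * sizeof(Float32));
julia> data = open(io -> read!(io, Matrix{Float32}(undef, 4, N)), "scan.bin");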
Following up on the previous comment, I searched for how to append to a file and came across a topic which was most useful. This led me to the section in the Arrow manual that describes Arrow.Writer. An example:
julia> using Arrow, Tables
julia> writer = open(Arrow.Writer, "test.arrow");
julia> for j = 1:400:100_000
           A = Tables.table(transpose(reshape(1:400, 4, 100)) .+ j)
           Arrow.write(writer, A)
       end
julia> close(writer)
julia> table = Arrow.Table("test.arrow")
Arrow.Table with 24900 rows, 4 columns, and schema:
 :Column1  Int64
 :Column2  Int64
 :Column3  Int64
 :Column4  Int64
julia> table[1][1:3]
3-element Vector{Int64}:
2
6
10
This successfully writes out data from 4 channels sampled at 51.2 kHz in 1-second chunks. It works with both Float32 and Float64 using the SD card on the Pi, at a rate of 16 Mbit/s. Sometimes the first invocation gives a buffer overrun error; I assume this is due to JIT compilation.
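If the overrun really is first-call compilation latency, one possible mitigation (an assumption on my part, not verified) is to push a dummy frame through the same write path once before real acquisition starts:
julia> warmup = open(Arrow.Writer, "warmup.arrow");
julia> Arrow.write(warmup, Tables.table(zeros(Float64, 100, 4)));  # trigger JIT compilation
julia> close(warmup); rm("warmup.arrow")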
Now I would also like to save some structures in the same file to hold the test and channel metadata. I see that there is an Arrow.Struct, but no example of how to save it, let alone how to save it alongside my table data.
Is this possible?
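One route that looks possible is Arrow's file-level metadata: Arrow.write accepts a metadata keyword of string key-value pairs, which Arrow.getmetadata can recover on read. Structured metadata would have to be flattened to strings (e.g. as JSON); the keys below are made up for illustration:
using Arrow, Tables
A = Tables.table(rand(Float64, 100, 4))
# Attach test/channel metadata as file-level key-value strings
Arrow.write("meta.arrow", A;
    metadata = ["test_id" => "run-001",
                "channel1" => "accel_x, 51200 Hz"])
# Read the metadata back alongside the data
tbl = Arrow.Table("meta.arrow")
meta = Arrow.getmetadata(tbl)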
How much do you know about the data before you start collecting data?
My recommendation in any case is that you create a template file and fill it in. For example, you could create a contiguous HDF5 dataset with the exact dimensions that you plan to collect, then use h5ls -va to locate the byte offset of the dataset. On the Raspberry Pi, make a copy of the template file, seek to the appropriate offset, and start writing.
With HDF5, you could also set up a virtual dataset to stitch the individual matrices back together.
I set up the data acquisition parameters before I start acquiring, so it is certainly possible to create the file beforehand and then fill it in. I assume that in the scenario you describe I would be using HDF5.jl, not JLD2.jl. It would also be beneficial to have a way of stopping the data collection early while still being able to access the data collected up to that point.
It is getting late here, so I will look at this further tomorrow. If there is an example or tutorial you can point me to that uses this technique, it will certainly be appreciated. I am new to both Arrow and HDF5.
Here is the first part:
julia> using HDF5
# Create the template and get the dataset's byte offset
julia> h5open("template.h5", "w", libver_bounds=v"1.12") do h5f
           h5ds = create_dataset(h5f, "data", UInt8, (100,4), alloc_time = :early)
           HDF5.API.h5d_get_offset(h5ds) |> Int
       end
2048
# Check that file size fits, 2048 bytes of header + 400 bytes of data
julia> filesize("template.h5")
2448
# Make a copy of the template
julia> cp("template.h5", "data00.h5")
"data00.h5"
# Acquire the data
julia> A = rand(UInt8, 100, 4);
# Write the data into the template. Note that we do not need HDF5.jl for this part
julia> open("data00.h5", "r+") do io
           seek(io, 2048)
           write(io, A)
       end
400
# Load the data
julia> data = h5open("data00.h5") do h5f
           h5ds = h5f["data"]
           h5ds[:,:]
       end
100×4 Matrix{UInt8}:
0xfe 0x93 0x0b 0x41
0xae 0x39 0xd8 0x05
0x3e 0x50 0xe9 0xd9
0xa6 0x99 0xfd 0xeb
0xc0 0x7a 0x3a 0x18
0x02 0x9d 0x5c 0x76
0x9f 0x0e 0xba 0x3c
0x01 0xa8 0xbf 0x3b
0x39 0x55 0x47 0xeb
0xa9 0x6c 0xd6 0xdd
⋮
0xa7 0x0b 0x8f 0x77
0xf2 0xdd 0x02 0x22
0xa6 0x43 0x6e 0x77
0x6d 0x44 0x58 0x76
0xe5 0xd6 0x44 0xdc
0x7f 0xff 0x79 0x0f
0x4a 0x42 0x15 0x92
0xcc 0x07 0xf1 0x80
0x5a 0x9c 0x63 0x24
# Confirm the data we got from HDF5 is the data we stored
julia> data == A
true
Let’s save some more data.
julia> more_data = [rand(UInt8, 100, 4) for i in 1:9];
julia> using Printf
julia> for i in 1:9
           filename = @sprintf("data%02d.h5", i)
           cp("template.h5", filename)
           open(filename, "r+") do io
               seek(io, 2048)
               write(io, more_data[i])
           end
       end
Here’s how to assemble the individual files into a larger 1000x4 dataset.
julia> using HDF5
julia> vspace = dataspace((1000,4))
HDF5.Dataspace: (1000, 4)
julia> h5open("virtual.h5", "w", libver_bounds=v"1.12") do h5f
           vds = create_dataset(h5f, "virtual", datatype(UInt8), vspace;
               virtual=[HDF5.VirtualMapping(
                   HDF5.select_hyperslab!(copy(vspace), (HDF5.BlockRange(1 + 100*i; count = 100), 1:4)),
                   "./data0$i.h5",
                   "data",
                   dataspace((100,4))
               ) for i in 0:9]
           )
       end
julia> h5open("virtual.h5") do h5f
           h5f["virtual"][:,:]
       end
1000×4 Matrix{UInt8}:
...
@simonbyrne, is there a way to make this shorter? I think you’re more familiar with virtual datasets than I.
Unless you need to write from different processes, personally I would recommend just writing it to a single HDF5 file in the layout you want:
using HDF5
f = h5open("data.h5", "w")
d = create_dataset(f, "dataset", UInt8, dataspace((100*10, 4)))
for i = 0:9
    d[(1:100) .+ i*100, :] = rand(UInt8, 100, 4)
end
X = read_dataset(f, "dataset") # to read back
close(f)
I’m pretty sure we provide Raspberry Pi-compatible binaries for HDF5, but if not, please open an issue.
If you want to use virtual datasets, then you can also use pattern matching so you don’t need to specify all the files ahead of time:
vspace = dataspace((1000,4))
create_dataset(h5f, "virtual2", datatype(UInt8), vspace;
    virtual=[HDF5.VirtualMapping(
        HDF5.hyperslab(vspace, (HDF5.BlockRange(1:100; count = -1), 1:4)),
        "./data0%b.h5",
        "data",
        dataspace((100,4))
    )]
)
If I replace the create_dataset line in the first example above with
d = create_dataset(f, "dataset", UInt8, (100*10, 4))
then things seem to work.
And HDF5 can be installed on the Raspberry Pi, no problems there.
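For reference, here is the full single-file sequence with that substitution. A nice property for my early-stop requirement is that the dataset has its full size from the start, so interrupting the loop partway should still leave a readable file (I assume unwritten rows come back as the fill value):
using HDF5
f = h5open("data.h5", "w")
d = create_dataset(f, "dataset", UInt8, (100*10, 4))
for i = 0:9
    d[(1:100) .+ i*100, :] = rand(UInt8, 100, 4)
end
X = read_dataset(f, "dataset")   # read back the full 1000x4 matrix
close(f)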
Going a bit further in my use case with your simple example:
julia> using HDF5
# f is the open file handle from the example above
julia> d = create_dataset(f, "dataset4", Int8, (100*10, 4))
julia> for i = 0:9
           d[(1:100) .+ i*100, 2:2:4] = rand(Int8, 100, 2)
       end
julia> for i = 0:9
           d[(1:100) .+ i*100, [1, 3]] = rand(Int8, 100, 2)
       end
ERROR: MethodError: no method matching setindex!(::HDF5.Dataset, ::Matrix{Int8}, ::UnitRange{Int64}, ::Vector{Int64})
Closest candidates are:
setindex!(::HDF5.Dataset, ::Array{T}, ::Union{Colon, Int64, AbstractRange{Int64}}...) where T
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:317
setindex!(::HDF5.Dataset, ::AbstractArray, ::Union{Colon, Int64, AbstractRange{Int64}}...)
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:401
setindex!(::HDF5.Dataset, ::Any, ::Union{Colon, Int64, AbstractRange{Int64}}...)
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:389
...
It seems that indexing the dataset accepts ranges (the AbstractRange in the signatures above) but not an arbitrary Vector of indices.
I could copy everything into a temporary Matrix and then save it all at once, but that would mean extra copying.
What is the recommendation for this?
You can use two writes:
for i = 0:9
    X = rand(Int8, 100, 2)
    d[(1:100) .+ i*100, 1] = view(X, :, 1)
    d[(1:100) .+ i*100, 3] = view(X, :, 2)
end
We might be able to support this via some more sophisticated hyperslab selection.
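In the meantime, one observation that follows from the method signatures in the error above: 1:2:3 is an AbstractRange covering exactly columns 1 and 3, so for this particular case a strided range may work in a single write (untested sketch):
for i = 0:9
    # 1:2:3 is a StepRange selecting columns 1 and 3
    d[(1:100) .+ i*100, 1:2:3] = rand(Int8, 100, 2)
end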