I am looking at collecting data from an instrument. The data comes in one to sixteen channels; for this example let's use 4. It is collected with a specified frame size, say 100, which gives a 100x4 matrix for each frame of data. I would like to store the data after each frame is collected, so 10 frames of data would produce 10 of these 100x4 matrices saved to the file.
When the data is read, I would like it to be read back as one matrix of size 1000x4, or just one channel of length 1000. The actual data size may be much bigger.
In addition, I would like to store a structure for each channel giving identifying information.
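To make that layout concrete, here is a sketch in plain Julia; the ChannelInfo struct and its fields are placeholders for whatever identifying information each channel carries, not a proposed API:
julia> struct ChannelInfo            # hypothetical per-channel metadata
           name::String
           units::String
           sample_rate_hz::Float64
       end
julia> frames = [rand(Float32, 100, 4) for _ in 1:10];   # 10 frames of 100x4
julia> full = reduce(vcat, frames);                      # read back as one 1000x4 matrix
julia> channel1 = full[:, 1];                            # or a single channel of length 1000
julia> size(full), length(channel1)
((1000, 4), 1000)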
This is being done on a Raspberry Pi 4 with 8 GB of RAM.
My preference is for a pure Julia implementation. I suspect the HDF5 data format may be best, but perhaps Avro, Arrow, or Parquet should be considered. It seems that in general JLD2.jl is preferred over HDF5.jl; however, HDF5Logger.jl looks like it might most closely meet my needs, so I think it is the first I will try. Any advice or comments will be appreciated.
If you store scan data as (#chan) x (#samples), i.e. 4x100, arrays, you could just append each chunk of data to the file and it will produce a file with layout 4xN. The header could be an array of structures with the channel information and parameters.
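A minimal sketch of that append-and-read-back pattern with plain binary I/O (the filename and use of Float32 are assumptions for illustration; the channel-header structures would be stored separately):
# Append one (#chan) x (#samples) chunk per frame; Julia writes
# column-major, so the file grows as a 4xN stream of samples
julia> open("scan.bin", "a") do io
           write(io, rand(Float32, 4, 100))
       end
1600
# Read the whole file back as a single 4xN matrix
julia> N = filesize("scan.bin") ÷ (4 * sizeof(Float32));
julia> data = open(io -> read!(io, Matrix{Float32}(undef, 4, N)), "scan.bin");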
Following up on the previous comment, I searched for how to append to a file and came across a topic which was most useful. This led me to the section in the Arrow manual that describes Arrow.Writer. An example:
julia> using Arrow, Tables
julia> writer = open(Arrow.Writer, "test.arrow");
julia> for j = 1:400:100_000
           A = Tables.table(transpose(reshape(1:400, 4, 100)) .+ j)
           Arrow.write(writer, A)
       end
julia> close(writer)
julia> table = Arrow.Table("test.arrow")
Arrow.Table with 24900 rows, 4 columns, and schema:
 :Column1  Int64
 :Column2  Int64
 :Column3  Int64
 :Column4  Int64
julia> table[1][1:3]
3-element Vector{Int64}:
2
6
10
This successfully writes out data from 4 channels sampled at 51.2 kHz in 1-second chunks. It works with both Float32 and Float64 using the SD card on the Pi, at a rate of 16 Mbit/s. Sometimes the first invocation gives a buffer overrun error; I assume this is due to JIT compilation.
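If the overrun really is first-call compilation latency, one possible mitigation (an assumption on my part, not verified) is to push a dummy frame through the same write path once before real acquisition starts:
julia> warmup = open(Arrow.Writer, "warmup.arrow");
julia> Arrow.write(warmup, Tables.table(zeros(Float64, 100, 4)));  # trigger JIT compilation
julia> close(warmup); rm("warmup.arrow")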
Now I would also like to save some structures in the same file to hold the test and channel metadata. I see that there is an Arrow.Struct, but no example of how to save it, let alone how to save it alongside my table data.
Is this possible?
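One route that looks possible is Arrow's file-level metadata: Arrow.write accepts a metadata keyword of string key-value pairs, which Arrow.getmetadata can recover on read. Structured metadata would have to be flattened to strings (e.g. as JSON); the keys below are made up for illustration:
using Arrow, Tables
A = Tables.table(rand(Float64, 100, 4))
# Attach test/channel metadata as file-level key-value strings
Arrow.write("meta.arrow", A;
    metadata = ["test_id" => "run-001",
                "channel1" => "accel_x, 51200 Hz"])
# Read the metadata back alongside the data
tbl = Arrow.Table("meta.arrow")
meta = Arrow.getmetadata(tbl)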
How much do you know about the data before you start collecting data?
My recommendation in any case is that you create a template file and fill it in. For example, you could create a contiguous HDF5 dataset with the exact dimensions that you plan to collect, then use h5ls -va to locate the byte offset of the dataset. On the Raspberry Pi, make a copy of the template file, seek to the appropriate offset, and start writing.
With HDF5, you could also set up a virtual dataset to stitch the individual matrices back together.
I set up the data acquisition parameters before I start acquiring, so it is certainly possible to create the file beforehand and then fill it in. I assume that in the scenario you describe I would be using HDF5.jl, not JLD2.jl. It would also be beneficial to have a way of stopping the data collection early while still being able to access the data collected up to that point.
It is getting late here, so I will look at this further tomorrow. If there is an example or tutorial you can point me to that uses this technique, it will certainly be appreciated. I am new to both Arrow and HDF5.
Here is the first part:
julia> using HDF5
# Create the template and get the dataset's byte offset
julia> h5open("template.h5", "w", libver_bounds=v"1.12") do h5f
           h5ds = create_dataset(h5f, "data", UInt8, (100,4), alloc_time = :early)
           HDF5.API.h5d_get_offset(h5ds) |> Int
       end
2048
# Check that file size fits, 2048 bytes of header + 400 bytes of data
julia> filesize("template.h5")
2448
# Make a copy of the template
julia> cp("template.h5", "data00.h5")
"data00.h5"
# Acquire the data
julia> A = rand(UInt8, 100, 4);
# Write the data into the template. Note that we do not need HDF5.jl for this part
julia> open("data00.h5", "r+") do io
           seek(io, 2048)
           write(io, A)
       end
400
# Load the data
julia> data = h5open("data00.h5") do h5f
           h5ds = h5f["data"]
           h5ds[:,:]
       end
100×4 Matrix{UInt8}:
0xfe 0x93 0x0b 0x41
0xae 0x39 0xd8 0x05
0x3e 0x50 0xe9 0xd9
0xa6 0x99 0xfd 0xeb
0xc0 0x7a 0x3a 0x18
0x02 0x9d 0x5c 0x76
0x9f 0x0e 0xba 0x3c
0x01 0xa8 0xbf 0x3b
0x39 0x55 0x47 0xeb
0xa9 0x6c 0xd6 0xdd
⋮
0xa7 0x0b 0x8f 0x77
0xf2 0xdd 0x02 0x22
0xa6 0x43 0x6e 0x77
0x6d 0x44 0x58 0x76
0xe5 0xd6 0x44 0xdc
0x7f 0xff 0x79 0x0f
0x4a 0x42 0x15 0x92
0xcc 0x07 0xf1 0x80
0x5a 0x9c 0x63 0x24
# Confirm the data we got from HDF5 is the data we stored
julia> data == A
true
Let’s save some more data.
julia> more_data = [rand(UInt8, 100, 4) for i in 1:9];
julia> using Printf
julia> for i in 1:9
           filename = @sprintf("data%02d.h5", i)
           cp("template.h5", filename)
           open(filename, "r+") do io
               seek(io, 2048)
               write(io, more_data[i])
           end
       end
Here’s how to assemble the individual files into a larger 1000x4 dataset.
julia> using HDF5
julia> vspace = dataspace((1000,4))
HDF5.Dataspace: (1000, 4)
julia> h5open("virtual.h5", "w", libver_bounds=v"1.12") do h5f
           vds = create_dataset(h5f, "virtual", datatype(UInt8), vspace;
               virtual=[HDF5.VirtualMapping(
                   HDF5.select_hyperslab!(copy(vspace), (HDF5.BlockRange(1 + 100*i; count = 100), 1:4)),
                   "./data0$i.h5",
                   "data",
                   dataspace((100,4))
               ) for i in 0:9]
           )
       end
julia> h5open("virtual.h5") do h5f
           h5f["virtual"][:,:]
       end
1000×4 Matrix{UInt8}:
...
@simonbyrne, is there a way to make this shorter? I think you’re more familiar with virtual datasets than I.
Unless you need to write from different processes, personally I would recommend just writing it to a single HDF5 file in the layout you want:
using HDF5
f = h5open("data.h5", "w")
d = create_dataset(f, "dataset", UInt8, dataspace((100*10, 4)))
for i = 0:9
    d[(1:100) .+ i*100, :] = rand(UInt8, 100, 4)
end
X = read_dataset(f, "dataset") # to read back
close(f)
I’m pretty sure we provide Raspberry Pi-compatible binaries for HDF5, but if not, please open an issue.
If you want to use virtual datasets, then you can also use pattern matching so you don’t need to specify all the files ahead of time:
vspace = dataspace((1000,4))
create_dataset(h5f, "virtual2", datatype(UInt8), vspace;
    virtual=[HDF5.VirtualMapping(
        HDF5.hyperslab(vspace, (HDF5.BlockRange(1:100; count = -1), 1:4)),
        "./data0%b.h5",
        "data",
        dataspace((100,4))
    )]
)
If I replace the create_dataset line in the first example above with
d = create_dataset(f, "dataset", UInt8, (100*10, 4))
then things seem to work.
And HDF5 can be installed on the Raspberry Pi, no problems there.
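For reference, here is the full single-file sequence with that substitution. A nice property for my early-stop requirement is that the dataset has its full size from the start, so interrupting the loop partway should still leave a readable file (I assume unwritten rows come back as the fill value):
using HDF5
f = h5open("data.h5", "w")
d = create_dataset(f, "dataset", UInt8, (100*10, 4))
for i = 0:9
    d[(1:100) .+ i*100, :] = rand(UInt8, 100, 4)
end
X = read_dataset(f, "dataset")   # read back the full 1000x4 matrix
close(f)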
Going a bit further in my use case with your simple example:
julia> using HDF5
# f is the open file handle from the example above
julia> d = create_dataset(f, "dataset4", Int8, (100*10, 4))
julia> for i = 0:9
           d[(1:100) .+ i*100, 2:2:4] = rand(Int8, 100, 2)
       end
julia> for i = 0:9
           d[(1:100) .+ i*100, [1, 3]] = rand(Int8, 100, 2)
       end
ERROR: MethodError: no method matching setindex!(::HDF5.Dataset, ::Matrix{Int8}, ::UnitRange{Int64}, ::Vector{Int64})
Closest candidates are:
setindex!(::HDF5.Dataset, ::Array{T}, ::Union{Colon, Int64, AbstractRange{Int64}}...) where T
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:317
setindex!(::HDF5.Dataset, ::AbstractArray, ::Union{Colon, Int64, AbstractRange{Int64}}...)
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:401
setindex!(::HDF5.Dataset, ::Any, ::Union{Colon, Int64, AbstractRange{Int64}}...)
@ HDF5 C:\Users\jakez\.julia\packages\HDF5\aiZLs\src\datasets.jl:389
...
It seems that indexing the dataset accepts ranges (the AbstractRange in the signatures above) but not an arbitrary Vector of indices.
I could copy everything into a temporary Matrix and then save it all at once, but that would mean extra copying.
What is the recommendation for this?
You can use two writes:
for i = 0:9
    X = rand(Int8, 100, 2)
    d[(1:100) .+ i*100, 1] = view(X, :, 1)
    d[(1:100) .+ i*100, 3] = view(X, :, 2)
end
We might be able to support this via some more sophisticated hyperslab selection.
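In the meantime, one observation that follows from the method signatures in the error above: 1:2:3 is an AbstractRange covering exactly columns 1 and 3, so for this particular case a strided range may work in a single write (untested sketch):
for i = 0:9
    # 1:2:3 is a StepRange selecting columns 1 and 3
    d[(1:100) .+ i*100, 1:2:3] = rand(Int8, 100, 2)
end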