read!(io, Vector{MyStruct}) not equivalent to a loop over read with eachindex

I have a packed structure of data in a binary stream that I wish to
read in julia. I defined the type struct and then added a read method
to dispatch for that type:

struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

import Base.read

function read(s::IO, ::Type{MyStruct})
    a = Base.read(s, Int16)
    b = Base.read(s, Int16)
    c = Base.read(s, Int64)
    d = Base.read(s, Int16)
    e = Base.read(s, Int16)
    f = Base.read(s, UInt32)
    MyStruct(a,b,c,d,e,f)
end

With this I can use read to read a single MyStruct value with
val = read(io,MyStruct), or in a loop with

vMyStruct = Vector{MyStruct}(undef,10)
for i in eachindex(vMyStruct)
    vMyStruct[i] = read(io,eltype(vMyStruct))
end

However, when I use read!(io,vMyStruct) the result is not
equivalent. I looked at the code for read! and it appears
that the IO is just an unsafe_read of the binary stream and
not repeated use of the read method as defined.

Looking at the code for read! in Base, the loop over the read
method is in there, but the check that decides whether an
unsafe_read would give the correct result may not be right.

From io.jl in base:

function read!(s::IO, a::AbstractArray{T}) where T
    if isbitstype(T) && (a isa Array || a isa FastContiguousSubArray{T,<:Any,<:Array{T}})
        GC.@preserve a unsafe_read(s, pointer(a), sizeof(a))
    else
        for i in eachindex(a)
            a[i] = read(s, T)
        end
    end
    return a
end

It seems like the FastContiguousSubArray test should maybe be
&& instead of ||, which might make the hand-written loop with
eachindex unnecessary.

I vaguely remember this working the last time I worked with the I/O
but maybe I am misremembering things…

Is this the intended operation for read! in this case?

It seems you don’t actually need to define read (or write) in the Vector{MyStruct} example, as the code below already runs fine.

struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

ms1 = MyStruct(Int16(1), Int16(2), Int64(3), Int16(4), Int16(5), UInt32(6))
ms2 = MyStruct(Int16(10), Int16(20), Int64(30), Int16(40), Int16(50), UInt32(60))

test_vms = [ms1, ms2]
write("test.bin", test_vms)

vms = Vector{MyStruct}(undef, 2)
read!("test.bin", vms)
julia> vms
2-element Vector{MyStruct}:
 MyStruct(1, 2, 3, 4, 5, 0x00000006)
 MyStruct(10, 20, 30, 40, 50, 0x0000003c)

For saving or reading a single MyStruct it does seem you do need to define write and read. (I would imagine there being a functional default behaviour when isbits. How else could writing Vector{MyStruct} already work?)


I can confirm there is a discrepancy between

julia> vms = Vector{MyStruct}(undef, 2);
julia> read!("test.bin", vms)
2-element Vector{MyStruct}:
 MyStruct(1, 2, 3, 4, 5, 0x00000006)
 MyStruct(10, 20, 30, 40, 50, 0x0000003c)

and

function Base.read(s::IO, ::Type{MyStruct})                                                          
    a = Base.read(s, Int16)                                                                 
    b = Base.read(s, Int16)                                                                 
    c = Base.read(s, Int64)                                                                 
    d = Base.read(s, Int16)                                                                 
    e = Base.read(s, Int16)                                                             
    f = Base.read(s, UInt32)                                                                
    return MyStruct(a,b,c,d,e,f)                                                                
end

vms = Vector{MyStruct}(undef, 2)
open("test.bin", "r") do file
    for i in eachindex(vms)
        vms[i] = read(file, MyStruct)
    end
end
julia> vms
2-element Vector{MyStruct}:
 MyStruct(1, 2, 12884902085, 0, 0, 0x00050004)
 MyStruct(6, 0, 846109868042, 30, 0, 0x00000000)

But to me it makes sense this won’t work, as write for Vector{MyStruct} also needs to store the overhead of Vector, such as its length.
EDIT: You don’t need to store the length, as you can deduce this from the size of the output file and the size of each element in the Vector. E.g. write("len_test.bin", UInt8.([1, 2, 3, 2, 1])) gives a file with 01 02 03 02 01 in hex, i.e. the length 5 does not appear. Since the file is 5 bytes, and every element (UInt8) occupies a single byte, you indeed know that the length must be 5.
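
The point about the length not being stored can be checked without touching the disk. A minimal sketch, assuming an in-memory IOBuffer as a stand-in for len_test.bin:

```julia
# Write a small byte vector to an in-memory buffer instead of a file.
io = IOBuffer()
nbytes = write(io, UInt8[1, 2, 3, 2, 1])  # write returns the number of bytes written
bytes = take!(io)                         # raw contents of the buffer

# Only the element bytes are present; no length header precedes them.
println(nbytes)
println(bytes)
```

The byte count returned by write equals the number of elements times the element size, which is why the reader has to know (or infer) the length on its own.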


If you would use

function Base.write(io::IO, ms::MyStruct)                                                        
   write(io, ms.A)
   write(io, ms.B)
   write(io, ms.C)
   write(io, ms.D)
   write(io, ms.E)
   write(io, ms.F)                                                               
end

open("test.bin", "w") do file
   for ms in test_vms
       write(file, ms)
   end
end

vms = Vector{MyStruct}(undef, 2)
open("test.bin", "r") do file
   for i in eachindex(vms)
       vms[i] = read(file, MyStruct)
   end
end

you do get the intended

julia> vms
2-element Vector{MyStruct}:
 MyStruct(1, 2, 3, 4, 5, 0x00000006)
 MyStruct(10, 20, 30, 40, 50, 0x0000003c)
EDIT: Hex outputs

The output files using write(::String, ::Vector{MyStruct}) and the loop approach are respectively (split into lines for clarity)

01 00 
02 00 
0B 02 00 00 
03 00 00 00 00 00 00 00 
04 00 
05 00 
06 00 00 00 
0A 00 
14 00 
0B 02 00 00 
1E 00 00 00 00 00 00 00 
28 00 
32 00 
3C 00 00 00

and

01 00 
02 00 
03 00 00 00 00 00 00 00 
04 00 
05 00 
06 00 00 00 
0A 00 
14 00 
1E 00 00 00 00 00 00 00 
28 00
32 00 
3C 00 00 00

If anyone knows the meaning of the 0B 02 00 00 in each MyStruct, and why it appears at such a seemingly arbitrary location, that would be interesting.
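
One plausible explanation, offered as a sketch rather than a definitive answer: Julia lays out isbits structs with C-like alignment, so the Int64 field C is placed on an 8-byte boundary and the gap after B is padding whose contents are unspecified. write on a Vector{MyStruct} copies the in-memory bytes, padding included. The layout can be inspected with fieldoffset (the comments assume a typical 64-bit platform):

```julia
struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

# Byte offset of each field inside the in-memory layout.
offsets = [Int(fieldoffset(MyStruct, i)) for i in 1:fieldcount(MyStruct)]

# Bytes the fields need on their own versus what one element occupies in memory.
packed_size = sum(sizeof, fieldtypes(MyStruct))
padded_size = sizeof(MyStruct)

println(offsets)
println((packed_size, padded_size))
```

If C starts at offset 8 while A and B only occupy bytes 0 to 3, then bytes 4 to 7 are padding, which is exactly where 0B 02 00 00 shows up in the first dump. The value itself would then be whatever happened to be in memory.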


Your second case writes the data from Julia in the packed
manner that I am reading the stream from.

My question is: should read!(io,Vector{MyStruct}) not dispatch
to my specific read(io,::Type{MyStruct}) method in this case?

The source suggests that for julia 1.10.4 (the version I am using here)
the else branch is what I think would be the correct operation. I’m
just not sure what changes need to be made to keep things performant.

Oh, okay, I see. So basically, you want to know why read! is implemented the way it is, and why it’s not just (or more often) a loop, which would have better dispatch behaviour?

In that case, I don’t know, good question! I agree that the dispatching loop would be more natural, though less performant (see benchmark below).
(I guess one could change the implementation of read! to check if read is implemented for the eltype, and use a loop if so. But that would give worse performance for e.g. Vector{Int} (unless you also add exceptions for the built-in Ints and Floats). Another option would be to add a flag force_iterate, which overrules the current checks.)

By the way, note that write is even worse, under no circumstances calling a lower level write (i.e. for MyStruct):

function write(s::IO, a::Array)
    if isbitstype(eltype(a))
        return GC.@preserve a unsafe_write(s, pointer(a), sizeof(a))
    else
        error("`write` is not supported on non-isbits arrays")
    end
end

(There is also a more generic version write(::IO, A::AbstractArray), which does call write(::IO, ::eltype(A)), but that version will not be used when we have a simple Vector (i.e. in write("test.bin", test_vms) ).)

Benchmark

As a simple benchmark, consider

using BenchmarkTools

struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

function mywrite(io::IO, ms::MyStruct)
    nb_bytes = write(io, ms.A)
    nb_bytes += write(io, ms.B)
    nb_bytes += write(io, ms.C)
    nb_bytes += write(io, ms.D)
    nb_bytes += write(io, ms.E)
    nb_bytes += write(io, ms.F)
    return nb_bytes
end

function mywrite(io::IO, vms::Vector{MyStruct})
    nb_bytes = 0
    for i = eachindex(vms)
        nb_bytes += mywrite(io, vms[i])
    end
    return nb_bytes
end

mywrite(filepath::String, vms::Vector{MyStruct}) = open(file -> mywrite(file, vms), filepath, "w")

function myread(io::IO, ::Type{MyStruct})                                                          
    a = Base.read(io, Int16)                                                                 
    b = Base.read(io, Int16)                                                                 
    c = Base.read(io, Int64)                                                                 
    d = Base.read(io, Int16)                                                                 
    e = Base.read(io, Int16)                                                             
    f = Base.read(io, UInt32)                                                                
    return MyStruct(a,b,c,d,e,f)                                                                
end

function myread!(io::IO, vms::Vector{MyStruct})
    for i in eachindex(vms)
        vms[i] = myread(io, MyStruct)
    end
    return vms
end

myread!(filepath::String, vms::Vector{MyStruct}) = open(file -> myread!(file, vms), filepath, "r")


ms1 = MyStruct(Int16(1), Int16(2), Int64(3), Int16(4), Int16(5), UInt32(6))
ms2 = MyStruct(Int16(10), Int16(20), Int64(30), Int16(40), Int16(50), UInt32(60))

test_vms = [ms1, ms2]
vms = Vector{MyStruct}(undef, length(test_vms))

@btime write("write.bin", $test_vms)
#   69.200 μs (12 allocations: 744 bytes)
# 48 bytes
@btime mywrite("mywrite.bin", $test_vms)
#   70.200 μs (24 allocations: 936 bytes)
# 40 bytes

@btime myread!("mywrite.bin", $vms)
#   27.200 μs (12 allocations: 744 bytes)
# 2-element Vector{MyStruct}:
# MyStruct(1, 2, 3, 4, 5, 0x00000006)
# MyStruct(10, 20, 30, 40, 50, 0x0000003c)
@btime read!("write.bin", $vms)
#   26.700 μs (11 allocations: 728 bytes)
# 2-element Vector{MyStruct}:
# MyStruct(1, 2, 3, 4, 5, 0x00000006)
# MyStruct(10, 20, 30, 40, 50, 0x0000003c)


test_vms2 = [ms1 for _ in 1:10^6]
vms2 = Vector{MyStruct}(undef, length(test_vms2))

@btime write("write2.bin", $test_vms2)
#   23.769 ms (13 allocations: 760 bytes)
# 24000000
@btime mywrite("mywrite2.bin", $test_vms2)
#   174.469 ms (6000013 allocations: 91.55 MiB)  # (Where are all these allocations coming from?)
# 20000000


@btime myread!("mywrite2.bin", $vms2);
#   137.196 ms (12 allocations: 744 bytes)
@btime read!("write2.bin", $vms2);
#   4.029 ms (11 allocations: 728 bytes)

So the performance difference is negligible for small Vectors, but it grows to be significant for longer ones.

I figured out why I remembered this working before. There are a number
of data types based on MyStruct and some have Vector{Int16} of size
F as in

struct MyStructv
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
    G::Vector{Int16}  # Has size F
end

In this case, it is not an isbitstype so the appropriate method for
read(io,MyStructv) is called by read!().

In the MyStruct case the type isbits so the raw byte-wise IO
is used.
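
The difference between the two types can be checked directly. A small sketch; takes_fast_path is a hypothetical name, and the check is simplified to the Array case of the condition quoted from io.jl above:

```julia
struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

struct MyStructv
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
    G::Vector{Int16}  # a heap-allocated field, so the struct is not isbits
end

# Simplified form of the branch condition in read!, restricted to plain Arrays.
takes_fast_path(a::AbstractArray) = isbitstype(eltype(a)) && a isa Array

println(takes_fast_path(Vector{MyStruct}(undef, 2)))   # raw unsafe_read branch
println(takes_fast_path(Vector{MyStructv}(undef, 2)))  # element-wise read branch
```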

It appears that I need to define my own method for
read!(::IO,::AbstractArray{MyStruct}) when my data type isbits.
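
A minimal sketch of what such a method could look like, assuming the packed field order shown earlier. The IOBuffer round trip is only a stand-in for the real binary stream, and I have not measured the performance cost against the raw-byte path:

```julia
struct MyStruct
    A::Int16
    B::Int16
    C::Int64
    D::Int16
    E::Int16
    F::UInt32
end

# Field-by-field reader for the packed (unpadded) stream layout.
function Base.read(io::IO, ::Type{MyStruct})
    a = read(io, Int16)
    b = read(io, Int16)
    c = read(io, Int64)
    d = read(io, Int16)
    e = read(io, Int16)
    f = read(io, UInt32)
    return MyStruct(a, b, c, d, e, f)
end

# More specific than the generic read!(::IO, ::AbstractArray{T}) in Base,
# so it is selected for arrays of MyStruct and the raw-byte branch is skipped.
function Base.read!(io::IO, a::AbstractArray{MyStruct})
    for i in eachindex(a)
        a[i] = read(io, MyStruct)
    end
    return a
end

# Round trip: write two packed records (20 bytes each), then read them back.
io = IOBuffer()
for (a, b, c, d, e, f) in ((1, 2, 3, 4, 5, 6), (10, 20, 30, 40, 50, 60))
    write(io, Int16(a), Int16(b), Int64(c), Int16(d), Int16(e), UInt32(f))
end
seekstart(io)
vms = read!(io, Vector{MyStruct}(undef, 2))
println(vms)
```

Since MyStruct is a type I own, adding these methods is not type piracy. The trade-off is the one shown in the benchmark above: the element-wise loop is considerably slower than a single unsafe_read for long vectors.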