File for mmap with an unknown number of structs

Consider a struct (for MWE)

struct Foo{T1, T2}
    id::T1
    typ::T2
    start::Date
    stop::Date
end

My algorithm works in two stages:

  1. I generate a large number of these with fixed T1 and T2 and want to write them out to a file. This pass is write-only and sequential (no need to look up previous values), but the number of records is not known in advance.
  2. I want to Mmap.mmap this file to a Vector{Foo{T1,T2}} and work on it; this pass is read-only, random access.

I have read the docstring for mmap, and considered using write, but apparently write does not support custom structs. Should I just write a method for Foo above, writing out the fields sequentially, e.g.

function write(io::IO, x::Foo)
    write(io, x.id, x.typ, x.start, x.stop)
end

or is there anything that requires special care for reading later with mmap? I am thinking about memory alignment especially, e.g. if T1 == Int and T2 == Int8.
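To make the alignment concern concrete, here is a minimal sketch that inspects the layout Julia picks for such a struct. It uses only `fieldoffset`, `fieldtype`, `fieldcount` and `sizeof`; the choice of `Int64` for T1 is an assumption made so that the result does not depend on the platform's `Int`. No output is shown because the exact offsets depend on the platform.

```julia
using Dates

struct Foo{T1, T2}
    id::T1
    typ::T2
    start::Date
    stop::Date
end

T = Foo{Int64, Int8}

# Byte offset of each field inside the struct, as laid out by Julia
offsets = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)]

# Sum of the field sizes alone, i.e. what fieldwise write would emit
packed = sum(sizeof(fieldtype(T, i)) for i in 1:fieldcount(T))

println("offsets = ", offsets)
println("packed  = ", packed, " bytes, sizeof(T) = ", sizeof(T), " bytes")
```

If `sizeof(T)` comes out larger than `packed`, the struct contains padding, and writing the fields one after another will not match what mmap expects for a Vector{T}.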

Update: of course write(io, ...) does not work in general, because of padding. I made it work with reinterpret and also unsafe_write, but advice on whether I am really doing the right thing would be appreciated. Self-contained MWE below.

# On Julia 0.7 and later, run `using Dates, Mmap` first (both were part of Base on 0.6)

struct Record{T1, T2}
    id::T1
    ix::T2
    date::Date
end

random_Date() = Date(rand(1900:2100), rand(1:12), rand(1:28))
random_record(T1, T2) = Record(rand(T1), rand(T2), random_Date())

T1 = Int16
T2 = Int8
path = "/tmp/test"

## random array
x = [random_record(T1, T2) for _ in 1:200]

############################################################
# write elementwise -- this DOES NOT WORK, because of padding
############################################################

@generated function write_struct(io, x)
    @assert isbits(x)
    if Base.isstructtype(x)
        Expr(:block, [:(write_struct(io, x.$fn)) for fn in fieldnames(x)]...)
    else
        :(write(io, x))
    end
end

io = open(path, "w")
for elt in x
    write_struct(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))

x == y                          # false, because write_struct emits 2 + 1 + 8 = 11 bytes
                                # per record, while the in-memory layout is padded:
sizeof(y[1])                    # 16

############################################################
# write with reinterpret -- this does work
############################################################

function write_reinterpret(io, x)
    @assert isbits(typeof(x))
    write(io, reinterpret(UInt8, [x]))
end

io = open(path, "w")
for elt in x
    write_reinterpret(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))
x == y                          # true

############################################################
# write with unsafe_write -- this DOES WORK
############################################################

function write_struct_unsafe(io, x::T) where {T}
    @assert isbits(T)
    unsafe_write(io, Ref(x), sizeof(x)) # am I using this the right way?
end

io = open(path, "w")
for elt in x
    write_struct_unsafe(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))
x == y                          # true

Related discussion:
https://github.com/JuliaLang/julia/issues/10140
https://github.com/JuliaLang/julia/issues/10354

Have you seen https://github.com/Keno/StructIO.jl? It handles packing and alignment calculations, and is used for e.g. inspection of big complex structs defined by various object file formats.

I may be misunderstanding, but my impression is that StructIO.jl solves a different problem: packing to a well-defined, system-independent format, whereas I need a format that I can mmap directly.

My current working hypothesis is that for isbits types, Julia follows the padding conventions of C, and unsafe_write works with that, so I should be OK, especially if I use sizeof since that takes padding into account.
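Under that hypothesis, the "unknown number of structs" part can be handled by recovering the count from the file size. The following is only a sketch of the two stages, not a vetted implementation: the helper names `write_records` and `mmap_records` are made up for illustration, the record contents are arbitrary, and the file is only valid on the machine and Julia version that wrote it.

```julia
using Dates, Mmap

struct Record{T1, T2}
    id::T1
    ix::T2
    date::Date
end

const R = Record{Int16, Int8}

# Stage 1 (hypothetical helper): append records one at a time.
# Each call writes sizeof(R) bytes, padding included.
function write_records(path, n)
    open(path, "w") do io
        for i in 1:n
            rec = R(Int16(i), Int8(i % 100), Date(2000, 1, 1) + Day(i))
            unsafe_write(io, Ref(rec), sizeof(rec))
        end
    end
end

# Stage 2 (hypothetical helper): derive the record count from the
# file size, then map the file read-only as a Vector{R}.
function mmap_records(path)
    nbytes = filesize(path)
    @assert nbytes % sizeof(R) == 0 "file size is not a multiple of sizeof(R)"
    n = nbytes ÷ sizeof(R)
    io = open(path, "r")
    y = Mmap.mmap(io, Vector{R}, (n,))
    close(io)   # the mapping stays valid after the stream is closed
    return y
end

path = tempname()
write_records(path, 200)
y = mmap_records(path)
```

The divisibility check is a cheap guard against a truncated or fieldwise-written file; it does not detect a file written with a different T1 or T2 of the same total size.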

Nevertheless, it would be very helpful if someone experienced with mmap could look at my code and tell me if I am doing something bogus. I have no prior experience with mmap.

Yes.

Yes. Julia structs are ABI compatible with C, except when there’s a bug (currently the only ones I’m aware of are (SIMD) vectors and Int128 on x86).

Thanks! Related: how to use Mmap.Anonymous? This gives an error:

julia> Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,))
ERROR: MethodError: no method matching position(::Base.Mmap.Anonymous)
Closest candidates are:
  position(::Base.Filesystem.File) at filesystem.jl:203
  position(::Base.Libc.FILE) at libc.jl:91
  position(::IOStream) at iostream.jl:69
  ...
Stacktrace:
 [1] mmap(::Base.Mmap.Anonymous, ::Type{Array{UInt8,1}}, ::Tuple{Int64}) at ./mmap.jl:102

Is this a bug? I tried looking for examples in the tests, but could not find one.

Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,), 0) works. The default value for this argument (offset) seems to be problematic and it does seem like a bug.
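As a small sketch of that workaround, assuming Julia 0.7 or later where Mmap is a standard library: passing the offset explicitly avoids the call to position. The second form, which takes only a type and dimensions, is the documented shorthand for an anonymous mapping in recent versions; I have not checked it on 0.6.

```julia
using Mmap

# Anonymous mapping with an explicit offset of 0, as suggested above
a = Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,), 0)
a .= 0x01          # the memory is writable like any other Vector{UInt8}

# Shorthand that creates an anonymous mapping without an IO argument
b = Mmap.mmap(Vector{UInt8}, 10)
b[1] = 0xff
```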