File for mmap with an unknown number of structs

question

#1

Consider a struct (for MWE)

struct Foo{T1, T2}
    id::T1
    typ::T2
    start::Date
    stop::Date
end

My algorithm works in two stages:

  1. I generate a large number of these with fixed T1 and T2 and want to write them out to a file (this pass is write-only and sequential, with no need to look up previous values), but the number of elements is not known in advance,
  2. I want to Mmap.mmap this file to a Vector{Foo{T1,T2}} and work on it; this pass is read-only, random access.

I have read the docstring for mmap, and considered using write, but apparently write does not support custom structs. Should I just write a method for Foo above, writing out the fields sequentially, eg

function Base.write(io::IO, x::Foo)   # qualify with Base so this extends Base.write
    write(io, x.id, x.typ, x.start, x.stop)
end

or is there anything that requires special care for reading later with mmap? I am thinking about memory alignment especially, eg if T1 == Int and T2 == Int8.
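
For what it is worth, the layout in question can be inspected directly. A minimal sketch, assuming a recent Julia (1.1 or later, for fieldtypes) and using Int64 in place of Int so the result does not depend on the word size; the names T, packed, and offsets are only for illustration:

```julia
using Dates

struct Foo{T1, T2}
    id::T1
    typ::T2
    start::Date
    stop::Date
end

T = Foo{Int64, Int8}

# Bytes produced if the fields were written back to back, with no padding
packed = sum(sizeof, fieldtypes(T))

# Byte offset of each field inside the in-memory struct
offsets = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)]

println("packed = ", packed, ", sizeof = ", sizeof(T), ", offsets = ", offsets)
```

If sizeof(T) is larger than packed, the struct contains padding, and a file written field by field will not line up with a Vector{T} mapped over it. On x86-64 I would expect 25 packed bytes against a sizeof of 32 here.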


#2

Update: of course write(io, ...) does not work in general, because of padding. I made it work with reinterpret and also unsafe_write, but advice on whether I am really doing the right thing would be appreciated. Self-contained MWE below.

struct Record{T1, T2}
    id::T1
    ix::T2
    date::Date
end

random_Date() = Date(rand(1900:2100), rand(1:12), rand(1:28))
random_record(T1, T2) = Record(rand(T1), rand(T2), random_Date())

T1 = Int16
T2 = Int8
path = "/tmp/test"

## random array
x = [random_record(T1, T2) for _ in 1:200]

############################################################
# write elementwise -- this DOES NOT WORK, because of padding
############################################################

@generated function write_struct(io, x)
    @assert isbits(x)
    if Base.isstructtype(x)
        Expr(:block, [:(write_struct(io, x.$fn)) for fn in fieldnames(x)]...)
    else
        :(write(io, x))
    end
end

io = open(path, "w")
for elt in x
    write_struct(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))

x == y                          # false, because of:
sizeof(T1) + sizeof(T2) + sizeof(Date)  # 11 bytes written per element
sizeof(eltype(y))               # 16 bytes expected per element (padding)

############################################################
# write with reinterpret -- this does work
############################################################

function write_reinterpret(io, x)
    @assert isbits(typeof(x))
    write(io, reinterpret(UInt8, [x]))
end

io = open(path, "w")
for elt in x
    write_reinterpret(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))
x == y                          # true

############################################################
# write with unsafe_write -- this DOES WORK
############################################################

function write_struct_unsafe(io, x::T) where {T}
    @assert isbits(T)
    unsafe_write(io, Ref(x), sizeof(x)) # am I using this the right way?
end

io = open(path, "w")
for elt in x
    write_struct_unsafe(io, elt)
end
close(io)

## read
io = open(path, "r")
y = Mmap.mmap(io, Vector{Record{T1, T2}}, (length(x),))
x == y                          # true
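
Since the number of records is not known in advance, the element count for the second pass can be recovered from the file size. A minimal sketch, assuming a recent Julia (where Dates and Mmap are standard libraries), a file that contains nothing but records of one concrete isbits type, and a reader running on the same machine that wrote the file; R, records, and n_records are names made up for this example:

```julia
using Dates, Mmap

struct Record{T1, T2}
    id::T1
    ix::T2
    date::Date
end

R = Record{Int16, Int8}
@assert isbitstype(R)

# Pass 1: sequential, write-only; the writer does not record how many elements it emits
path, io = mktemp()
records = [R(rand(Int16), rand(Int8), Date(2000, 1, 1) + Day(i)) for i in 1:rand(100:200)]
for r in records
    unsafe_write(io, Ref(r), sizeof(r))   # raw in-memory bytes, padding included
end
close(io)

# Pass 2: recover the count from the file size, then map read-only
n_records = div(filesize(path), sizeof(R))
y = open(path, "r") do f
    Mmap.mmap(f, Vector{R}, (n_records,))
end

println(n_records == length(records) && y == records)
```

The division is exact only because every element occupies sizeof(R) bytes in the file; a header or a partially written last record would need separate handling.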

Related discussion: https://github.com/JuliaLang/julia/issues/10354


#3

Have you seen https://github.com/Keno/StructIO.jl? It handles packing and alignment calculations, and is used for e.g. inspection of big complex structs defined by various object file formats.


#4

I may misunderstand, but my impression is that StructIO.jl solves a different problem: packing to a well-defined, system-independent format. Whereas I need a format that I can mmap to directly.

My current working hypothesis is that for isbits types, Julia follows the padding conventions of C, and unsafe_write works with that, so I should be OK, especially if I use sizeof since that takes padding into account.

Nevertheless, it would be very helpful if someone experienced with mmap could look at my code and tell me if I am doing something bogus. I have no prior experience with mmap.


#5

Yes.


#6

Yes. Julia structs are ABI compatible with C, except when there’s a bug (currently the only ones I’m aware of are (SIMD) vectors and Int128 on x86).


#7

Thanks! Related: how to use Mmap.Anonymous? This gives an error:

julia> Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,))
ERROR: MethodError: no method matching position(::Base.Mmap.Anonymous)
Closest candidates are:
  position(::Base.Filesystem.File) at filesystem.jl:203
  position(::Base.Libc.FILE) at libc.jl:91
  position(::IOStream) at iostream.jl:69
  ...
Stacktrace:
 [1] mmap(::Base.Mmap.Anonymous, ::Type{Array{UInt8,1}}, ::Tuple{Int64}) at ./mmap.jl:102

Is this a bug? I tried looking for examples in the tests, but could not find one.


#8

Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,), 0) works. The default value for this argument (offset) seems to be problematic and it does seem like a bug.
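
For reference, a minimal sketch of the workaround, assuming a recent Julia where Mmap is a standard library. The second call uses the method that takes only an array type and dimensions, which as far as I can tell creates an anonymous mapping without needing an IO argument; the names a and b are just for illustration:

```julia
using Mmap

# Form from this reply: pass the offset (0) explicitly
a = Mmap.mmap(Mmap.Anonymous(), Vector{UInt8}, (10,), 0)

# Type-and-dimensions form, no IO argument
b = Mmap.mmap(Vector{UInt8}, (10,))

# Both behave like ordinary writable arrays
a[1] = 0x2a
b .= 0x01
println(length(a), " ", Int(a[1]), " ", Int(sum(b)))
```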