Reinterpret Int64 as 2xInt32 struct

Hey all :slight_smile:

It seems easier to work with native types than with structs when writing to a file (that is later memory-mapped). Since my struct has two Int32 fields, I thought of merging them into an Int64 (by bit-shifting), writing that to a file, and then mmapping the file and using reinterpret. Like so:

function pack(numb1::Int32, numb2::Int32)
    # Low bits  = numb1
    # High bits = numb2
    return Int64(numb1) << 32 | numb2
end

struct S
    numb2::Int32
    numb1::Int32
end

packed = pack(Int32(100), Int32(10))
println(packed) # --> 429496729610
s = reinterpret(S, [packed]) # --> [S(10, 100)] on a little-endian machine, i.e. numb2 = 10, numb1 = 100

I wonder two things:

  • It seems that the high bits are interpreted first and then the low bits - is this correct? Int64 here holds numb1 in the low bits and when passed to S using reinterpret it is read second (not first) hence I have numb2 first in my struct.

  • Is this an okay practice for this situation or am I missing something that can go wrong here?
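One pitfall with the bit-shifting approach that is independent of byte order: when the second number is negative, `|` first promotes it to Int64 with sign extension, which overwrites the upper 32 bits. A minimal sketch of the problem and one possible fix follows; the names `pack_naive`, `pack_masked` and `unpack` are made up for this illustration.

```julia
# Same expression as in the question: the Int32 `b` is sign-extended to Int64.
pack_naive(a::Int32, b::Int32) = (Int64(a) << 32) | b

# Truncating `b` to UInt32 first keeps only its low 32 bits (no sign extension).
pack_masked(a::Int32, b::Int32) = (Int64(a) << 32) | (b % UInt32)

# Inverse of pack_masked: arithmetic shift for the high half, truncation for the low half.
unpack(p::Int64) = (Int32(p >> 32), p % Int32)

println(pack_naive(Int32(100), Int32(-1)))           # the 100 is lost
println(unpack(pack_masked(Int32(100), Int32(-1))))  # both values survive
```

Shifts and masks work on values rather than on bytes in memory, so `unpack` gives the same answer on any machine; only `reinterpret` and raw file contents depend on endianness.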

Since your struct is isbits, you can mmap it directly:

julia> isbitstype(S)
true

help?> mmap
[...]
  The type is an Array{T,N} with a bits-type element of T and dimension N that determines how the bytes
  of the array are interpreted. Note that the file must be stored in binary format, and no format
  conversions are possible (this is a limitation of operating systems, not Julia).

It depends on the endianness of your machine.
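To make the byte-order dependence concrete, here is a small sketch. It assumes a struct with two Int32 fields and no padding; `Pair32` is a hypothetical name used only here. `ENDIAN_BOM` is the Base constant that identifies the host byte order.

```julia
# Hypothetical struct, used only for this illustration.
struct Pair32
    a::Int32  # occupies the first 4 bytes in memory
    b::Int32  # occupies the next 4 bytes
end

# 100 in the high 32 bits, 10 in the low 32 bits.
packed = (Int64(100) << 32) | Int64(10)

# ENDIAN_BOM is 0x04030201 on little-endian hosts and 0x01020304 on big-endian hosts.
is_little_endian = ENDIAN_BOM == 0x04030201

p = only(reinterpret(Pair32, [packed]))

# Little-endian: the low 32 bits come first in memory, so they land in field `a`.
# Big-endian: the high 32 bits come first instead.
expected = is_little_endian ? Pair32(10, 100) : Pair32(100, 10)
println(p == expected)
```

So on a little-endian machine the first struct field receives the low half of the Int64, not the high half.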

Direct reinterpreting of immutable isbits structs is allowed:

julia> reinterpret(Int64, [S(100, 10)]) |> only |> bitstring
"0000000000000000000000000000101000000000000000000000000001100100"

julia> 429496729610 |> bitstring
"0000000000000000000000000110010000000000000000000000000000001010"

Endianness can mess up your code, which is why the docs for Mmap.mmap warn about portability:

help?> mmap
[...]
  A more portable file would need to encode the word size – 32 bit or 64 bit – and endianness
  information in the header. In practice, consider encoding binary data using standard formats like HDF5
  (which can be used with memory-mapping).
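As a rough idea of what "encoding endianness in the header" could look like, the sketch below writes `ENDIAN_BOM` at the start of the data and compares it on read. The helper names are hypothetical and this is not something Mmap provides; an `IOBuffer` stands in for the file.

```julia
# Hypothetical helpers for illustration; not part of Mmap.
# ENDIAN_BOM is a UInt32, written here in the host's native byte order.
write_header(io::IO) = write(io, ENDIAN_BOM)

# On a host with a different byte order the 4 bytes read back would be
# byte-swapped and would no longer equal that host's ENDIAN_BOM.
header_matches_host(io::IO) = read(io, UInt32) == ENDIAN_BOM

io = IOBuffer()          # stand-in for a real file
write_header(io)
seekstart(io)
ok = header_matches_host(io)
println(ok)
```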

In practice, how (and even if) you want to store your data on disks depends on your application and what you need to do with it later on. For simple data like yours, I’d recommend against custom serialization schemes.

Thanks for the endianness remark! I didn’t think about it.

In practice, how (and even if) you want to store your data on disks depends on your application and what you need to do with it later on.

In my case, I have to index a file, for which I need to record over 100 billion pairs of 32-bit numbers. I later want to mmap them and view slices of them. I thought of mmapping a 2 x N Int32 array (“matrix”) like:

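using Mmap

handle = open("test.bin", "w+")
# Int32[...] is needed here: a plain [10, 11] is a Vector{Int64} and writes 8 bytes per number
write(handle, Int32[10, 11])
write(handle, Int32[15, 16])
close(handle)
m = Mmap.mmap(open("test.bin", "r+"), Matrix{Int32}, (2,2))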

However, slicing then seems expensive to me, e.g. m[:,1:2]… but maybe that’s just in my head because the syntax is more complex than m[1:2].

So I don’t really care how I store the data: structs, multiple numbers, a single number. What matters is that I can efficiently retrieve 2 x Int32 from a Mmap.

Would you say a 2 x Int32 would be a better option?

Slicing in Julia creates a copy. If you use @view instead, it will be cheap, though that is unnecessary if you can just as well mmap your struct directly. No slicing/matrix operation needed.
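A quick way to see the copy/view difference, using a small in-memory matrix as a stand-in for the mmapped one:

```julia
# Two pairs stored column-wise, as in the matrix layout from the question.
m = Int32[10 15; 11 16]

copy_slice = m[:, 1:2]        # allocates a new 2x2 array
view_slice = @view m[:, 1:2]  # no copy, refers to the memory of `m`

m[1, 1] = 99

println(copy_slice[1, 1])  # still the old value
println(view_slice[1, 1])  # sees the update
```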

If you always access these numbers as pairs through their column, just directly mmapping your existing S structs as a single Vector seems perfectly fine and preferable to me.

Aaah I see, I wasn’t sure how you meant to write and read them. Is this what you mean:

using Mmap

struct S
    numb2::Int32
    numb1::Int32
end

x = S(100, 10)
y = S(200, 15)

# Writing
handle = open("test.bin", "w+")
write(handle, reinterpret(Int64, [x]))
write(handle, reinterpret(Int64, [y]))
close(handle)

# Reading
m = Mmap.mmap(open("test.bin", "r+"), Vector{Int64}, 2)
reinterpret(S, @view m[1:2])

Aaah I could directly do m = Mmap.mmap(open("test.bin", "r+"), Vector{S}, 2)
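For completeness, the Int64 detour is not needed on the writing side either, as far as I can tell: `write` accepts an `Array` whose element type is isbits and dumps its raw bytes. A minimal sketch, with `Rec` as a hypothetical stand-in for the struct and a temporary file instead of test.bin:

```julia
using Mmap

# Hypothetical stand-in for the struct S from the thread.
struct Rec
    a::Int32
    b::Int32
end

path, io = mktemp()
# An Array with an isbits element type is written as raw bytes.
write(io, [Rec(100, 10), Rec(200, 15)])
close(io)

# Map the same bytes back as a Vector{Rec}; no reinterpret involved.
io = open(path, "r")
arr = Mmap.mmap(io, Vector{Rec}, 2)
close(io)

println(arr)
```

The file is in the machine's native byte order, so the endianness caveat above still applies when moving it between machines.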

I was thinking along these lines:

julia> using Mmap


julia> struct S
           numb2::Int32
           numb1::Int32
       end

julia> arr = mmap("file.data", Vector{S}, 10)
10-element Vector{S}:
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)
 S(0, 0)

julia> for x in 1:10
         arr[x] = S(x,x)
       end

julia> arr
10-element Vector{S}:
 S(1, 1)
 S(2, 2)
 S(3, 3)
 S(4, 4)
 S(5, 5)
 S(6, 6)
 S(7, 7)
 S(8, 8)
 S(9, 9)
 S(10, 10)

shell> xxd file.data
00000000: 0100 0000 0100 0000 0200 0000 0200 0000  ................
00000010: 0300 0000 0300 0000 0400 0000 0400 0000  ................
00000020: 0500 0000 0500 0000 0600 0000 0600 0000  ................
00000030: 0700 0000 0700 0000 0800 0000 0800 0000  ................
00000040: 0900 0000 0900 0000 0a00 0000 0a00 0000  ................

As you can see, all our data is written out to the file. Restarting Julia and opening the file with mmap again:

$ julia -q
julia> struct S
           numb2::Int32
           numb1::Int32
       end

julia> using Mmap

julia> arr = mmap("file.data", Vector{S}, 10)
10-element Vector{S}:
 S(1, 1)
 S(2, 2)
 S(3, 3)
 S(4, 4)
 S(5, 5)
 S(6, 6)
 S(7, 7)
 S(8, 8)
 S(9, 9)
 S(10, 10)

Aaah yeah, I forgot to mention that I do not know the number of elements beforehand. I read a 7TB file and filter specific lines.

Do you mean only part of that file is your matrix data, or do you mean you only care about a subset of all of the data? Is that subset contiguous?

In a sense both haha:

  • I parse info from around 50% of the lines in the 7TB file (and write them to the Mmap/file as structs)
  • I then map contiguous parts from the resulting Mmap to process them further
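When the number of elements is not known up front, one option is to append records with plain `write` while filtering and only mmap afterwards, deriving the element count from the file size. A hedged sketch of that workflow: `Rec` is a hypothetical stand-in for the struct, the loop stands in for the real line filter, and a temporary file replaces the real output file.

```julia
using Mmap

# Hypothetical stand-in for the struct S from the thread.
struct Rec
    a::Int32
    b::Int32
end

path, io = mktemp()
for i in 1:5                       # stand-in for "lines that pass the filter"
    write(io, Ref(Rec(i, 2i)))     # Ref(...) lets write dump the raw bytes of one record
end
close(io)

# The element count follows from the file size.
n = filesize(path) ÷ sizeof(Rec)

io = open(path, "r")
arr = Mmap.mmap(io, Vector{Rec}, n)
close(io)

# A contiguous part of the mapping, without copying.
chunk = @view arr[2:4]
println(n, " records, chunk = ", chunk)
```

For a file of the size discussed here, buffering many records and writing them as one array is likely faster than one `write` per record, but that is an assumption to benchmark rather than a given.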