What we need to do IO in Julia with guaranteed memory safety

(I was going to post this as an issue on the Julia github, but I still feel I’m a little naive in this topic so I wanted to open it up to discussion here, see what people think, then post a real issue. I know there’s a little bit about this stuff in existing issues, but it seemed important to make sure everything was explicit and in one place, and I didn’t see an issue like that.)

I’ve just gotten through a fairly big IO project in which large amounts of data has to be loaded from and written from Vector{UInt8} buffers efficiently. Early on, I had hoped that the new reinterpret function would make it possible for me to do this without the use of Ptrs, but I soon came to realize that this does not seem like it will be possible any time soon. I also fell victim to the tribulations of using pointers to wrap arrays in a language with garbage collection (can be very counter-intuitive if you’re coming from C or C++), so I therefore emphatically agree that ideally we do not want users to make use of pointers. The more of this kind of code is around the worse it will be for the stability of the Julia ecosystem.

Here are the things which I believe have to be done to get there, and a few suggestions.

reinterpret must be a no-op (including indexing)

(except possibly for some simple bounds checking) Since IO is performance critical it is absolutely crucial that reinterpret or its future equivalent be a no-op, and that indexing the resulting array must be an absolutely minimal operation. I’ve looked at the code for reinterpret and improving it certainly looks daunting. I have to confess that I don’t fully understand everything that’s being done there.

Suggestions:

  • From looking at reinterpret it seems possible that we will need a dedicated function for certain alignments (i.e. wherever a no-op is possible and safe). Ideally that’s not the case, but I think we should be open to the possibility.
  • It’s ok if non-scalar indexing a ReinterpretArray returns new copied Arrays. After all, indexing of Arrays returns copies not views. This is not the current behavior of ReinterpretArray, right now indexing returns another ReinterpretArray which is essentially a view (the inconsistency with Array is confusing to me, I don’t see why this was done). This should be simple to achieve by an unsafe_wrap followed by a copyto! of a newly constructed array (it has zero overhead, I’ve tested it). You can always have view return either a ReinterpretArray or a SubArray. Iterating over either of these should be a no-op for each scalar index.

need to be able to reinterpret from arrays of smaller elements to scalars of a larger data type as a no-op

Granted, if ReinterpretArray works perfectly this wouldn’t be strictly necessary but it’s something we definitely should have. Without it, packages must be carefully designed around it, and that does not seem ideal. Right now if you try to do this you instead get a 1-element array, the creation of which comes with significant overhead.

users need a clear understanding of what happens for non-Array AbstractArray types

It should go without saying that the aforementioned no-ops are only possible in cases where data is actually adjacent. If that’s not the case, one would have to resort to some alternative, perhaps copying code into an IOBuffer. I think we may need a version of reinterpret that throws an error if you attempt to use it on some array where data is not adjacent. I think the possibility of this occurring is the primary reason for the current complexity of ReinterpretArray. At the moment things seem a little precarious: for example if someone were to create an IO buffer with a custom array type (e.g. MMapArray) a user would have to go in and figure out how to actually unsafe_wrap the data from raw Arrays and this might be a little scary if you don’t know the layout. (I don’t know of any reason why you’d actually need some sort of MMapArray but I’ve been paranoid about something like that happening.)


Ok, am I missing any major points here? Did I say anything idiotic? Please share your thoughts if you have any and I’ll post an issue. Perhaps if consensus coalesces quickly I can start work on a PR. Thanks all!

5 Likes

I have no suggestions to offer regarding a redesign (or even whether it is needed), but I really hope that once v0.7 is released, the core developers working on this will have time for more documentation and tutorials on how to do bit-twiddling performantly.

Currently, I am sometimes confused if something is not optimized because it is not how I should do things, or simply WIP.

3 Likes

Well, I guess one of my motivations for opening this thread is to ask whether the kind of usage I’m talking about above is really that crucial or common. It’s really most vital in cases where you want to do random access on very large buffers such as with MMap. (Though I’m still very confused about why the performance seems to matter much… any CPU time should be overwhelmed by actual hardware IO time, but in casual observation it still seemed to matter. I suppose this will come a bigger deal in the future as we see more of things like NVMe and Optane. I really should do a comprehensive benchmark of memory mapped data…)

I would think that IPC would be the most performance critical case, and I’m imagining that the devs might say that we should use read. However, I would think that even in those cases reinterpret and the like would open up a lot of design options and you wouldn’t want to be limited to read because it requires being extremely selective about when to copy data.

It might be helpful if you give some examples. I’m pretty new to this stuff so, like I said above, not too sure if what I’m talking about is a universal requirement.

In my experience, SSD utilization for some architectures shows up as CPU in eg top. YMMV.

Would love to, but I need to sit down with v0.7 again at some point since many of my benchmarks got an amazing speedup, while others got a horrible slowdown (I guess not because of core language issues, but because deprecation warnings?). I plan to re-benchmark when it is frozen and packages catch up a bit.

1 Like

I have a feeling from @jameson’s comments in the other thread that I’m drastically underestimating the utility of IO objects. I think they may already do a lot of what I’m describing here. Will look into this more thoroughly and post back here.

The first potential worry that I’m seeing is that read essentially has @noinline, so that will limit things…

One thing is that reinterpret has to be aware of alignment and I think the redesign was chiefly meant to make it a safe and valid operation.

If you know that the alignment of your data is correct (maybe you have written the data from Julia) you could just use Mmap.mmap("filename", Vector{MyType}) instead of going through a Vector{UInt8} and then using reinterpret on that.

2 Likes

It’s a little more complicated than that because it’s not just a single Vector{MyType}, it’s lots of vectors containing lots of different types aranged in a non-trivial format, as well as a bunch of FlatBuffer structs.

I think IO and Array are interesting non-orthogonal APIs. This is sort of the same idea as using mmap when the file is known to be constant size and fit in memory. The IOBuffer constructor is similar to that, and can help overlay complex zero-copy formats on any byte-indexed array:

julia> typeof(IOBuffer(view(UInt8[], :)))
Base.GenericIOBuffer{SubArray{UInt8,1,Array{UInt8,1},Tuple{Base.Slice{Base.OneTo{Int64}}},true}}

But then, unstructured formats are not enjoyable to work with, so you might wrap this again in an access-api. But then, that could just contain another flat byte buffer, and keep nesting further.

If you are interested in what I ultimately wound up doing see my repo. As I’ve indicated, I’m going to investigate how realistic it would have been for me to replace my Vector{UInt8} coming from mmap with an IOBuffer wrapped around the mmap array using only read and skip. I’m expecting that it will have significantly higher overhead than what I did in some cases, but if I’m wrong than IO may indeed have everything I was looking for.

Would a pointer-esque api be ok if it were memory-safe? Something like:

struct SafePointer{T}
  base::Ptr{Void} # start of the region
  len::UInt64 # length of the region
  offset::UInt64 # offset to the T

  function SafePointer{T}(base::Ptr{Void}, len::UInt64, offset::UInt64}
    @assert 0 <= offset <= len - sizeof(T)
    new(base, len, offset)
  end
end

function +(sp::SafePointer{T}, i::UInt64) where {T}
  SafePointer(sp.base, sp.len, offset+i)
end

We use something similar for implementing disk-backed betrees. It handles most of the pointer arithmetic and does bounds checks:

julia> struct Foo
       x::Int64
       y::PagedVector{Float32}
       end

julia> paged = Paged{Foo}(p)
Pageds.Paged{Foo}(Ptr{Void} @0x00000000032d9fe0)

julia> @v paged.y
3-element Pageds.PagedVector{Float32}:
 0.0
 0.0
 0.0

julia> @v paged.y[2] 
0.0f0

julia> @v paged.y[4]
ERROR: BoundsError: attempt to access 3-element Pageds.PagedVector{Float32} at index [4]

julia> y2 = @a paged.y[2] # internal pointer
Pageds.Paged{Float32}(Ptr{Void} @0x00000000032d9ffc)

julia> @v y2
0.0f0

julia> @v y2 = 7
Ptr{Float32} @0x00000000032d9ffc

julia> @v paged.y
3-element Pageds.PagedVector{Float32}:
 0.0
 7.0
 0.0

It all boils down to pointer operations which optimize really well:

julia> f(paged, x) = @v paged.y[2] = x

julia> @code_lowered f(paged, 3)
CodeInfo(:(begin 
        nothing
        return (Pageds.unsafe_store!)((Pageds.get_address)((Pageds.get_address)(paged, Val{:y}), 2), x)
    end))

julia> @code_warntype f(paged, 3)
Variables:
  #self# <optimized out>
  paged::Pageds.Paged{Foo}
  x::Int64

Body:
  begin 
      $(Expr(:inbounds, false))
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl get_address 95
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl # line 99:
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl Type 18
      goto 7
      7: 
      # meta: pop location
      SSAValue(0) = $(Expr(:new, Pageds.Paged{Pageds.PagedVector{Float32}}, :((Base.bitcast)(Ptr{Void}, (Base.add_int)((Base.bitcast)(UInt64, (Core.getfield)(paged, :ptr)::Ptr{Void}), 0x0000000000000008)::UInt64))))
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      SSAValue(1) = $(Expr(:invoke, MethodInstance for get_address(::Pageds.Paged{Pageds.PagedVector{Float32}}, ::Int64), :(Pageds.get_address), SSAValue(0), 2))
      $(Expr(:inbounds, false))
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl unsafe_store! 142
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl unsafe_store! 122
      # meta: location /home/jamie/raicode/src/Pageds/Pageds.jl # line 126:
      SSAValue(2) = (Base.pointerset)((Base.bitcast)(Ptr{Float32}, (Core.getfield)(SSAValue(1), :ptr)::Ptr{Void}), (Base.sitofp)(Float32, x::Int64)::Float32, 1, 1)::Ptr{Float32}
      # meta: pop location
      # meta: pop location
      # meta: pop location
      $(Expr(:inbounds, :pop))
      return SSAValue(2)
  end::Ptr{Float32}

julia> @code_native f(paged, 3)
	.text
Filename: REPL[27]
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%rbx
	pushq	%rax
	movq	%rsi, %rbx
Source line: 99
	movq	(%rdi), %rax
	addq	$8, %rax
	movq	%rax, -16(%rbp)
Source line: 1
	movabsq	$get_address, %rax
	leaq	-16(%rbp), %rdi
	movl	$2, %esi
	callq	*%rax
Source line: 126
	xorps	%xmm0, %xmm0
	cvtsi2ssq	%rbx, %xmm0
	movss	%xmm0, (%rax)
Source line: 1
	addq	$8, %rsp
	popq	%rbx
	popq	%rbp
	retq
	nopl	(%rax)

That would probably work fine, but I can’t help but feel that it would be unfortunate to have to resort to any pointer type at all in a garbage collected language that ideally wouldn’t have you bothering with such things. It would be much nicer to work with some other AbstractArray or even IO type, in my opinion. Not that that’s necessarily a valid reason not to use wrapped pointers.

As I’ve said, I don’t actually know if the use case I’ve described is nearly as universal as I think it is. I’d have to read your paper thoroughly to understand whether the use cases are similar enough that it would make sense to ask for the same solution.

This little bit of code tested as being about the same as reinterpret for single values, and it probably should reuse the IOBuffer yet. May require a seek(b, 0) every hundred(?) or so calls to prevent the buffer growing indefinitely when reusing, and UInt8/64 read specifics - number of bytes/elements to read, as the eof won’t be as reliable

function mwsr(data, getFloat=false) ##Mark /Write /Seek /Read
  b = IOBuffer();
  mark(b);
  write(b, data);
  reset(b);
  if(getFloat) return read(b, Float64); end;
  return read(b); ##UInt8's
  end;

It’s still a little unclear how/why arrow files need to be reinterpreted, but as long as a parse/convert definitely isn’t sufficient, it’s probably fine