Buffers are People Too: IO Package Developers, Please Consider In-Memory Operations

I wanted to bring the following to the attention of anyone who might be developing any IO-related package.

Today, with “microservices” and Docker containers being common, IO often takes place over a network connection rather than through the file system. As a specific but important example, files stored in AWS S3 are (exclusively, as far as I know) fetched over HTTP. Retrieving such a file usually means fetching it into memory, without ever needing to touch the file system. In some cases, particularly if your program is running in a Docker container, you may not have any disk space allocated and may want to avoid the file system altogether. Supporting buffers and IO streams also has other advantages, such as making it easier to support memory mapping (currently Mmap.mmap returns a Vector{UInt8}) and different types of streams.

For this reason, it’s important when developing IO packages to support in-memory buffers on equal footing with files, both in code and in documentation. I recommend that any function that accepts a file name as an argument also have methods that accept the following:

  • An AbstractVector{UInt8} (or at least Vector{UInt8}).
  • An IO object.
  • A filename.

Supporting these cases will give your IO package much wider applicability. I hope you’ll at least keep in mind the possibility of supporting them, thanks!
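
To make the recommendation concrete, here is a minimal sketch of the three methods. The function name `count_newlines` is hypothetical and not taken from any real package; the idea is only that the real work lives in the IO method and the other two methods delegate to it.

```julia
# Hypothetical example: `count_newlines` is not from any real package.

# Core implementation: works on any readable IO stream.
function count_newlines(io::IO)
    n = 0
    while !eof(io)
        n += read(io, UInt8) == UInt8('\n')   # Bool adds as 0 or 1
    end
    return n
end

# In-memory buffer: wrap the bytes in an IOBuffer and reuse the IO method.
count_newlines(data::AbstractVector{UInt8}) = count_newlines(IOBuffer(data))

# File name: open the file, run the IO method, and close it afterwards.
count_newlines(path::AbstractString) = open(count_newlines, path)
```

With this layout a caller who already has bytes in memory (say, the body of an HTTP response) never has to write them to disk first.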

(As you might have guessed, I may have a few PRs incoming to some IO packages in the coming weeks.)


Isn’t it sufficient to just support an IO object? If the caller has an array of bytes, they can always wrap it with IOBuffer(array).

If you somehow want to support random access you need a Vector{UInt8}. But yes, if that’s not an issue, you can just have a method that wraps it in IOBuffer.

seek?

But if an API developer is used to working with files, then they probably want the semantics of IO objects anyway.

In my experience, writing random access code using seek is incredibly inconvenient; one is usually much better off using a Vector{UInt8}. I also have not confirmed whether it’s possible to use seek in a performant way, particularly when using memory mapping.
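
A small sketch of the ergonomic difference, assuming the task is to read a little-endian UInt32 at a given 0-based byte offset. The helper name `u32_at` is made up for illustration, and this says nothing about performance either way.

```julia
# Hypothetical helper: `u32_at` is an illustrative name, not a real API.

# Stream version: move the stream position, then read.
# Note that this mutates the position of `io` as a side effect.
function u32_at(io::IO, offset::Integer)
    seek(io, offset)                 # seek takes a 0-based byte position
    return ltoh(read(io, UInt32))    # stored bytes are little-endian
end

# Vector version: plain indexing, no stream state to keep track of.
function u32_at(data::AbstractVector{UInt8}, offset::Integer)
    v = UInt32(0)
    for k in 0:3
        byte = data[firstindex(data) + offset + k]
        v |= UInt32(byte) << (8k)    # assemble least significant byte first
    end
    return v
end
```

Both methods return the same value for the same bytes; the difference is that the stream version has to manage a cursor, which gets awkward once reads jump around a lot.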

The most important case for supporting a Vector{UInt8} really is random access to memory mapped files.

My original recommendation was really just for convenience. Of course I know packages designed primarily for IO will usually not truly support random access into a Vector{UInt8} and packages designed primarily for random access to memory mapped files probably will not use IO (but will do open(read, file) or something like that somewhere).
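
For illustration, a minimal sketch of the two ways of getting a byte vector mentioned above, using a throwaway temporary file. Code that accepts an AbstractVector{UInt8} can be handed either result.

```julia
using Mmap

# Create a small temporary file to work with (illustration only).
path, io = mktemp()
write(io, UInt8[0x10, 0x20, 0x30, 0x40])
close(io)

# Memory-map the file: the result is a byte vector backed by the file,
# so indexing into it is random access without explicit reads.
mapped = Mmap.mmap(path)
third = mapped[3]

# Read the whole file into an ordinary in-memory Vector{UInt8}.
loaded = open(read, path)

# Generic byte-vector code works on both.
total_mapped = sum(Int, mapped)
total_loaded = sum(Int, loaded)
```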