Buffers are People Too: IO Package Developers, Please Consider In-Memory Operations

ExpandingMan · December 9, 2019, 3:33pm

I wanted to bring the following to the attention of anyone who might be developing any IO-related package.

Today, with “microservices” and Docker containers being common, IO often takes place over a network connection rather than in the file system. As a specific but important example, files stored in AWS S3 are (exclusively, as far as I know) fetched over HTTP. Retrieving such a file usually means that you will fetch it into memory without ever needing to use the file system. In some cases, particularly if your program is running in a Docker container, you may not have any disk space allocated and may want to avoid the file system altogether. Having methods available for buffers and IO streams also has other advantages, like making it easier to support memory mapping (currently Mmap.mmap returns a Vector{UInt8}) and different types of streams.

For this reason, it’s important in the development of IO packages to support in-memory buffers on equal footing with files, both in code and in documentation. I recommend that any function that accepts a file name as an argument have methods that accept the following:

An AbstractVector{UInt8} (or at least Vector{UInt8}).
An IO object.
A filename.
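
A minimal sketch of what those three methods might look like. The function name `count_lines` is hypothetical and only stands in for whatever a real package would parse; the point is that the byte-vector and filename methods can forward to a single IO method.

```julia
# Hypothetical reader: counts newline bytes in the input.
# `count_lines` is an illustrative name, not taken from any package.

# Core method: works on any IO stream.
function count_lines(io::IO)
    n = 0
    while !eof(io)
        read(io, UInt8) == UInt8('\n') && (n += 1)
    end
    return n
end

# In-memory bytes: wrap them in an IOBuffer and reuse the IO method.
count_lines(bytes::AbstractVector{UInt8}) = count_lines(IOBuffer(bytes))

# File name: open the file and reuse the IO method.
count_lines(path::AbstractString) = open(count_lines, path)
```

With this layout, a caller holding bytes fetched over HTTP can call `count_lines(bytes)` directly, without writing a temporary file first.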

Supporting these cases will give your IO package much wider applicability. I hope you’ll at least keep the possibility of supporting them in mind. Thanks!

(As you might have guessed, I may have a few PRs incoming to some IO packages in the coming weeks.)

stevengj · December 9, 2019, 3:59pm

Isn’t it sufficient to just support an IO object? If the caller has an array of bytes, they can always wrap it with IOBuffer(array).
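
For illustration, a small sketch of that wrapping. The sample bytes are made up and stand in for something like a downloaded payload.

```julia
# Made-up sample bytes standing in for data already held in memory.
bytes = Vector{UInt8}("hello\nworld\n")

# Wrap the existing array so it can be consumed through the IO interface.
io = IOBuffer(bytes)

first_line = readline(io)      # "hello"
rest = read(io, String)        # "world\n"
```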

ExpandingMan · December 9, 2019, 4:02pm

If you somehow want to support random access you need a Vector{UInt8}. But yes, if that’s not an issue, you can just have a method that wraps it in IOBuffer.

stevengj · December 9, 2019, 4:06pm

seek?

But if an API developer is used to working with files, then they probably want the semantics of IO objects anyway.
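
As a rough sketch of what random access through `seek` looks like on an in-memory buffer (the byte values are arbitrary):

```julia
io = IOBuffer(UInt8[0x10, 0x20, 0x30, 0x40, 0x50])

seek(io, 3)              # stream offsets are zero-based
b = read(io, UInt8)      # reads the fourth byte, 0x40
pos = position(io)       # the read advanced the position to 4

seekstart(io)            # rewind to the beginning
first_byte = read(io, UInt8)   # 0x10
```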

ExpandingMan · December 9, 2019, 4:10pm

In my experience, writing random access code using seek is incredibly inconvenient; one is usually much better off using Vector{UInt8}. I also have not confirmed whether it’s possible to use seek in a performant way, particularly when using memory mapping.

The most important case for supporting a Vector{UInt8} really is random access to memory-mapped files.
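
A minimal sketch of that use case, assuming a small scratch file created just for the example. The file contents are made up; the only library call relied on beyond Base is `Mmap.mmap`.

```julia
using Mmap

# Write a small scratch file to map (contents are made up for illustration).
path, io = mktemp()
write(io, UInt8[0x10, 0x20, 0x30, 0x40, 0x50])
close(io)

# Map the file; given only a path, Mmap.mmap returns a Vector{UInt8}.
data = Mmap.mmap(path)

fourth = data[4]           # direct one-based indexing, no stream position to manage
tail = view(data, 3:5)     # a slice of the mapping without copying
```

Compared with `seek`, the indexing here is one-based and stateless, which is a large part of why it tends to be more convenient for random access.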

My original recommendation was really just for convenience. Of course I know packages designed primarily for IO will usually not truly support random access into a Vector{UInt8} and packages designed primarily for random access to memory mapped files probably will not use IO (but will do open(read, file) or something like that somewhere).
