File IO Buffers too small?

I typically have the task of reading and processing large binary data files consisting of relatively small consecutive records (< 30 kB). If I read each record one after the other, the read performance lags far behind what the storage device is able to provide for larger batches, especially on external disk drives.
The following MWE shows this for a Samsung T7 SSD on Windows (Julia 1.8.2):

recordsize = 32000
totalsize = 10000 * recordsize

batch = 100

buffer1 = zeros(UInt8, recordsize)
buffer2 = zeros(UInt8, recordsize*batch)

function readbuffer(io::IOStream, buffer, n)
    for i in 1:n
        readbytes!(io, buffer)
    end
end

io = open("some very large file.dat")

readbuffer(io, buffer1, 1) # force compilation
@time readbuffer(io, buffer1, div(totalsize, recordsize))
@time readbuffer(io, buffer2, div(totalsize, recordsize*batch))
close(io)

(To reproduce this you have to make sure that the file is not already in the file cache!!! Second runs will only show your RAM speed…)
The reported times correspond to about 160 MB/s and 430 MB/s, respectively. The ratio gets much worse for rotating media.
My current solution is to allocate a large buffer and fill it with a separate thread. But the usage is quite cumbersome with all the extra synchronization handling and the code has to be specially adapted.
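Stripped of the application-specific parts, the idea looks roughly like this (a simplified sketch using a Channel of pre-read blocks instead of my actual double-buffering code; block size and channel depth are placeholder values):

# A spawned task reads large blocks ahead of the consumer and feeds them
# through a bounded Channel, so the disk stays busy while records are processed.
function prefetch_blocks(path; blocksize = 8 * 1024 * 1024, depth = 2)
    return Channel{Vector{UInt8}}(depth; spawn = true) do ch
        open(path) do io
            while !eof(io)
                put!(ch, read(io, blocksize))  # blocks once `depth` blocks are pending
            end
        end
    end
end

for block in prefetch_blocks("some very large file.dat")
    # split `block` into records and process them here
end
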
Wouldn’t it be better to make the buffer size for an IOStream configurable? At the moment it is quite small (32 kB if I remember correctly). Some time ago there was a suggestion on Stack Overflow (can we customize the file open buffer size in Julia - Stack Overflow), which looked not too complicated. But it is beyond my skills as a pure Julia user to judge whether this makes sense…
Maybe this is only a problem on Windows, but I can’t test on Linux, which perhaps does better read-ahead.

You can use BufferedStreams.jl (GitHub - JuliaIO/BufferedStreams.jl: Fast composable IO streams).
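Wrapping the file in a BufferedInputStream with a larger buffer should be close to a drop-in change (a minimal sketch; the 4 MB buffer size is an arbitrary choice):

using BufferedStreams

# reads are now served from a 4 MB userspace buffer that is refilled in large batches
io = BufferedInputStream(open("some very large file.dat"), 4 * 1024 * 1024)

record = zeros(UInt8, 32000)
readbytes!(io, record)
close(io)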

But indeed, I have also found the performance of IOStream to be unacceptable for many IO tasks where you typically read many small values. One issue with IOStream is that it takes a lock on every IO operation in order to be thread-safe.

For binary files, I typically resort to memory mapping the file and wrapping it in an IOBuffer.

So instead of

io = open("some very large file.dat")

I do

using Mmap
io = IOBuffer(Mmap.mmap("some very large file.dat"))

Thanks for the suggestions! I will try BufferedStreams.jl, but since it does no background read-ahead to fill the buffer it will not come close to the full disk speed, I guess.

Memory-mapping files several GB in size is unfortunately not really an option on my machine, and as far as I know there is no interface in Julia to partially drop the mapping for the already processed part.

For my use case with medium-sized data chunks (10-60 kB) the locking operation should not show up. I use a Julia Condition variable for synchronisation in my home-grown double-buffering approach and am able to get to full disk speed with an 8 MB buffer. This is of course different if you read chunks of only a few bytes at a time…

What’s the issue with mmapping a several GB large file?

If you read it from start to end, the whole file will be mapped into your process. Managing such a large memory map is not fast. You can’t have it all in physical RAM at the same time, so it depends on the operating system whether it reads ahead and drops the right pages for you…

It is typically offset by not having to copy things into a buffer.

Yes, but copying is quite fast. Even without CPU caches a recent PC achieves about 16 GB/s. So, for the 1.6 GB example below, the copy overhead is only about 0.1 s.

I extended the test code with the mmapped version, first through an IOBuffer and then touching the mapped array directly. Each time a new file is used so that the data is file-cache cold.

# largefile1 … largefile4 are paths to four distinct files of at least
# `totalsize` bytes each, so every run reads file-cache-cold data
recordsize = 16000
totalsize = 100000 * recordsize

batch = 100

buffer1 = zeros(UInt8, recordsize)
buffer2 = zeros(UInt8, recordsize*batch)

function readbuffer(io::IO, buffer, n)
    for i in 1:n
        readbytes!(io, buffer)
    end
end

io1 = open(largefile1)
readbuffer(io1, buffer1, 1)
print("Time to read $(div(totalsize, recordsize)) * $(div(recordsize, 1000)) kB: ")
@time readbuffer(io1, buffer1, div(totalsize, recordsize))
close(io1)

io2 = open(largefile2)
print("Time to read $(div(totalsize, recordsize*batch)) * $(div(recordsize*batch, 1000)) kB: ")
@time readbuffer(io2, buffer2, div(totalsize, recordsize*batch))
close(io2)

using Mmap
print("Time to mmap $(div(totalsize, 1000000)) MB: ")
@time iob = IOBuffer(Mmap.mmap(largefile3))
print("Time to read mmapped $(div(totalsize, recordsize)) * $(div(recordsize, 1000)) kB: ")
@time readbuffer(iob, buffer1, div(totalsize, recordsize))

map = Mmap.mmap(largefile4)

print("Time to touch every page: $(div(totalsize, recordsize)) * $(div(recordsize, 1000)) kB: ")
@time touchit = map[1:4096:totalsize]

nothing

The output is:

Time to read 100000 * 16 kB:   8.510265 seconds (1 allocation: 16 bytes)
Time to read 1000 * 1600 kB:   3.860538 seconds (2 allocations: 32 bytes)
Time to mmap 1600 MB:   0.007760 seconds (19 allocations: 1.539 KiB)
Time to read mmapped 100000 * 16 kB:   5.388585 seconds (12.27 k allocations: 694.661 KiB, 0.11% compilation time)
Time to touch every page: 100000 * 16 kB:   5.302174 seconds (8.66 k allocations: 810.503 KiB, 0.14% compilation time)

As you can see, the large-buffer version comes close to the device speed of the USB SSD. The record-at-a-time version is about 2.2 times slower and the mmapped version is in between. The times for only touching one byte of each page and for copying everything into a destination buffer are practically the same (memcpy is fast…)

As an IOStream has a richer interface than a plain IO (e.g. stat), an IOBuffer over Mmap is also not a drop-in solution for every case. I still think a configurable buffer size would be most helpful…

Not sure if this is applicable to your situation, but I have spent more time than I should have testing methods for reading and writing binary files (although mine are usually in the 100 MB - 4 GB range).
I also have to admit that I don’t quite understand how to use IOBuffers, so I was either reading into arrays directly or treating mmapped memory as an array (since that’s what my files contain).

From my experience, mmapping was the most convenient, and if you use an SSD you can actually make more than one read call at the same time, so mmap scales to some degree with e.g. threads.
What I would do is run a threaded for loop that transcribes parts of the mmapped array into a preallocated array in RAM, as sketched below. On my older SATA SSD I get around 1200 MB/s reads with 4 threads (at least according to benchmarks).
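Something along these lines (a rough sketch, not my actual code; the file name is just the example from above and the chunk size would need tuning for the drive):

using Mmap

src = Mmap.mmap("some very large file.dat")   # the file viewed as a Vector{UInt8}
dst = Vector{UInt8}(undef, length(src))       # preallocated destination in RAM
chunk = 16 * 1024 * 1024

# each thread copies its own chunks, so several reads are in flight at the same time
Threads.@threads for start in 1:chunk:length(src)
    stop = min(start + chunk - 1, length(src))
    copyto!(dst, start, src, start, stop - start + 1)
end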

Hm, if I remember correctly, SATA is limited to 600 MB/s. But the problem mostly shows up for external devices connected over an interface with a low IO-operation rate (USB 3.0 in my case). On an internal NVMe SSD with nearly 1M IOPS even the small buffer of IOStream is not a real problem.

I was able to get hold of a similarly sized Linux system, and here are the times of the benchmark above using the same external SSD:

Time to read 100000 * 16 kB:   3.645069 seconds (1 allocation: 16 bytes)
Time to read 1000 * 1600 kB:   3.664056 seconds (2 allocations: 32 bytes)
Time to mmap 1600 MB:   0.000050 seconds (19 allocations: 1.617 KiB)
Time to read mmapped 100000 * 16 kB:   3.672997 seconds (12.27 k allocations: 694.661 KiB, 0.16% compilation time)
Time to touch every page: 100000 * 16 kB:   3.682058 seconds (8.66 k allocations: 810.503 KiB, 0.24% compilation time)

This is with kernel 5.15 and the new in-kernel ntfs3 driver. As I suspected, Linux does much better read-ahead and the buffer size of the libuv layer does not matter. All times are the same within the noise and reach the maximum speed of the disk.

Unfortunately some people (like me) have to stick to Windows…

This is really interesting and might explain why I have observed strongly varying performance while writing data to files - all my tests were done on Windows.

Maybe it has something to do with caching that Windows performs on recently/frequently accessed files? In my reading tests I have tried to compensate for that by manually clearing read files from the cache, but I am not sure if it was successful.

So maybe for reading as well there is an optimum batch size per read call that would bring you closer to the maximum of your drive? I have tested it on 3 different PCs and 7 or 8 drives, and the pattern was stable, just different for each drive.
I, on the other hand, should repeat those tests on Linux to see if that solves the problem for my use case as well.

I believe[d] mmap is just a good thing, “several GB” trivial, and mmap well supported on all platforms Julia supports. I looked into it just in case, and I assume all the pros are shared and the cons don’t apply (at least for reading), as in the SQLite case: Memory-Mapped I/O

There are advantages and disadvantages to using memory-mapped I/O. Advantages include:
[…]
But there are also disadvantages:

[…]
1. An I/O error on a memory-mapped file cannot be caught and dealt with by SQLite. Instead, the I/O error causes a signal which, if not caught by the application, results in a program crash.
[…]
4. Windows is unable to truncate a memory-mapped file. Hence, on Windows, if an operation such as [VACUUM]

Because of the potential disadvantages, memory-mapped I/O is disabled by default.

Many databases use memory-mapped files to store data on disk, and this passage was seen as an unsubstantiated attack on those engines.

Nothing could be further from the truth. […]

I was saying that you can (and should) do better than the operating system for I/O management because the needs of the OS are not the needs of a DBMS.

Memory map files are not designed for database workloads. For many problems, they work great, but any database engine that primarily uses memory-mapped files as a persistence mechanism cannot be used as a reliable storage option.

As time goes by, the probability of losing or corrupting data on a DBMS using memory-mapped files converges to 1.

I’m not sure - are you sure it doesn’t read ahead? I mean, I’m not sure whether the package does it, or what is built into Julia, but I think it might happen at a lower level (and for mmap as well), e.g. by the OS. Possibly it depends on the OS. At least you can test it.

A memory-mapped file can be larger than the address space. The view of the memory-mapped file is limited by OS memory constraints, but that’s only the part of the file you’re looking at at one time. […]

In MS Windows, look at the MapViewOfFile function. It effectively takes a 64-bit file offset and a 32-bit length.
[…]
MapViewOfFile takes a 64-bit length on a 64-bit machine

I assume you use 64-bit, for 32-bit there are complications (see answer).

I finally managed to download Julia and build it on my local Windows machine. Then I followed the suggestion of the Stack Overflow post mentioned in my original post. The diff against v1.8.3 is as follows:

diff --git a/base/iostream.jl b/base/iostream.jl
index 23dfb53256..09d805ea90 100644
--- a/base/iostream.jl
+++ b/base/iostream.jl
@@ -74,6 +74,12 @@ iswritable(s::IOStream) = ccall(:ios_get_writable, Cint, (Ptr{Cvoid},), s.ios)!=

 isreadable(s::IOStream) = ccall(:ios_get_readable, Cint, (Ptr{Cvoid},), s.ios)!=0

+function sizehint!(s::IOStream, sz::Integer)
+    bad = @_lock_ios s ccall(:ios_growbuf, Cint, (Ptr{Cvoid}, Csize_t), s.ios, sz) != 0
+    systemerror("sizehint!", bad)
+end
+
+
 """
     truncate(file, n)

diff --git a/src/support/ios.c b/src/support/ios.c
index c0f1c92572..72bddbc72a 100644
--- a/src/support/ios.c
+++ b/src/support/ios.c
@@ -708,6 +708,11 @@ static void _buf_init(ios_t *s, bufmode_t bm)
     s->size = s->bpos = 0;
 }

+int ios_growbuf(ios_t *s, size_t sz)
+{
+    return _buf_realloc(s, sz) == NULL;
+}
+
 char *ios_take_buffer(ios_t *s, size_t *psize)
 {
     char *buf;
diff --git a/src/support/ios.h b/src/support/ios.h
index e5d83ec974..0c5b0b57fc 100644
--- a/src/support/ios.h
+++ b/src/support/ios.h
@@ -95,6 +95,7 @@ JL_DLLEXPORT int ios_eof_blocking(ios_t *s);
 JL_DLLEXPORT int ios_flush(ios_t *s);
 JL_DLLEXPORT int ios_close(ios_t *s) JL_NOTSAFEPOINT;
 JL_DLLEXPORT int ios_isopen(ios_t *s);
+JL_DLLEXPORT int ios_growbuf(ios_t *s, size_t sz);
 JL_DLLEXPORT char *ios_take_buffer(ios_t *s, size_t *psize);  // nul terminate and release buffer to caller
 // set buffer space to use
 JL_DLLEXPORT int ios_setbuf(ios_t *s, char *buf, size_t size, int own) JL_NOTSAFEPOINT;
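
With this patch applied, the benchmark from above only needs the buffer of the opened stream enlarged before reading (a sketch using the sizehint! method added above; 1 << 20 bytes = 1 MB, matching the run below):

io1 = open(largefile1)
sizehint!(io1, 1 << 20)   # grow the IOStream's internal buffer to 1 MB
readbuffer(io1, buffer1, div(totalsize, recordsize))
close(io1)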

Now using a 1 MB buffer my benchmark gives the following results:

Time to read 100000 * 16 kB: 3.991065 seconds (1 allocation: 16 bytes)
Time to read 1000 * 1600 kB: 3.991829 seconds (2 allocations: 32 bytes)
Time to mmap 1600 MB: 0.006230 seconds (19 allocations: 1.539 KiB)
Time to read mmaped 100000 * 16 kB: 5.203444 seconds (12.27 k allocations: 694.661 KiB, 0.12% compilation time)
Time to touch every page: 100000 * 16 kB: 4.995575 seconds (8.66 k allocations: 810.503 KiB, 0.16% compilation time)

For my real application this means that processing times dropped from about 1h to half an hour, which is really a huge improvement.

BufferedStreams.jl or memory mapping the file are not really a solution in my case, since my software has a mode where it tracks measurement data written to a file in real time for quick-look purposes. But mmap as exposed by Julia does not allow for dynamic remapping, and BufferedStreams doesn’t implement stat (maybe I could stat the underlying file, but I’m not sure this is safe…)

Is there a chance a patch like the above could be merged into Julia? Maybe adding a method to sizehint! is not the best name to describe what it actually does, though…


With rotating media, does your allocation continue to climb? Why are you declaring the size and passing it to zeros? Can’t you just declare the size in zeros?

Opening a PR with this seems like a good start.