Lazy read string from iostream?

ryofurue · November 27, 2023, 6:06am

It seems that read(iostream, String) reads the entire file into memory at once. (Please correct me if I’m mistaken.) I deduced that from this test code:

f = open("tmp.txt", "r")
s = read(f, String)
typeof(s) #-> just a String.
m = match(someregex, s)

Is there an idiom to read a string lazily from an iostream? You sometimes want to read from a pipe and stop reading as soon as you find the substring you wanted. Such a solution would be more general than reading the whole thing at once.

I learned that read(stream, String) is what is recommended:

mkitti · November 27, 2023, 2:27pm

Could you be more specific about how you would like to read?

Do you want to read bytes? Try just read(f)
Do you want to read lines? Try eachline(f).
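
As a minimal sketch of the eachline(f) suggestion, the loop below scans a stream one line at a time and stops as soon as a regex matches. The helper name first_match and the sample input are made up for illustration; only eachline and match come from Base.

```julia
# Return the first regex match found while scanning `io` line by line,
# or `nothing` if the stream ends first. Only one line is held at a time.
function first_match(io::IO, re::Regex)
    for line in eachline(io)
        m = match(re, line)
        m === nothing || return m   # stop reading as soon as we have a hit
    end
    return nothing
end

io = IOBuffer("alpha\nbeta 42\ngamma\n")   # stand-in for a file or pipe
m = first_match(io, r"\d+")
```

Because the loop returns early, the rest of the stream is left unread. The obvious limitation is that a match spanning a line break will not be found.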

jling · November 27, 2023, 2:35pm

a normal regex engine won’t work, you need a non-backtracking one:

you might be interested:
https://docs.julialang.org/en/v1/base/io-network/#Base.readuntil
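
A small sketch of how readuntil could be used to stop at a marker. The markers "START" and "END" and the sample text are invented for the example; note that readuntil takes a literal delimiter (string, character, or byte), not a Regex.

```julia
io = IOBuffer("noise noise START payload END trailing")  # stand-in for a pipe

skipped = readuntil(io, "START")   # consumes up to and including the marker
payload = readuntil(io, "END")     # text between the two markers
rest    = read(io, String)         # whatever was left unread
```

If the delimiter never appears, readuntil reads to the end of the stream, so on an endless pipe it has the same "infinite read" risk as read(io, String).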

ryofurue · November 28, 2023, 5:21am

My question is more about a potential general idiom you use by default.

Here is an analogy. I often store a numeric (e.g., Float64) array in a plain binary file. Some people copy the binary data from a file into main memory at once. What if the file is too big to fit in memory? They would say that they read the data chunk by chunk.

But I don’t do that. I just use mmap by default. In that way, I don’t have to care about whether the file is big or not.

Then, what is your default way for String? . . . Having said that, I realized, after posting my initial message, that I have to think about the danger of infinite read. What if my regular expression doesn’t match any part of the input string, which can be infinitely long if it comes from a pipe?

I would need a way to give up in the middle. This is how reading from a pipe differs from mmapping a disk file.

That has made me realize that the default way should be either your eachline() or @jling’s readuntil().
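
One way to "give up in the middle" is to cap how much input the search is allowed to consume. This is only a sketch: the function name first_match_bounded and the maxbytes keyword are hypothetical, and the default limit is arbitrary.

```julia
# Scan `io` line by line, but stop once roughly `maxbytes` bytes have been
# read without a match. Returns the RegexMatch, or `nothing` on give-up/EOF.
function first_match_bounded(io::IO, re::Regex; maxbytes::Integer = 1_000_000)
    nread = 0
    for line in eachline(io; keep = true)   # keep=true so newlines are counted
        nread += ncodeunits(line)
        m = match(re, line)
        m === nothing || return m
        nread >= maxbytes && break          # budget exhausted: give up
    end
    return nothing
end

hit  = first_match_bounded(IOBuffer("aaa\nbbb\nccc 7\n"), r"\d")
miss = first_match_bounded(IOBuffer("aaa\nbbb\nccc 7\n"), r"\d"; maxbytes = 4)
```

The limit is only checked between lines, so a producer that never emits a newline would still block; guarding against that would need chunked reads rather than eachline.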

jling · November 28, 2023, 11:16am

my answer is to use a real file format… copying memory to a disk file and back is just bad for so many reasons, especially since it’s almost 2024

stevengj · November 28, 2023, 4:47pm

You can mmap a string using StringViews.jl

There are also ways to do eachline with a pre-allocated buffer, so that you don’t allocate a new string for each line, e.g. using the ViewReader package or the upcoming copyuntil function in Julia 1.11.
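
A brief sketch of copyuntil, assuming Julia 1.11 or later (the function does not exist in earlier releases). It copies from one stream into another that you already own, up to a delimiter; the sample input and the ';' delimiter are made up.

```julia
# Requires Julia 1.11 or later.
src = IOBuffer("key=value;rest")   # stand-in for a file or pipe
out = IOBuffer()                   # reusable output buffer

copyuntil(out, src, ';')           # copy up to ';' into `out`; ';' is consumed
field = String(take!(out))         # materialize a String only when needed
remaining = read(src, String)
```

Here the String is created just to show the result; in a hot loop the point is that `out` can be reused across calls instead of allocating a new string per record.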

ryofurue · November 29, 2023, 3:15am

Of course! But, for temporary files, plain binary is extremely convenient. You produce temporary files in one program, use them in other programs, and delete them once all the processing is done. Writing a plain binary file is a one-liner, and reading it as an mmapped array takes two lines of code (plus using Mmap). There is no advantage in “a real file format” for this use.
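
For readers unfamiliar with the pattern, the write one-liner and the mmap read might look roughly like this. The temporary path, the array size, and deriving the length from the file size are illustrative assumptions; plain binary carries no metadata, so the element type and shape must be known by the reader.

```julia
using Mmap

path = tempname()                  # hypothetical temporary file
a = rand(Float64, 1000)

write(path, a)                     # one-liner: dump the raw bytes

# Reading back: the file itself does not record type or length.
n = filesize(path) ÷ sizeof(Float64)
b = open(f -> Mmap.mmap(f, Vector{Float64}, n), path)
```

The array b is backed by the file and paged in on demand, so the same code works whether or not the file fits in memory.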

In addition, some of my colleagues still produce plain binary files, which I sometimes have to use.

I agree! In our field, “netCDF” is the de facto standard and it’s so much better than plain binary. The problem is (or used to be) that its Fortran interface is so, so tedious to use that you are very reluctant to write a program to produce netCDF files in Fortran. I shrink from it.

Now that netCDF’s Python and Julia interfaces are so much better, there is no excuse to use plain binary, except as temporary files (see above), as long as you use Python or Julia. But, there are still Fortran-only people, even among young scientists, in our field . . .
