Lazy read string from iostream?

It seems that read(iostream, String) reads the entire file into memory at once. (Please correct me if I’m mistaken.) I deduced that from this test code:

f = open("tmp.txt", "r")
s = read(f, String)
typeof(s) #-> just a String.
m = match(someregex, s)

Is there an idiom to read a string lazily from an iostream? You sometimes want to read from a pipe and stop reading as soon as you find the substring you wanted. Such a solution would be more general than reading the whole thing at once.

I learned that read(stream, String) is what is recommended:

Could you be more specific about how you would like to read?

  1. Do you want to read bytes? Try just read(f)
  2. Do you want to read lines? Try eachline(f).
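
For the line-based option, here is a minimal sketch of stopping as soon as a regex matches. The function name `first_match_by_line` is made up for illustration, and an `IOBuffer` stands in for the file or pipe. Note that a match spanning a line break would not be found this way.

```julia
# Return the first regex match found while scanning `io` line by line,
# or `nothing` if the stream ends without a match.
# (`first_match_by_line` is a hypothetical helper, not part of Base.)
function first_match_by_line(re::Regex, io::IO)
    for line in eachline(io)          # pulls one line at a time from the stream
        m = match(re, line)
        m === nothing || return m     # stop reading as soon as there is a hit
    end
    return nothing
end

io = IOBuffer("alpha\nbeta 42\ngamma 7\n")   # stand-in for a file or pipe
m = first_match_by_line(r"\d+", io)
remaining = read(io, String)                 # whatever was not consumed yet
```

After the call, `remaining` holds only the part of the stream after the matching line, so nothing past the hit was read.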

A normal regex engine won’t work; you need a non-backtracking one:

You might be interested in:
https://docs.julialang.org/en/v1/base/io-network/#Base.readuntil
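
A small sketch of how readuntil could be used for this, assuming the thing you are looking for is a fixed delimiter string rather than a regex. The helper name `skip_to` is hypothetical. With `keep=true` the delimiter stays at the end of the returned chunk, which makes it possible to tell "found" apart from "hit end of stream".

```julia
# Consume `io` up to and including `delim`.
# Returns true if the delimiter was found, false if the stream ended first.
# (`skip_to` is a hypothetical helper, not part of Base.)
function skip_to(io::IO, delim::AbstractString)
    chunk = readuntil(io, delim; keep=true)  # keep=true leaves delim at the end
    return endswith(chunk, delim)
end

io = IOBuffer("log noise ... START payload follows")  # stand-in for a pipe
found = skip_to(io, "START")
rest = read(io, String)        # the stream is positioned right after "START"

missing_case = skip_to(IOBuffer("nothing here"), "START")
```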


My question is more about a potential general idiom you use by default.

Here is an analogy. I often store a numeric (e.g., Float64) array in a plain binary file. Some people copy the binary data from a file into main memory all at once. What if the file is too big to fit in memory? They would say that they read the data chunk by chunk.

But I don’t do that. I just use mmap by default. In that way, I don’t have to care about whether the file is big or not.

Then, what is your default way for String? . . . Having said that, I realized, after posting my initial message, that I have to think about the danger of an infinite read. What if my regular expression doesn’t match any part of the input string, which can be infinitely long if it comes from a pipe?

I would need a way to give up in the middle. This is the difference between reading from a pipe and mmapping a disk file.
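
To make "give up in the middle" concrete, here is a rough sketch that reads in chunks and stops after a byte budget. Everything named here (`match_with_limit`, `chunk_bytes`, `max_bytes`) is hypothetical. It re-scans the whole buffer after every chunk, so it is quadratic in the worst case, and a greedy pattern can match early on data that is still incomplete. On a real pipe, `eof` and `read(io, n)` can block while waiting for data, so this bounds the amount read, not the time spent.

```julia
# Search `io` for `re`, reading at most `max_bytes` bytes in total.
# Returns a RegexMatch, or `nothing` if the budget or the stream runs out.
# (`match_with_limit` and its keywords are hypothetical, not part of Base.)
function match_with_limit(re::Regex, io::IO;
                          chunk_bytes::Integer = 4096,
                          max_bytes::Integer = 1_000_000)
    buf = UInt8[]
    while !eof(io) && length(buf) < max_bytes
        n = min(chunk_bytes, max_bytes - length(buf))
        append!(buf, read(io, n))            # read up to n more bytes
        # String(v) takes ownership of v, so search a copy of the buffer
        m = match(re, String(copy(buf)))
        m === nothing || return m
    end
    return nothing                           # gave up: no match within budget
end

hit  = match_with_limit(r"needle", IOBuffer("x"^10_000 * "needle" * "y"^10);
                        chunk_bytes = 1024)
src  = IOBuffer("x"^5000)
miss = match_with_limit(r"needle", src; max_bytes = 2048)
```

In the second call the function stops after 2048 bytes even though more input is available.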

That has made me realize that the default way should be either your eachline() or @jling’s readuntil().

My answer is to use a real file format… copying memory into a disk file and back is just bad for so many reasons, especially since it’s almost 2024 :)


You can mmap a string using StringViews.jl

There are also ways to do eachline with a pre-allocated buffer, so that you don’t allocate a new string for each line, e.g. using the ViewReader package, or the upcoming copyuntil function in Julia 1.11.


Of course! But for temporary files, plain binary is extremely convenient. You produce temporary files in one program, use them in other programs, and delete them once all processing is done. Writing a plain binary file is a one-liner and reading it as an mmapped array takes two lines of code (plus using Mmap). There is no advantage in “a real file format” for this use.
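
For concreteness, this is roughly what I mean, as a sketch: the file path and the array length `n` are placeholders, and the reader has to know the element type and length in advance because a plain binary file carries no header.

```julia
using Mmap

n = 1000
a = rand(Float64, n)
path = tempname()                  # placeholder path for the temporary file

write(path, a)                     # the one-liner: dump the raw bytes

# The two lines: open the file and map it as a Vector{Float64} of length n.
b = open(path, "r") do io
    Mmap.mmap(io, Vector{Float64}, (n,))
end
```

The mapping stays valid after the `open` block closes the file handle, and pages are only brought into memory when `b` is indexed.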

In addition, some of my colleagues still produce plain binary files, which I sometimes have to use.

I agree! In our field, “netCDF” is the de facto standard and it’s so much better than plain binary. The problem is (or used to be) that its Fortran interface is so, so tedious to use that you are very reluctant to write a program to produce netCDF files in Fortran. I shrink from it.

Now that netCDF’s Python and Julia interfaces are so much better, there is no excuse to use plain binary, except as temporary files (see above), as long as you use Python or Julia. But, there are still Fortran-only people, even among young scientists, in our field . . .
