It seems that read(iostream, String) reads the entire file into memory at once. (Please correct me if I’m mistaken.) I deduced that from this test code:
f = open("tmp.txt", "r")
s = read(f, String)      # reads everything up to EOF
typeof(s)                #-> just a String.
m = match(someregex, s)  # someregex: placeholder for whatever Regex you search for
close(f)
Is there an idiom to read a string lazily from an iostream? You sometimes want to read from a pipe and stop reading as soon as you find the substring you wanted. Such a solution would be more general than reading the whole thing at once.
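For the simple case where the thing you are looking for is a fixed substring rather than a regex, here is a minimal sketch using readuntil. The IOBuffer and the START/END markers are made up for illustration and stand in for a real pipe.

```julia
# Stand-in for a pipe; the markers "START" and "END" are hypothetical.
io = IOBuffer("header\nnoise noise START payload END trailing data\n")

# readuntil consumes input up to and including the delimiter,
# and returns the text before it (keep=false is the default).
skipped = readuntil(io, "START")
payload = readuntil(io, "END")

# Everything after "END" has not been read yet.
rest_unread = !eof(io)

println(repr(payload))
```

Note that if the delimiter never shows up, readuntil keeps reading until EOF, so on its own it does not protect you from an endless stream.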
I learned that read(stream, String) is what is recommended:
My question is more about a potential general idiom you use by default.
Here is an analogy. I often store a numeric (e.g., Float64) array in a plain binary file. Some people copy the binary data from the file into main memory all at once. What if the file is too big to fit in memory? They would say that they then read the data chunk by chunk.
But I don’t do that. I just use mmap by default. In that way, I don’t have to care about whether the file is big or not.
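To make the analogy concrete, here is roughly what I mean, as a sketch. The file name comes from tempname() and the array contents are made up; the element type and length are assumptions you have to know yourself, because a plain binary file carries no header.

```julia
using Mmap

# Write a Float64 vector as raw bytes (no header, native byte order).
path = tempname()              # hypothetical temporary file
data = collect(1.0:1000.0)
write(path, data)

# Map it back. The length is derived from the file size.
n = filesize(path) ÷ sizeof(Float64)
f = open(path, "r")
arr = Mmap.mmap(f, Vector{Float64}, (n,))
close(f)

println(arr[1], " ", arr[end])
```

The OS pages data in as arr is indexed, so the same code works whether the file is small or larger than RAM.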
Then, what is your default way for String? . . . Having said that, I realized, after posting my initial message, that I have to think about the danger of a read that never ends. What if my regular expression doesn’t match any part of the input, which can be infinitely long if it comes from a pipe?
I would need a way to give up in the middle. This is how reading from a pipe differs from mmapping a disk file.
That has made me realize that the default way should be either your eachline() or @jling's readuntil().
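A sketch of how the "give up in the middle" part might look with eachline. The function name find_in_stream and the max_lines keyword are my own invention, not anything from Base, and the IOBuffer again stands in for a pipe.

```julia
"""
Search `io` line by line for `re`, giving up after `max_lines` lines.
Returns the `RegexMatch`, or `nothing` if the budget runs out or the stream ends.
(Hypothetical helper, not part of Base.)
"""
function find_in_stream(re::Regex, io::IO; max_lines::Integer = 10_000)
    for (i, line) in enumerate(eachline(io))
        m = match(re, line)
        m !== nothing && return m      # stop reading as soon as we have a match
        i >= max_lines && break        # give up: line budget exhausted
    end
    return nothing
end

io = IOBuffer("alpha\nbeta\nkey=42\ngamma\n")
m = find_in_stream(r"key=(\d+)", io)
value = m === nothing ? nothing : m.captures[1]

# Lines after the match were never consumed.
remaining = read(io, String)
```

Two limitations of this sketch: the regex cannot match across line boundaries, and the budget counts lines, so a single line that never ends would still block. A byte budget would be needed for that case.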
There are also ways to do eachline with a pre-allocated buffer, so that you don’t allocate a new string for each line, e.g. using the ViewReader package, or the upcoming copyuntil function in Julia 1.11.
Of course! But for temporary files, plain binary is extremely convenient. You produce temporary files in one program, use them in other programs, and delete them once all processing is done. Writing a plain binary file is a one-liner, and reading it as an mmapped array takes two lines (plus using Mmap). There is no advantage in “a real file format” for this use.
In addition, some of my colleagues still produce plain binary files, which I sometimes have to use.
I agree! In our field, “netCDF” is the de facto standard and it’s so much better than plain binary. The problem is (or used to be) that its Fortran interface is so, so tedious to use that you are very reluctant to write a program to produce netCDF files in Fortran. I shrink from it.
Now that netCDF’s Python and Julia interfaces are so much better, there is no excuse to use plain binary, except as temporary files (see above), as long as you use Python or Julia. But, there are still Fortran-only people, even among young scientists, in our field . . .