IOStream and Filename Use


#1

would it make sense to be systematic and, for methods that take IOStream, also define methods that take a filename? many functions already do this, like write* and read*. others do not, like serialize and deserialize.

regards, /iaw


#2

Depends on your interface. If the API makes sense for both filenames and <: IO (note, not necessarily restricting to IOStream, unless you have a strong reason for this), I define methods for IO, then a method for AbstractString that takes care of files. This assumes that the operation makes sense for whole files.

But if I find that methods that work on <: IO and filenames are distinct, with the first operating on chunks of data and the latter writing multiple chunks to files, I use separate functions. The bottom line is that you should think about your interface.
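
A minimal sketch of the first pattern described above, assuming Julia 1.x. The function name write_items is hypothetical and only serves to show the layering: the real work is done by the IO method, and the filename method just opens the file and delegates.

```julia
# Hypothetical function; the name `write_items` is not part of Base.

# Core method: works on any stream (<: IO), e.g. a file, a socket, an IOBuffer.
function write_items(io::IO, items)
    for item in items
        println(io, item)
    end
    return nothing
end

# Convenience method: a filename means "do the whole operation on this file".
# The do-block form of open closes the file even if the body throws.
function write_items(filename::AbstractString, items)
    open(filename, "w") do io
        write_items(io, items)
    end
end
```

With this layout, write_items(IOBuffer(), xs) and write_items("out.txt", xs) share one implementation, and only the second one decides how the file is opened.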


#3

I understand the distinction in my own programs. alas, I am not sure I understand the distinction in Base. why would Base allow read and write on filenames, while serialize and deserialize would not?

there is beauty in symmetry, because it reduces the memory requirements, not on the computer, but on (old) programmers. besides, it makes explaining easier, too: “functions typically work on <: IO and on filenames.” done.

(yes, I mean <:.)

regards,

/iaw


#4

Maybe it depends on expected use cases of the different functions. Or it could be an inconsistency. There are people reviewing the API and making proposals for consistency for 1.0.

You can use a macro:

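macro withw(fcall)
    # The first argument of the call is the filename.
    fn = fcall.args[2]
    # Rebuild the call with the stream variable io in place of the filename.
    newcall = Expr(:call, fcall.args[1], :io, fcall.args[3:end]...)
    # esc() so that names in the call resolve in the caller's scope.
    esc(quote
        open($fn, "w") do io
            $newcall
        end
    end)
end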

@withw println("outfile.txt", "hello")
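
For comparison, this is roughly what the macro call above is meant to stand for when written by hand. It is only a sketch: the temporary path is an illustrative stand-in for "outfile.txt", and Julia 1.x is assumed.

```julia
# Hand-written equivalent of the macro call, using the do-block form of open.
# `path` is an illustrative temporary file, not the "outfile.txt" above.
path = joinpath(mktempdir(), "outfile.txt")
open(path, "w") do io
    println(io, "hello")
end
```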

There have been discussions over the past couple of years about adding something similar to Julia that does not require a macro. Allowing destructor methods is one idea. https://github.com/JuliaLang/julia/issues/7721


#5

for my own use, this would be a great macro. the problem is that I am writing a website/book/presentation for new students. if something is not built in, then I have to explain what I am doing. at that point, I may as well just write open…close.

do you happen to know how I can non-intrusively suggest this to the people reviewing the API? I don’t want to impose, but I think it is a reasonable suggestion and easy to implement.

regards,

/iaw


#6

I knew you’d make that criticism (which I support), so I didn’t address it. :) Even for myself I am reluctant to use a package (say with that macro) if I’m unsure about maintenance and broad support.

Finding where to bring it up might take some work. A good start is to search pull requests to julia for mention of the API for IO. Or, similar API reviews. Sometimes several people who think about Julia a lot spend hours thinking about and discussing specific design decisions like this one. Someone complains on discourse without reading any of the debate because a feature doesn’t seem convenient for the use case in front of their nose. The developers are in general remarkably patient when responding to these criticisms.

You are most likely to get a response if you do a thorough review of the inconsistency you see. But that may be a big effort. Here is a smallish example: https://github.com/JuliaLang/julia/pull/26442

Here is a bigger one: https://github.com/JuliaLang/Juleps/blob/master/Find.md
(There are very few “juleps”)


#7

I agree it would be useful if you could list functions which accept IO arguments and which could also take a filename. methodswith(IO) will give you the (long) list of all functions which take IO arguments.
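
A rough sketch of how such a list could be started, assuming Julia 0.7 or later, where methodswith lives in InteractiveUtils. The helper takes_string is a hypothetical name, and checking for a single String argument is only a first approximation of "has a filename method".

```julia
using InteractiveUtils  # home of methodswith since Julia 0.7

# Names of exported Base functions with at least one method that has an
# argument declared as IO.
io_names = sort(unique(m.name for m in methodswith(IO, Base)))

# Hypothetical helper: can f be called with a single String argument?
# Filename methods are usually written that way.
takes_string(f) = hasmethod(f, Tuple{String})

takes_string(countlines)  # true: countlines has a filename method
takes_string(flush)       # compare with a function that only makes sense on streams
```

A true result still has to be checked by hand, because a String method does not have to interpret the string as a filename.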


#8

it’s not a large number. about 50 methods.

there are 6 functions with IO which also already have one whose arguments start with filename::AbstractString

eachline
read (and read!)
readline
readstring
readuntil
write

there are 44 functions that take IO arguments but do not have one starting with a filename. for some, like flush, it would not make sense, either. for some, like displaysize or print_with_color, it seems less useful.

 apropos
 code_llvm
 code_native
 code_warntype
 convert
 countlines
 deserialize
 displaysize
 dump
 flush
 get
 getindex
 haskey
 in
 info
 ismarked
 join
 lock
 logging
 mark
 open
 parse
 pipeline
 print
 print_shortest
 print_with_color
 println
 readbytes!
 reset
 seekstart
 serialize
 show
 showall
 showcompact
 showerror
 unlock
 unmark
 unsafe_read
 unsafe_write
 versioninfo
 warn
 whos
 writedlm

so, let’s hope someone who is working on API consistency notices this post. :)


#9

I don’t see any major problems with the two lists above, as the latter list is for operations that make little sense for whole files (except in very special cases).

Also, note that open of course accepts filenames, so it should be in the first list.

That said, if you feel strongly about this issue, then you should review existing issues and open a new one if it has not been discussed before. Comments here may get lost in the noise, especially now that the core devs are in “let’s get v0.7 finalized and deal with everything else later” mode.


#10

on reflection, I was wrong altogether. some of the functions can have a string as their second argument, which explains, e.g., why println() cannot have a filename as its first argument. duhhh.
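
A small illustration of that clash, assuming Julia 1.x: a string in first position is already a valid thing to print, so it cannot also mean "open this file".

```julia
# sprint(f, args...) calls f(io, args...) and returns what was written.
out = sprint(println, "outfile.txt", "hello")

# The first string is printed as data, not opened as a file.
out == "outfile.txthello\n"
```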


#11

That went through my head, but didn’t make it into comments above. It’s why I chose println as an example. But, to solve this problem it might be useful to have a light wrapper for strings representing filenames. There is FilePaths.jl, but it has a different purpose. One disadvantage is that, if using a wrapper were adopted as a standard, it would make it less convenient for people who use a particular function more often with a filename than with an IO type. They would be forced to use the wrapper. On the other, other hand, string macros fr"..." and fw"..." might be convenient enough.
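
To make the wrapper idea concrete, here is a minimal sketch assuming Julia 1.x. Everything in it is hypothetical: the type WriteFile, the string macro behind fw"...", and the forwarding method are illustrations, not an existing API.

```julia
# Hypothetical wrapper marking a string as "a file to be opened for writing".
struct WriteFile
    path::String
end

# Hypothetical string macro so that fw"out.txt" builds a WriteFile.
macro fw_str(path)
    return :(WriteFile($path))
end

# One forwarding method per function that should accept the wrapper.
Base.println(f::WriteFile, xs...) = open(io -> println(io, xs...), f.path, "w")

# Usage: println(fw"outfile.txt", "hello") writes "hello\n" to outfile.txt,
# while println("outfile.txt", "hello") still prints both strings to stdout.
```

The sketch also shows the cost: each function needs its own forwarding method unless some general mechanism generates them.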

Also, there is no good way to create a stream that is closed automatically when there are no references. At least this is what I understand from reading about finalize, etc. So each function would still have a method for handling files that explicitly calls close, as is the case now. Nonetheless, there is at least one package that uses finalizer and finalize to automatically close a stream: https://github.com/BioJulia/Libz.jl/issues/7.
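
A sketch of the finalizer mechanism, assuming Julia 0.7 or later (argument order finalizer(f, obj)). The type AutoCloseFile is hypothetical and is not how Libz.jl does it. As far as I know, Base's IOStream already registers a similar finalizer itself, so the wrapper only makes the mechanism visible.

```julia
# Hypothetical wrapper type; not part of Base or of Libz.jl.
mutable struct AutoCloseFile
    io::IOStream
    function AutoCloseFile(path::AbstractString, mode::AbstractString)
        obj = new(open(path, mode))
        # Register a function to run when obj is garbage collected.
        finalizer(o -> close(o.io), obj)
        return obj
    end
end

path = tempname()
f = AutoCloseFile(path, "w")
print(f.io, "hello")

# finalize runs the registered finalizer right away; without it the stream
# would be closed at some unspecified later garbage collection.
finalize(f)
```

The unspecified timing is the weak point: nothing guarantees the data is on disk when the variable goes out of scope.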

Also, I think the function countlines is mis-categorized above. It does take a filename.

I think it would be useful to have some facility for using filenames in place of streams that is convenient and semi-automatic, for use both in writing modules and user-facing APIs. In this case, the second line of the documentation for countlines

  countlines(io::IO, eol::Char='\n')

  Read io until the end of the stream/file and count the number of lines. To
  specify a file pass the filename as the first argument. EOL markers other
  than '\n' are supported by passing them as the second argument.

would be redundant because there would be a facility for doing this for all functions. This is analogous to no longer mentioning that sin also operates arraywise, because it doesn’t any longer, and you can still easily get this behavior. There are some important details to implement, like how to choose the mode when opening.

Stefan commented somewhere that some people will never use the do block syntax to open files, no matter how often you ask them to. Obviating this problem would be another advantage. One suggestion in that thread was to implement println(open("filename","w")!, x,...), although there are good reasons not to use ! for this purpose. I like fw"filename" as an alternative. It was decided not to pursue the issue until after v1.0 is out. Still it could be implemented in a module, and using a macro, to try out ideas.


#12

I agree with most everything you write (incl a mechanism that would then make filename as argument redundant, because it is not so orthogonal). yes, my list had some bugs in it. discovered more. but it is no longer important.

for my students, for writing near-oneliners, I am going to recommend println(open("filename","w"), ...); gc(), which is not ideal, but nicely short and succinct. For reading, I can even recommend the same without the gc(): the garbage collector will eventually close it, and there is no harm done. it was important to know that this is default behavior: https://github.com/JuliaLang/julia/issues/26476 . I was not sure.
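
For comparison, a sketch of near-one-liners that do not depend on the garbage collector, assuming Julia 1.x (where gc() is spelled GC.gc()). The temporary path is only for illustration.

```julia
path = tempname()  # illustrative temporary file

# One-liner using the function form of open; the file is closed on return.
open(io -> println(io, "hello"), path, "w")

# Whole-file helpers that take a filename directly.
write(path, "hello\n")
text = read(path, String)
```

The function form of open is the same mechanism as the do block, just written on one line.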

the open()! would indeed be even nicer.

The do...end syntax is too verbose for my taste for one-liners. it is good for multi-liners. then again, I understand why it is not ubiquitous: it adds another level of indentation, and when indentations multiply, this can make code harder to digest.

/iaw


#13

Are you talking about reference counting specifically, or is closing the stream when the object gets garbage collected sufficient? See also #7721 in which I proposed this syntax:

input = open("input.txt", "r")!
output = open("output.txt", "w")!
...
# close(input) when it goes out of scope
# close(output) when it goes out of scope

In top-level scope this syntax would do nothing and just allow the GC to do its thing (that is when a global object goes out of scope, after all). This is quite convenient since it allows pasting code from a function body which in that context closes handles upon function exit (or at the end of a loop body or whatever) and have it work without modification in the REPL. The current open(path) do io ... end construct is far more annoying since you can’t paste line-by-line into the REPL without changing it to io = open(path). Letting file handles (and other objects requiring finalization) at the top-level stay open until they are collected is not a big concern since the number of them is proportional to the amount of code entered at the REPL, which is presumably small.
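
For reference, the closest explicit equivalent available today inside a function is try/finally. The sketch below assumes Julia 1.x, and the function name copy_upper is hypothetical. It shows what the proposed syntax would save: one nesting level per handle.

```julia
# Hypothetical helper; the name `copy_upper` is illustrative only.
function copy_upper(inpath::AbstractString, outpath::AbstractString)
    input = open(inpath, "r")
    try
        output = open(outpath, "w")
        try
            write(output, uppercase(read(input, String)))
        finally
            close(output)  # runs on normal exit and on error
        end
    finally
        close(input)       # runs even if opening the output file failed
    end
    return nothing
end
```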


#14

I really meant no way to close the stream when it goes out of scope. It will be closed sometime after it goes out of scope. In saying “no good way”, I was implicitly referring to statements like

Finalizers are inefficient and unpredictable.

Your reply clarifies the situation and the proposal sounds reasonable… It might be nice to know that an output buffer is flushed when the identifier goes out of scope at the REPL, but before the garbage collector runs.


#15

that would be fantastic, indeed.


#16

Yes, this would be really great. Unfortunately, without reference counting there’s no way to accomplish that and there’s no obvious way to add reference counting to the language without wrecking performance.


#17

I am not a language designer and I do not know the julia internals. so forgive my ignorance.

I thought that when a file pointer goes out of scope, the compiler could know this immediately. it cannot deallocate what the pointer points to, because until a gc, it cannot know whether some other pointer still points there, too. (with ref counting, it would know immediately, but julia is gc and not refcounting based. makes sense.)

so if the compiler/runtime does get a ping whenever a file object pointer disappears, calling a flush but not a full gc(), should be cheap. otherwise, this is too painful.

is my understanding far off the mark?


#18

That seems to imply performing some kind of action every time a variable goes out of scope, which would be very costly even if it was not a full gc sweep. Ref counting is just a couple of integer ops on scope exit and that’s already prohibitive enough that you never see ref counting in high performance languages. It is possible that there’s some way to do this but it’s not clear how.


#19

all makes sense.

for most simple uses, a gc() is enough, because it forces all open files to flush, which will have to be done sooner or later anyway. for a moment, I wondered whether we should have a flushall(), but searching for all write-open files is probably almost as expensive as the gc() itself. and for not-simple uses, it’s not the right approach anyway.
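
When only one known stream matters, an explicit flush on that stream is the cheap alternative to a full collection. A minimal sketch, assuming Julia 1.x and an illustrative temporary file:

```julia
path = tempname()
io = open(path, "w")
print(io, "hello")
flush(io)                      # push buffered output to the file now
visible = read(path, String)   # read back while io is still open
close(io)
```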