Difference between `write` and `print`

The documentation entries for write and print seem very similar. There are a few minor differences (write allows file names as well as I/O streams and returns the number of bytes written, while print calls show if the argument doesn’t have a canonical text representation), but it seems like they would be easy to combine into a single function. What is the motivation for having both? Is there a fundamental difference between their behavior that I’m not noticing?

print outputs a text representation of an object, whereas write outputs raw bytes. Try x=7310302560386184563 and look at println(x) versus write(stdout, x); println() to see the difference.
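To make the difference concrete, here is a small sketch that captures both outputs in buffers rather than on the terminal. The variable names are only for illustration, and `htol` is added so the bytes come out in little-endian order on any machine (on a typical little-endian machine, plain `write(stdout, x)` produces the same bytes).

```julia
x = 7310302560386184563

# print: the decimal text representation of the integer
text = sprint(print, x)

# write: the 8 raw bytes of the Int64 (htol forces little-endian byte order)
io = IOBuffer()
nbytes = write(io, htol(x))
raw = String(take!(io))

println(text)    # 7310302560386184563
println(nbytes)  # 8
println(raw)     # surprise
```

The integer happens to be the eight ASCII bytes of a word packed into an `Int64`, which is why the raw output is readable at all.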

Holy smokes. That was well played. I just assumed that write would output similar text to writedlm and writecsv, based on the name similarity.

Given the drastic difference in output, would it be clearer to rename those functions as printdlm and printcsv?

very clever, steven. :)

There will be another difference in v0.7/v1.0, if you have strings with different encodings:

print* outputs strings in UTF-8 format (hopefully that can be changed so that the desired output encoding can be selected), no matter how the string is encoded.
This is important if you have strings with different encodings that you want to print out together.

write outputs strings in their “native” encoding, without conversions.
If you have a UTF-8 encoded string, that is what gets output.
If you have ISO-8859-1 (i.e. Latin-1) then that gets output directly. Same for UTF-16 or UTF-32, they will be output (in the native byte order) as a set of 2 byte or 4 byte words.
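As a rough illustration using only what Base offers (not the alternative string types discussed here), `transcode` can produce UTF-16 code units as a `Vector{UInt16}`, and `write` then emits them as 2-byte words in native byte order with no conversion. The variable names are just for this sketch.

```julia
s = "héllo"

# A String is stored as UTF-8, so write emits its UTF-8 bytes unchanged.
io = IOBuffer()
n_utf8 = write(io, s)
utf8_bytes = take!(io)

# The same text as UTF-16 code units; write emits each one as a 2-byte word.
u16 = transcode(UInt16, s)
io = IOBuffer()
n_utf16 = write(io, u16)
utf16_bytes = take!(io)

println(n_utf8)   # 6  ('é' takes two bytes in UTF-8)
println(n_utf16)  # 10 (five code units, two bytes each)
```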

print only outputs UTF-8 for the built-in IO types which are defined to be UTF-8 encoded. In order to support I/O defaulting to other encodings, the appropriate approach is to define new IO subtypes with different default encodings. Nothing in Base needs to be changed for this as far as I’m aware.

The general principle is that for print the output stream determines the encoding (UTF-8 by default for all built-in IO types); for write the object itself determines the raw data, including the encoding. Endianness may be an exception to that, but again, I/O types with different endianness can and should be added outside of Base.
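A minimal sketch of that principle, assuming a hypothetical wrapper type `Latin1IO` (it is not part of Base or any package I know of): `print` on the wrapper encodes text as Latin-1, while `write` still passes the object's own bytes through untouched.

```julia
# Hypothetical IO subtype whose text encoding is Latin-1 instead of UTF-8.
struct Latin1IO{T<:IO} <: IO
    io::T
end

# Raw bytes pass straight through, so write keeps its usual meaning.
Base.write(l::Latin1IO, b::UInt8) = write(l.io, b)

# Text is encoded by the stream: one byte per code point, errors above U+00FF.
function Base.print(l::Latin1IO, s::String)
    for c in s
        cp = UInt32(c)
        cp <= 0xff || throw(ArgumentError("cannot encode $(repr(c)) as Latin-1"))
        write(l.io, UInt8(cp))
    end
    return nothing
end

buf = IOBuffer()
print(Latin1IO(buf), "café")
printed = take!(buf)   # Latin-1 bytes, chosen by the stream

buf = IOBuffer()
write(Latin1IO(buf), "café")
written = take!(buf)   # UTF-8 bytes, chosen by the String itself

println(printed)
println(written)
```

Only the `String` method of `print` is overridden here to keep the sketch short; a real implementation would also need to cover characters and other string types.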

OK. I was wondering if it would make sense to add the character set encoding in the IOContext.

I don’t think it would be an exception; at least for what I’ve implemented for different-endian encodings, it would not be necessary.

Yes, I’ve been playing with a patch along these lines, with a function

```julia
"""
    textencoding(io::IO)

Returns the encoding (a subtype of [`AbstractTextEncoding`](@ref)) used
for reading/writing text via [`print`](@ref) in the stream `io`.

Defaults to [`UTF8Encoding`](@ref), but can be changed by setting
the `:textencoding` property of an [`IOContext`](@ref).  Other encodings
may be defined/supported by external packages.
"""
textencoding(io::IO) = get(io, :textencoding, UTF8Encoding)::Type{<:AbstractTextEncoding}

# dispatch based on encoding:
print(io::IO, c::AbstractChar) = _print(io, textencoding(io), c)
print(io::IO, s::AbstractString) = _print(io, textencoding(io), s)
```

(Particular IO types could also overload textencoding this way, rather than relying exclusively on IOContext.)

The reason that I’m thinking of having it return a type, rather than an instance of a type, is that it looks like it will be useful for all the encoding types to be abstract so that they can be subtyped arbitrarily. If Encoding1 <: Encoding2, that means that any valid text in Encoding2 has the same meaning in Encoding1, but not vice versa. So, for example, UTF8Encoding <: ASCIIEncoding. You want to be able to have _print(io::IO, encoding::Type{<:ASCIIEncoding}, c::ASCIIChar) = write(io, UInt8(c)), for example.
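Here is a self-contained sketch of how that dispatch could play out. Everything in it is an assumption for illustration: the encoding types are declared locally (none of them exist in Base), `Latin1Encoding` is added only to have a second branch, plain `Char` stands in for `ASCIIChar`, and `encoded_print` is a hypothetical stand-in so that `Base.print` is not overridden.

```julia
# Abstract encoding types, so they can be subtyped arbitrarily (all hypothetical).
abstract type AbstractTextEncoding end
abstract type ASCIIEncoding <: AbstractTextEncoding end
abstract type UTF8Encoding <: ASCIIEncoding end
abstract type Latin1Encoding <: ASCIIEncoding end

# Look up the encoding in an IOContext, defaulting to UTF-8.
textencoding(io::IO) = get(io, :textencoding, UTF8Encoding)::Type{<:AbstractTextEncoding}

# Fallback for any ASCII-compatible encoding: only ASCII characters are safe.
function _print(io::IO, ::Type{<:ASCIIEncoding}, c::Char)
    isascii(c) || throw(ArgumentError("non-ASCII character $(repr(c))"))
    write(io, UInt8(c))
    return nothing
end

# UTF-8: Char is already UTF-8 encoded, so its bytes go out as-is.
_print(io::IO, ::Type{<:UTF8Encoding}, c::Char) = (write(io, c); nothing)

# Latin-1: one byte per code point up to U+00FF.
function _print(io::IO, ::Type{<:Latin1Encoding}, c::Char)
    cp = UInt32(c)
    cp <= 0xff || throw(ArgumentError("cannot encode $(repr(c)) as Latin-1"))
    write(io, UInt8(cp))
    return nothing
end

# Hypothetical entry point standing in for print(io, c).
encoded_print(io::IO, c::Char) = _print(io, textencoding(io), c)

buf = IOBuffer()
encoded_print(buf, 'é')
utf8_out = take!(buf)      # default encoding

encoded_print(IOContext(buf, :textencoding => Latin1Encoding), 'é')
latin1_out = take!(buf)    # encoding selected through the IOContext

println(utf8_out)
println(latin1_out)
```

Because the more specific `Type{<:UTF8Encoding}` and `Type{<:Latin1Encoding}` methods win over the `Type{<:ASCIIEncoding}` fallback, an encoding that declares itself a subtype of `ASCIIEncoding` gets working ASCII output without defining anything else.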

Please take a look at what I already have working in Strs.jl; it seems like you are reinventing the wheel.
A simple type hierarchy will simply not be flexible enough to handle all the different issues with character sets, encodings, and character set encodings.
It’s important to keep track separately of things like the character set from the character set encoding:
For example, UTF16CSE (character set encoding) is a type with two parameters, the character set CharSet{:UTF32}, and the encoding Encoding{:UTF16}. This makes it easy to dispatch appropriately, and add new types and traits in the future.
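To show the general shape of that idea, here is a standalone sketch. These are NOT the Strs.jl definitions: the types are redeclared locally as stand-ins, `UTF8CSE` is an assumed second example, and `charset_of` and `codeunit_bytes` are hypothetical helper names.

```julia
# Stand-in types, declared here only for illustration.
struct CharSet{CS} end
struct Encoding{E} end
struct CSE{CS,E} end   # character set encoding = character set + encoding

const UTF16CSE = CSE{CharSet{:UTF32}, Encoding{:UTF16}}
const UTF8CSE  = CSE{CharSet{:UTF32}, Encoding{:UTF8}}   # assumed for the example

# Extract the character set parameter (hypothetical helper).
charset_of(::Type{CSE{CS,E}}) where {CS,E} = CS

# Dispatch on the encoding alone, whatever the character set is.
codeunit_bytes(::Type{<:CSE{<:Any,Encoding{:UTF16}}}) = 2
codeunit_bytes(::Type{<:CSE{<:Any,Encoding{:UTF8}}}) = 1

println(charset_of(UTF16CSE) === charset_of(UTF8CSE))  # true
println(codeunit_bytes(UTF16CSE))                      # 2
println(codeunit_bytes(UTF8CSE))                       # 1
```

Keeping the two parameters separate means a method can match on either one independently, which a single linear hierarchy cannot express.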

I think that the character set should be determined by the AbstractChar or AbstractString type.

If you’d looked at what I’ve already implemented, you’d see that I already do that.
(Remember, I implemented AbstractChar before it was put in Base)
Since String and Char are not part of the Str and CodePoint types, and for convenience,
I have cse and charset functions that return the character set encoding and character set respectively.