The documentation entries for `write` and `print` seem very similar. There are a few minor differences (`write` allows file names as well as I/O streams and returns the number of bytes written, while `print` calls `show` if the argument doesn't have a canonical text representation), but it seems like they would be easy to combine into a single function. What is the motivation for having both? Is there a fundamental difference between their behavior that I'm not noticing?
`print` outputs a text representation of an object, whereas `write` outputs raw bytes. Try `x = 7310302560386184563` and look at `println(x)` versus `write(stdout, x); println()` to see the difference.
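For anyone reading along without a REPL handy, here is roughly what that looks like (the exact output of `write` assumes a little-endian machine, since the integer is written in its native byte order):

```julia
julia> x = 7310302560386184563
7310302560386184563

julia> println(x)   # print/println: the decimal *text* representation
7310302560386184563

julia> write(stdout, x); println()   # write: the 8 raw bytes of the Int64
surprise
```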
Holy smokes. That was well played. I just assumed that `write` would output text similar to `writedlm` and `writecsv`, based on the name similarity.

Given the drastic difference in output, would it be clearer to rename those functions to `printdlm` and `printcsv`?
very clever, steven.
There will be another difference in v0.7/v1.0, if you have strings with different encodings:

`print*` outputs strings in UTF-8 format (hopefully that can be changed so that the desired output encoding can be selected), no matter how the string is encoded. This is important if you have strings with different encodings that you want to print out together.

`write` outputs strings in their "native" encoding, without conversions. If you have a UTF-8 encoded string, that is what gets output. If you have an ISO-8859-1 (i.e. Latin-1) string, then that gets output directly. Same for UTF-16 or UTF-32: they will be output (in native byte order) as a sequence of 2-byte or 4-byte words.
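A minimal sketch of that distinction, using a hypothetical `Latin1String` wrapper (not a real Base or package type; Latin-1 code units map one-to-one onto Unicode code points, which makes the transcoding trivial):

```julia
# Hypothetical single-byte Latin-1 string type, for illustration only.
struct Latin1String
    data::Vector{UInt8}   # Latin-1 code units
end

# print: produce *text* -- transcode to the stream's encoding (UTF-8 for
# built-in IO types) by mapping each Latin-1 byte to its Unicode code point.
Base.print(io::IO, s::Latin1String) = print(io, String(Char.(s.data)))

# write: emit the raw native-encoding bytes, without any conversion.
Base.write(io::IO, s::Latin1String) = write(io, s.data)

s = Latin1String([0x63, 0x61, 0x66, 0xe9])  # "café" in Latin-1
print(stdout, s)   # prints "café" as 5 UTF-8 bytes: 0xe9 becomes 0xc3 0xa9
write(stdout, s)   # emits the 4 original bytes: mojibake on a UTF-8 terminal
```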
`print` only outputs UTF-8 for the built-in IO types, which are defined to be UTF-8 encoded. In order to support I/O defaulting to other encodings, the appropriate approach is to define new IO subtypes with different default encodings. Nothing in Base needs to be changed for this, as far as I'm aware.
The general principle is that for `print`, the output stream determines the encoding (UTF-8 by default for all built-in IO types); for `write`, the object itself determines the raw data, including the encoding. Endianness may be an exception to that, but again, I/O types with different endianness can and should be added outside of Base.
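(For byte order specifically, Base already provides value-level conversions such as `hton`/`ntoh` and `htol`/`ltoh` that compose with `write`, so a fixed-endianness format can be handled without a custom IO type:)

```julia
io = IOBuffer()
write(io, hton(0x01020304))   # hton: host to network (big-endian) byte order
take!(io) == UInt8[0x01, 0x02, 0x03, 0x04]   # true on any host endianness
```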
OK. I was wondering if it would make sense to add the character-set encoding to the `IOContext`.
I don't think it would be an exception; at least for what I've implemented for different endian encodings, it would not be necessary.
Yes, I've been playing with a patch along these lines, with a function:
"""
textencoding(io::IO)
Returns the encoding (a subtype of [`AbstractTextEncoding`](@ref)) used
for reading/writing text via [`print`](@ref) in the stream `io`.
Defaults to [`UTF8Encoding`](@ref), but can be changed by setting
the `:textencoding` property of an [`IOContext`](@ref). Other encodings
may be defined/supported by external packages.
"""
textencoding(io::IO) = get(io, :textencoding, UTF8Encoding)::Type{<:AbstractTextEncoding}
# dispatch based on encoding:
print(io::IO, c::AbstractChar) = _print(io, textencoding(io), c)
print(io::IO, s::AbstractString) = _print(io, textencoding(io), s)
(Particular `IO` types could also overload `textencoding` this way, rather than relying exclusively on `IOContext`.)
The reason that I'm thinking of having it return a type, rather than an instance of a type, is that it looks like it will be useful for all the encoding types to be abstract, so that they can be subtyped arbitrarily. If `Encoding1 <: Encoding2`, that means that any valid text in `Encoding2` has the same meaning in `Encoding1`, but not vice versa. So, for example, `UTF8Encoding <: ASCIIEncoding`. You want to be able to have `_print(io::IO, encoding::Type{<:ASCIIEncoding}, c::ASCIIChar) = write(io, UInt8(c))`, for example.
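To make that concrete, here is a minimal, self-contained sketch of the proposed hierarchy and dispatch (using `myprint` instead of extending `Base.print`, and restricting the `ASCIIEncoding` method to ASCII-representable characters, since `ASCIIChar` doesn't exist in Base):

```julia
abstract type AbstractTextEncoding end
abstract type ASCIIEncoding <: AbstractTextEncoding end
abstract type UTF8Encoding <: ASCIIEncoding end   # any valid ASCII text is valid UTF-8

textencoding(io::IO) = get(io, :textencoding, UTF8Encoding)::Type{<:AbstractTextEncoding}

# encoding-aware printing, dispatching on the stream's encoding:
myprint(io::IO, c::Char) = _print(io, textencoding(io), c)

# any ASCII-compatible encoding can write an ASCII char as a single byte:
_print(io::IO, ::Type{<:ASCIIEncoding}, c::Char) =
    isascii(c) ? write(io, UInt8(c)) : error("char not representable in this encoding")

io = IOBuffer()
myprint(io, 'a')                                             # default UTF8Encoding
myprint(IOContext(io, :textencoding => ASCIIEncoding), 'a')  # explicit encoding
String(take!(io))   # "aa"
```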
Please take a look at what I already have working in Strs.jl; it seems like you are reinventing the wheel. A simple type hierarchy will simply not be flexible enough to handle all the different issues with character sets, encodings, and character set encodings. It's important to keep track of things like the character set separately from the character set encoding: for example, `UTF16CSE` (character set encoding) is a type with two parameters, the character set `CharSet{:UTF32}` and the encoding `Encoding{:UTF16}`. This makes it easy to dispatch appropriately and to add new types and traits in the future.
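As a rough paraphrase of that parameterization (for illustration only, not Strs.jl's actual source):

```julia
struct CharSet{CS} end
struct Encoding{E} end
struct CSE{CS<:CharSet, ENC<:Encoding} end   # "character set encoding"

# UTF-16 is an encoding of the full Unicode (UTF-32) character set:
const UTF16CSE = CSE{CharSet{:UTF32}, Encoding{:UTF16}}

# the character set and the encoding can be dispatched on independently:
charset(::Type{CSE{CS,ENC}}) where {CS,ENC} = CS
encoding(::Type{CSE{CS,ENC}}) where {CS,ENC} = ENC

charset(UTF16CSE)    # CharSet{:UTF32}
encoding(UTF16CSE)   # Encoding{:UTF16}
```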
I think that the character set should be determined by the `AbstractChar` or `AbstractString` type.
If you'd looked at what I've already implemented, you'd see that I already do that. (Remember, I implemented `AbstractChar` before it was put in Base.)
Since `String` and `Char` are not part of the `Str` and `CodePoint` types, and for convenience, I have `cse` and `charset` functions that return the character set encoding and character set, respectively.
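Presumably something of this shape, reusing the `CSE`/`CharSet`/`Encoding` sketch above (a guess for illustration, not Strs.jl's actual definitions):

```julia
# Base's String and Char are UTF-8 encoded Unicode:
const UTF8CSE = CSE{CharSet{:UTF32}, Encoding{:UTF8}}

cse(::Type{String}) = UTF8CSE
cse(::Type{Char})   = UTF8CSE

# the character set falls out of the character set encoding:
charset(::Type{T}) where {T<:Union{AbstractChar,AbstractString}} = charset(cse(T))

charset(String)   # CharSet{:UTF32}
```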