Solution for issue #25216, larger octal literals produce smaller types, sometimes

In reference to this issue raised by @iamed2 (https://github.com/JuliaLang/julia/issues/25216),
regarding the comment by @jeff.bezanson (https://github.com/JuliaLang/julia/issues/25216#issuecomment-353356186), I believe there is a simple solution.

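function os(digits)
    len = length(digits) * 3    # each octal digit carries 3 bits
    r = len & 7                 # bits left over beyond whole bytes
    # Add a byte for the leftover bits, unless they are zero high bits of the first digit (r == 1 or 2)
    len >>> 3 + (r == 1 ? (digits[1] > '3') : r == 2 ? (digits[1] > '1') : 1)
end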

This correctly calculates the number of bytes required. Extra leading zeros still make the returned type larger, unless they only straddle a byte boundary, which avoids strange results such as 0o000 returning 0x0000 while 0o377 returns 0xff.

julia> for str in ("0", "00", "40", "000", "040", "377", "400", "0000", "0400", "00000", "000000") ; println(str, repeat(" ", 8-length(str)), os(str), "  ", typeof(Meta.parse("0o" * str))) ; end
0       1  UInt8
00      1  UInt8
40      1  UInt8
000     1  UInt16
040     1  UInt16
377     1  UInt8
400     2  UInt16
0000    2  UInt16
0400    2  UInt16
00000   2  UInt16
000000  2  UInt32

In reference to the comment: https://github.com/JuliaLang/julia/issues/25216#issuecomment-353567052

The following perfectly well constructed string with octal constants fails in Julia, because of this issue:

julia> String([0o150, 0o145, 0o154, 0o154, 0o157, 0o054, 0o040, 0o127, 0o167, 0o162, 0o154, 0o144, 0o041])
ERROR: MethodError: Cannot `convert` an object of type Array{UInt16,1} to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] String(::Array{UInt16,1}) at ./sysimg.jl:77

You need to explicitly force them to all be UInt8, as follows, to get the intended result:

julia> String(UInt8[0o150, 0o145, 0o154, 0o154, 0o157, 0o054, 0o040, 0o167, 0o157, 0o162, 0o154, 0o144, 0o041])
"hello, world!"

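A version-independent way to get the same effect is to narrow the values after the fact. This is only a sketch: the `UInt16` element type is forced here to mimic the failing case, and the names `codes`, `bytes`, and `greeting` are made up for illustration.

```julia
# Code units for "hello, world!", deliberately stored in a wider element type
codes = UInt16[0o150, 0o145, 0o154, 0o154, 0o157, 0o054, 0o040,
               0o167, 0o157, 0o162, 0o154, 0o144, 0o041]

# Narrow every element to UInt8; UInt8(x) throws an InexactError if x > 0xff
bytes = UInt8.(codes)

greeting = String(bytes)
println(greeting)
```

The broadcasted `UInt8.(codes)` fails loudly if any value does not fit in a byte, so it is safe to use on literals whose inferred type you are unsure of.
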
I realize that there’s not much call for using octal literals these days (Swift even dropped octal escapes from string literals: only the sequence \0 is allowed, and any following digit is taken to be a separate character), but if they are going to be in the language, they should be done in as sane a way as possible. (They made a lot of sense back on the 18- and 36-bit machines that I used at MIT in the early ’80s, i.e. DEC-10, DEC-20, and Lisp Machines, with 9-bit bytes, but not so much these days!)

Interesting formula. It seems to give reasonable results for small numbers with few leading zeros.
How would I predict the number of required bytes in simple terms (how many leading zeros do I need to force a UInt64)?
What exactly is the problem being solved?

I verified that whenever two octal literals have the same number of digits (with or without leading zeros), the smaller one never requires more bytes than the bigger one.

I found two examples where the formula is one byte too high:

julia> os(oct(0x200000))
4
julia> os(oct(0x200000000000000000000000000000))
16
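For anyone trying this on a newer Julia, here is a sketch of the same check. It assumes Julia 1.x, where `string(n; base = 8)` replaces `oct(n)`; `exact_bytes` is a hypothetical helper added only for comparison, and `os` is copied from the first post so the block runs on its own.

```julia
# `os` copied from the first post
function os(digits)
    len = length(digits) * 3
    r = len & 7
    len >>> 3 + (r == 1 ? (digits[1] > '3') : r == 2 ? (digits[1] > '1') : 1)
end

# Hypothetical reference: bytes needed for the value itself, ignoring leading zeros
exact_bytes(n::Unsigned) = cld(ndigits(n; base = 2), 8)

for n in (0x200000, 0x200000000000000000000000000000)
    s = string(n; base = 8)   # assumes Julia 1.x; replaces the older `oct(n)`
    println(length(s), " digits: os = ", os(s), ", exact = ", exact_bytes(n))
end
```

Both literals have a digit count that is a multiple of 8 (8 and 40 digits), so `r == 0` and the last branch of the formula adds a byte even though the digits already fill whole bytes.
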

I found an even simpler solution, which guarantees that, among (octal) number literals with the same number of digits, a larger value never gets a smaller binary type:

  • In case of leading zeros, replace the first 0 by 1 and take the required size for the modified literal.

The current implementation behaves as if the 0 had been replaced by the maximal digit of the base, which leads to different results for octals.
It was easy to put that rule into PR #25259.
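
To make the two rules easy to compare, here is a rough sketch. It is not the parser's code and not the code in the PR: `literal_bytes` and its `lead` argument are hypothetical names, and it assumes Julia 1.x for `parse(...; base)`, `ndigits(...; base)`, and `nextpow`. A leading `0` is swapped for `lead`, the resulting value is measured, and the size is rounded up to 1, 2, 4, 8, or 16 bytes.

```julia
# Hypothetical helper: size in bytes of the unsigned type for a literal's digits,
# when a leading '0' is treated as the digit `lead` before measuring the value.
function literal_bytes(digits::AbstractString, base::Integer, lead::Char)
    s = startswith(digits, "0") ? string(lead, digits[2:end]) : String(digits)
    value = parse(BigInt, s; base = base)
    bits = ndigits(value; base = 2)      # significant bits of the modified value
    nextpow(2, cld(bits, 8))             # round up to a power-of-two byte count
end

for d in ("000", "040", "377", "400", "000000")
    println(rpad(d, 8), "lead '1': ", literal_bytes(d, 8, '1'),
            "   lead '7': ", literal_bytes(d, 8, '7'))
end
```

With `lead = '7'` the sketch reproduces the sizes in the table from the first post (`000` and `040` take 2 bytes, `000000` takes 4), while `lead = '1'` gives 1, 1, and 2 bytes for the same inputs. For the hex strings `"00"` and `"000"`, `'1'` and `'f'` give the same size in this sketch.
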

I think that it is more important to get a consistent solution, rather than the simplest.
Is your solution consistent with the way leading zeros work with hex constants or binary constants?

edit: I’m not saying that it isn’t, I just haven’t been able to try it yet - if so, very good!

Yes, it is! By “simple” I mean elegant, concise, and adapted to the problem.
Have a look at the source code to see that the same function is used for oct, bin, and hex as well.

Btw, your example works, of course:

julia> String([0o150, 0o145, 0o154, 0o154, 0o157, 0o054, 0o040, 0o127, 0o167, 0o162, 0o154, 0o144, 0o041])
"hello, Wwrld!"

Oops! Typo there, that was supposed to be "hello, world!", of course!
My octal must be rusty! 🙂