Addressing raw string syntax and semantics for Julia 2.0?

But that’s not the only case, right? It emerges anywhere in the string where you’ve got a backslash that precedes a quote – not just at the end of the string. Moreover, how can I fix an issue with raw strings when using a string macro that depends upon raw strings? One more item: my use cases involve existing string macros that use raw strings, like regex.

julia> print(raw"She said: \"Hello World\"")
She said: "Hello World"

This broken stair moves around quite a bit. The inconsistent behavior is really challenging.

help?> @raw_str

Create a raw string without interpolation and unescaping. The exception is that quotation marks still must be escaped. Backslashes escape both quotation marks and other backslashes, but only when a sequence of backslashes precedes a quote character. Thus, 2n backslashes followed by a quote encodes n backslashes and the end of the literal while 2n+1 backslashes followed by a quote encodes n backslashes followed by a quote character.

  julia> println(raw"\ $x")
  \ $x
  julia> println(raw"\"")
  "
  julia> println(raw"\\\"")
  \"
  julia> println(raw"\\x \\\"")
  \\x \"

If it takes a paragraph and several examples to describe the rules (which also apply to the regular expression string macro), how do you expect an average user of Julia to become proficient? This is just a broken stair. Let’s not pretend otherwise.
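For readers who want to check the quoted rule directly, here is a minimal sketch. The variable names are only illustrative, and the results can be compared against ordinary escaped strings to see what each raw literal actually contains.

```julia
# 2n backslashes + quote   -> n backslashes, then the literal ends.
# 2n+1 backslashes + quote -> n backslashes plus a literal quote.
path   = raw"C:\dir\\"   # \d is untouched; the final pair collapses to one backslash
quoted = raw"\\\"x"      # three backslashes + quote -> one backslash, a quote, then x
plain  = raw"a\\b"       # no quote follows, so both backslashes survive

# Regex literals share the raw-string rules, so the same escaping applies.
quote_re = r"\""         # a pattern that matches a single double quote

println(path)    # C:\dir\
println(quoted)  # \"x
println(plain)   # a\\b
println(occursin(quote_re, "say \"hi\""))  # true
```

The surprise is that the same two backslashes mean different things in `path` and `plain`, depending only on whether a quote happens to follow them.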

But wouldn’t yet-another quote syntax ' just further shift around this “broken stair?” Or would you prefer it to be impossible for ' strings to include the ' character instead of supporting its escaping?

Personally, I almost never escape strings because — I agree — it’s fiddly; I just use triple quotes instead. Works with raw"""...""" and r"""...""", too. Then you only ever need to worry about final "s or embedded triple-quotes, which are exceedingly rare.
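A small sketch of the triple-quote workaround described above, with made-up sample text. It assumes nothing beyond Base; the variable names are illustrative.

```julia
# Inside triple quotes, lone double quotes need no escaping.
said = raw"""She said: "Hello World" (saved in C:\temp)"""

# The same holds for regex literals: this pattern contains three bare quotes.
href = r"""<a href="([^"]*)">"""
m = match(href, "<a href=\"index.html\">")

println(said)           # She said: "Hello World" (saved in C:\temp)
println(m.captures[1])  # index.html
```

The remaining edge cases are the ones already mentioned: content that ends in a quote, or content that itself contains three quotes in a row.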

@mbauman This discussion is now arguing that a paired delimiter (such as smart quotes “...”) could offer improved semantics for raw strings, as used by regular expressions and other string macros, by removing the need for complex escaping rules. In this case, the raw string would end when its matching closing delimiter is encountered, allowing for nested strings. Some strings could not be represented this way, but that would be acceptable for the design, since regular double-quoted strings can already represent any string.

Are there real world examples where “...” would work and """...""" wouldn’t?

Sure, you can freely substitute whatever characters you want into my question; e.g.,

But wouldn’t yet-another quote syntax “...” just further shift around this “broken stair?” Or would you prefer it to be impossible for “...” strings to include the ” character instead of supporting its escaping?

I take it you’d prefer the latter — but that also sounds like it could be described as a “broken stair.” Sure, it’s a stair with a smaller surface area with a unicode delimiter (and even smaller yet if it’s a matched pair), but it’s also more annoying to write a unicode delimiter in the first place.

Note that you can also (ab)use cmd strings for this — and they too can support triple-quotes:

julia> macro raw_cmd(ex); ex; end
@raw_cmd (macro with 1 method)

julia> println(raw`She said: "His name is O'Brien"`)
She said: "His name is O'Brien"

FYI, I have no idea how I would write this on my keyboard and I sure don’t want to look up a Unicode escape just for writing some escaped string… Maybe I’m in the minority, but I really don’t think importing verbosity for commands from e.g. LaTeX (where I dread writing long command after long command just to render some formulas) is a good fit for this.

Moreover, you don’t need raw strings to have " in a string literal?

julia> a = "Hello \"World\""
"Hello \"World\""

julia> a[7]
'"': ASCII/Unicode U+0022 (category Po: Punctuation, other)

julia> a |> collect
13-element Array{Char,1}:
 'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 '"': ASCII/Unicode U+0022 (category Po: Punctuation, other)
 'W': ASCII/Unicode U+0057 (category Lu: Letter, uppercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
 'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)
 'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
 '"': ASCII/Unicode U+0022 (category Po: Punctuation, other)

so I’m not even sure why the first tool you’re reaching for in these cases is raw strings…

First, people aren’t proposing to get rid of double quotes "string", and the topic has been revised so that reclaiming single quotes 'c' is no longer under consideration. Rather, the present question is about an additional/alternate termination character or pair, not replacement or removal. So a typical user could go on using double quotes as they always have.

Second, I believe “” is more an example than a serious proposal. As discussed above, such quotes look very similar to "" in some fonts, and so a less ambiguous quote might make more sense to avoid confusion.

Third, the difficulty of typing Unicode shouldn’t really factor in here. There are REPL shortcuts such as \alpha, and any alternate terminator would presumably come with an appropriate shortcut to make it handy. The use case being discussed is admittedly not one you encounter, but some others are bedevilled by tricky escaping. It’s not easy to get rid of escaping altogether, but we could well imagine an (optional/alternate) termination character for Julia that doesn’t often conflict with SQL or regexes. Those who want the convenience would be expected to deal with the pain of typing \specialtermination or configuring the appropriate hotkey in their editor.

Second, I believe “” is more an example than a serious proposal.

Someone making a serious proposal would likely make this conversation more efficacious.

There is an opportunity for the Julia community to better utilize literal string notations, such as regular expressions, to enhance developer experience and improve code maintainability. Other sorts of literal strings include support for well-known string template formats, such as HAML. In specific fields of science, other notations may also be quite helpful, so rather than asking the Julia community for a new syntax extension… a notation can be implemented and released straightaway. For example, people often want their own matrix format; this could be done as a literal string notation.

A requirement of these custom string literals is that they should not have standard processing, escaping or interpolation. To support these micro languages, Julia’s founders created the notion of a “raw string”, which string macros, such as regular expressions, inherit. Most of the time, the raw string processing rules do appear to meet their advertised mission, and pass along a string literal to the notation without escaping. This works so well, in fact, that users often get comfortable, until they trigger the conditions of the escaping rule and get unexpected behavior. These rules are certainly not obvious to a newcomer, and they may even escape seasoned Julia developers. The commonly suggested work-around is to use the triple double-quoted form, but this is extremely heavyweight. In short, the existing mechanism is a burden to those who wish to build and effectively use string literal notations.

What would be an alternative? A paired Unicode character combination that is not currently used seems like it could be a useful basis for a “notation” syntax for Julia. There are many advantages.

  1. The parsing algorithm is simple to understand. The Julia parser would know the notation has ended when the number of ending tokens matches the number of opening tokens. This is a learnable rule.

  2. By using a Unicode character pair we are able to directly represent existing ASCII text protocols (single quote, double quote, backslash, forward slash, percent sign, etc.) without the need for escaping.

  3. Paired notation could nest if the notation wishes. This could be advantageous if one syntax format used another, or for a recursive use of the same notation. Right now, to have recursive use of notations, you must drop the notion of a string literal and use a macro, which (like the triple double-quoted form) increases verbosity.

  4. Paired notation could span multiple lines without needing a “tripled” version of the indicators. This would help it be quite succinct.

  5. The notation, by itself without any character in front of it, could be used for “raw strings”. This would complement double-quoted literals.
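The counting rule in item 1 can be sketched in a few lines. This is only an illustration of the idea, not a proposal for how the Julia parser would implement it; `paired_body` is a hypothetical helper, and the ⟪ ⟫ pair is an assumed choice of delimiters.

```julia
# Hypothetical helper: return the body of a paired-delimiter literal.
# `src` must start with `opener`; the body ends where the depth returns to zero.
function paired_body(src::AbstractString; opener::Char='⟪', closer::Char='⟫')
    (!isempty(src) && first(src) == opener) ||
        throw(ArgumentError("literal must start with $opener"))
    depth = 0
    start = nextind(src, firstindex(src))      # first index after the opener
    for i in eachindex(src)
        c = src[i]
        if c == opener
            depth += 1
        elseif c == closer
            depth -= 1
            # Balanced again: the outermost literal ends here.
            depth == 0 && return src[start:prevind(src, i)]
        end
    end
    throw(ArgumentError("unterminated literal: $depth delimiter(s) still open"))
end

println(paired_body("⟪a ⟪b⟫ c⟫ tail"))   # a ⟪b⟫ c
```

Quotes and backslashes in the body pass through untouched because the scan looks only at the two delimiter characters.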

There are some design considerations.

  1. Should it have escaping? Nope. The notation system itself could define its own escaping if it wishes to. For literals without a prefix, there are always double-quoted strings, which can handle any code point with backslash escapes.

  2. The tokens should be visually distinct from single and double quotes, right-left brackets, and other operators to avoid visual confusion.

  3. The token characters picked should be single-width so that they do not cause funky indentation in a courier font.

  4. Ideally, they are marked as punctuation in Unicode, yet do not conflict with usage patterns in common languages.

  5. What’s not a design consideration is how to enter it; that’s a user interface issue. What’s important is that it is visually attractive.

So, what is a straw-man proposal? I’m not sure. Here’s one that has not been mentioned yet. However, I think finding a paired character combination can happen after we have general consensus on the idea.

'｢': Unicode U+FF62 (category Ps: Punctuation, open)
'｣': Unicode U+FF63 (category Pe: Punctuation, close)

Well, I don’t like it that much. However, it is visually distinct. It seems to be a cousin of Japanese quotations (「…」) that are not single width.

'⟩': Unicode U+27E9 (category Pe: Punctuation, close)
'⟨': Unicode U+27E8 (category Ps: Punctuation, open)

So… r⟨"[^"]*"![[⟩ ?
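The width and category considerations can be checked mechanically for any candidate pair. A rough sketch follows; note that `Base.Unicode.category_abbrev` is an internal, unexported helper, so relying on it is an assumption that may not hold across Julia versions.

```julia
# Candidates are written as escapes so that fonts cannot blur them:
# halfwidth corner brackets, mathematical angle brackets, CJK corner brackets.
candidates = ['\uff62', '\uff63', '\u27e8', '\u27e9', '\u300c', '\u300d']

for c in candidates
    # category_abbrev is internal to Base.Unicode (an assumption, see above).
    println(c, "  U+", uppercase(string(UInt32(c), base=16, pad=4)),
            "  category ", Base.Unicode.category_abbrev(c),
            "  width ", textwidth(c))
end
```

A pair fits the list if both characters report punctuation categories (Ps and Pe) and occupy a single column.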

Anyway. It seems kinda pointless to go through characters unless there’s even some interest in the path.

How about using string interpolation combined with a macro?

julia> macro dq_str(name)
           return :( "\"" * $name * "\"" )
       end
@dq_str (macro with 1 method)

julia> "select * from conversation_table where greeting = $(dq"Hello")"
"select * from conversation_table where greeting = \"Hello\""

julia> "She said: $(dq"His name is O'Brien")"
"She said: \"His name is O'Brien\""

julia> greeting = dq"Hello"
"\"Hello\""

julia> "select * from conversation_table where greeting = $greeting"
"select * from conversation_table where greeting = \"Hello\""

julia> she_said = dq"His name is O'Brien"
"\"His name is O'Brien\""

julia> "She said: $she_said"
"She said: \"His name is O'Brien\""

I personally find this more readable than a lot of the above examples. Overall, I think macros and string interpolation give you a lot of flexibility to match your programming style. My main question here is whether there should be some standard macros to help with this so Julia code is easily readable by many people.

Maybe it’s feasible to take the approach to delimiting strings from the sed program? It lets the user choose whatever delimiter is convenient each time, so that no characters need to be escaped at all. In Julia such an approach could look similar to this, for example:

raw"regular string"
raw|delimited by vertical bars|
raw/delimited by slashes/
raw'single quotes'
# etc

Not exactly as shown, since that would be ambiguous to parse. But something along these lines should be possible, I think.
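To make the sed-style idea concrete, here is a rough sketch of the scanning rule only. `delimited_body` is a hypothetical helper that works on plain text; it says nothing about how such literals would be fitted into Julia's grammar.

```julia
# Hypothetical helper: the first character chooses the delimiter, and the
# body runs up to the next occurrence of that same character.
function delimited_body(src::AbstractString)
    isempty(src) && throw(ArgumentError("expected a delimiter character"))
    delim = first(src)
    start = nextind(src, firstindex(src))
    stop = findnext(isequal(delim), src, start)
    stop === nothing && throw(ArgumentError("no closing $delim found"))
    return src[start:prevind(src, stop)]
end

println(delimited_body("|say \"hi\" to O'Brien|"))  # say "hi" to O'Brien
println(delimited_body("/a|b/ rest"))               # a|b
```

Unlike a matched pair, a single repeated delimiter cannot nest, so the writer has to pick a character that the body does not contain.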

This macro is a bad idea, because now you’ve got a trivially exploitable SQL injection:

julia> "select * from conversation_table where greeting = $(dq"Hello\"; select * from secret_table where \"\" = \"")"
"select * from conversation_table where greeting = \"Hello\"; select * from secret_table where \"\" = \"\""

This is why escaping input is necessary.
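As a rough illustration of what escaping buys here, the sketch below uses a hypothetical `quote_value` helper. It assumes a SQL dialect in which a doubled quote inside a double-quoted literal stands for one quote character; that is an assumption, not a property of every database.

```julia
# Hypothetical helper, not from any library. Assumes a dialect where ""
# inside a double-quoted literal means one literal quote character.
quote_value(s::AbstractString) = "\"" * replace(s, "\"" => "\"\"") * "\""

hostile = "Hello\"; select * from secret_table where \"\" = \""
query = "select * from conversation_table where greeting = $(quote_value(hostile))"
println(query)
```

Under that assumption the embedded quotes can no longer close the literal early. In real code, passing values through the database driver's parameter binding is usually the safer route than any hand-written quoting.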

The idea is right though — it’s a huge win if you can restructure your APIs such that you don’t need to worry about manually escaping things. Cf. Shelling Out Sucks and

Regarding paired Unicode delimiters. How about…

'⟪': Unicode U+27EA (category Ps: Punctuation, open)
'⟫': Unicode U+27EB (category Pe: Punctuation, close)

For regular expressions… r⟪"[^"]*"![[⟫

It’s actually quite pretty for nested, multi-line expressions (yes, the htl notation would have to provide $ interpretation as well as delegating to Meta.parse the Julia sub-expressions).

using HypertextLiteral

render(books) = htl⟪
  <table><caption><h3>Selected Books</h3></caption>
  <thead><tr><th>Book<th>Authors<tbody>$(htl⟪
    <tr><td>$(book.name) ($(book.year))<td>$(join(book.authors, " & "))
    ⟫ for book in books)</tbody></table>⟫

It’s kinda like a double quote and a function call in one character. The ability to nest expressions, perhaps using other notations, is quite fantastic. I could picture mixing sql⟪⟫ on the outside, using something like this for the leaves… the cool part is that anyone could make relevant notations for their task, and tightly integrate it with their code.

The critical part is that Julia doesn’t have to care about what’s between paired ⟪...⟫; it only has to count the enters and exits (delegating the content to a function that returns an Expr when the exits match the enters). Everything else can be delegated to the notation, including what subordinate notations it wishes to support. Notations invoked this way can worry about their own escaping; for example, HTML uses the ampersand and URLs use the percent sign.

I’m not an expert in language design, but from an outside view, this looks like an arms race:

  • Language A: let’s use "string".
  • Language B: let’s use 'string', because we want people to be able to include language-A statements unescaped inside a string. This is useful for SQL, HTML and stuff.
  • Language C: let’s use `string`, so that people can include language-A and language-B statements unescaped inside a string.

This is just waiting for language D to arrive.

This proposal differs from escaping protocols due to the delimiter pairing. Indeed, a subordinate language could use the same paired delimiters, without having to be modified or losing visual effectiveness. You can think of it as bringing to string construction what we already know about function calls and data structures – that they are seldom flat structures. Besides, if it’s an arms race, shouldn’t Julia pull ahead by using a breakthrough technology that is now mature (Unicode)?

What about subordinate notations that don’t use the chosen delimiter in matching pairs? For example, if we used double curly braces, r{{...}}, it’d mean that regular expressions couldn’t use either the opening or the closing delimiter without pairing them – although the embedded regex notation could add a Julia-specific escape for the chosen delimiters. That said, most subordinate syntaxes either use the delimiters in a paired way or have a way to escape them. The critical part is that the delimiters are paired – both Julia and the users of the syntax need only match delimiter pairs.

How to avoid the chosen delimiter pair is the notation’s concern, not Julia’s. This has precedent in web technologies. In web pages, embedded JavaScript begins with <script> and ends with… </script>. There is no escape from the HTML parser’s perspective: it treats everything as script content until it encounters </script>. Therefore, JavaScript developers who want </script> inside a string literal often write the ending tag as "<\/script>". But, back to the main point – in this proposal, the notation system need not represent all code points. If a specific subordinate notation needs that, it can provide its own escaping that is customary and/or congruent with its own textual representation.

I think a paired delimiter would make a lot of sense, definitely more than going to single quotes or triple-double. Imagine a world where parentheses were not paired. For things like DSLs and string macros, a paired delimiter would be much nicer.

I want to thank everyone here for participating in a spirited discussion and I realize we may not agree. Your ideas and critiques have helped this coalesce into an actionable feature request, and I’ve submitted it as #38948 where it can be more formally discussed, and perhaps even accepted. Thank you for your time.

This thread has been locked at @cce’s request. Any further points should be directed to the GitHub issue linked above.
