Problems with deprecations of islower, lowercase, isupper, uppercase

If you want to make a fork of FemtoCleaner that implements this change and then fork the Julia parser to implement the deprecation, please go for it. Otherwise, frankly, this isn’t going to get done. Of course, even then, it might not be accepted as a change.

Like I already said, I don’t think that that is the best way forward.

  1. Introduce the cleaner syntax for interpolation, hex and Unicode literals, leaving the legacy ones in place,
    (like my F"…" string macro in JuliaString/StringLiterals.jl, except without the ‘F’),
    along with a “pure” string syntax (i.e. like my f"…" formatted string macro)
  2. Use the new syntax as much as possible in Base and stdlib, at least for all ‘$(…)’, where it would make no difference in the length of the lines.
  3. Possibly add an optional deprecation mode that warns the first time the legacy sequences are used.
  4. Make FemtoCleaner able to do the conversion automatically.
  5. Maybe for 2.0, actually deprecate the legacy sequences, making “…” the same as f"…", if by that time people agree that it’s better to move forward with just the cleaner syntax, like Swift has done.

#1 is actually very simple and could be done quickly; the code already exists in my package.

If that’s the plan then none of this needs to be done for 1.0 and the only thing that needs to be done now is the blacklist change for $ interpolation.

No, I still think it would be better for 1.0 to already have the Swift style escape sequences built in, along with the f"…" version that doesn’t require $ to be quoted any more.
That would not be breaking at all, and would be advantageous to people dealing with lots of $s in their strings.
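
To make the idea concrete, here is a minimal sketch of a string macro in which `$` is an ordinary character and Swift-style `\(name)` does the interpolation. The macro name `sw_str` is hypothetical, it only handles bare identifiers inside `\(…)`, and it is not the StringLiterals.jl implementation; it assumes Julia 1.x.

```julia
# Hypothetical string macro `sw"…"`: `$` is literal text and `\(name)`
# interpolates a variable. Only bare identifiers are supported in this sketch.
macro sw_str(s)
    parts = Any[]
    pos = 1
    for m in eachmatch(r"\\\(([A-Za-z_][A-Za-z0-9_]*)\)", s)
        push!(parts, s[pos:prevind(s, m.offset)])   # literal text before the match
        push!(parts, esc(Symbol(m.captures[1])))    # variable resolved in the caller's scope
        pos = m.offset + ncodeunits(m.match)
    end
    push!(parts, s[pos:end])                        # trailing literal text
    return :(string($(parts...)))
end

n = 3
println(sw"Price: $5 for \(n) items")   # prints: Price: $5 for 3 items
```

Because it is an ordinary string macro, it can live alongside the existing `$` interpolation without breaking anything.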

Another thought (one that won’t work currently, because the parser doesn’t allow defining a macro with the name $_str): a $"…" string macro would allow the $identifier form that Jeff loves for its being shorter. I’ve seen that in some other languages, where string literals need to be prefixed with $ to allow $ for interpolation in the string.

The fact that it’s not breaking at all is precisely why it does not need to be in 1.0. We have more work than we can do for the already-past-due feature freeze. New changes are not happening unless they are absolutely critical.

I’m thinking more about how to make the language better for all the people who may look at it again, when it finally gets that v1.0 label.
I’d rather have things in a better state, and not risk people being turned away from Julia.
Of course, that would benefit you, and the other shareholders in JuliaComputing :slight_smile:

I can assure you that shares in Julia Computing are not my main motivation for wanting Julia to succeed. I really don’t think our current string interpolation syntax is so horrible that anyone is going to reject the language because of it.

It’s the other string handling issues that I feel will unfortunately cause people to reject Julia.

No need to worry, I fixed all of those.

Actually, the performance has taken a nosedive, and lots of things have broken…

This thread is becoming a litany of things that annoy Scott P Jones rather than having a specific focus.

You keep saying that string performance has gotten worse with zero evidence, whereas I’ve shown benchmarks that are the same or better. At this point this just seems like FUD you’re putting out there because you disapprove of some of the changes. (Which perhaps not coincidentally revert a lot of changes you made a few years ago, as I’ve already pointed out.) Either provide some real benchmarks or please stop making baseless claims.

Regarding things being broken – yes, that’s what happens when you make breaking changes. Packages defining string types will have to adjust.

I’m preparing the evidence (and a fix to the mess).
Here’s just one of the things that has killed performance:

13:03 $ julia7
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.3108 (2017-12-19 11:51 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit c16c0cb* (0 days old master)
|__/                   |  x86_64-apple-darwin17.3.0

julia> f(ch::Char) = UInt32(ch)
f (generic function with 1 method)

julia> g(codepoint::UInt32) = Char(codepoint)
g (generic function with 1 method)

julia> @code_native f(' ')
	.section	__TEXT,__text,regular,pure_instructions
; Function f {
; Location: REPL[1]:1
; Function Type; {
; Location: sysimg.jl:124
; Function convert; {
; Location: REPL[1]:1
	pushq	%rax
	testl	%edi, %edi
	js	L13
;}
; Function convert; {
; Location: char.jl:28
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	shrl	$24, %edi
	jmp	L167
;}}
; Location: char.jl:29
; Function leading_ones; {
; Location: int.jl:378
; Function ~; {
; Location: int.jl:261
L13:
	movl	%edi, %eax
	notl	%eax
;}
; Function leading_zeros; {
; Location: int.jl:342
	lzcntl	%eax, %r9d
;}}
; Location: char.jl:30
; Function trailing_zeros; {
; Location: int.jl:354
	tzcntl	%edi, %eax
;}
; Function &; {
; Location: int.jl:277
	andl	$56, %eax
;}
; Location: char.jl:31
; Function ==; {
; Location: promotion.jl:386
	cmpl	$1, %r9d
	sete	%r8b
;}
; Function +; {
; Location: int.jl:43
	leaq	(%rax,%r9,8), %rsi
;}
; Function >; {
; Location: operators.jl:250
; Function <; {
; Location: int.jl:39
	cmpq	$32, %rsi
	seta	%dl
;}}
; Function |; {
; Location: bool.jl:43
	orb	%r8b, %dl
;}
; Function &; {
; Location: int.jl:277
	movl	%edi, %esi
	andl	$12632256, %esi         ## imm = 0xC0C0C0
;}
; Function xor; {
; Location: int.jl:294
	xorl	$8421504, %esi          ## imm = 0x808080
;}
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	cmpq	$31, %rax
	seta	%r8b
	shrxl	%eax, %esi, %esi
;}}
; Function !=; {
; Location: operators.jl:158
; Function ==; {
; Location: int.jl:399
; Function ==; {
; Location: promotion.jl:327
; Function ==; {
; Location: promotion.jl:386
	testl	%esi, %esi
	sete	%cl
;}}}}
; Function |; {
; Location: bool.jl:43
	orb	%r8b, %cl
;}
	xorb	$1, %cl
	orb	%dl, %cl
	jne	L171
; Location: char.jl:34
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	xorl	%ecx, %ecx
	cmpl	$31, %r9d
	movl	$4294967295, %edx       ## imm = 0xFFFFFFFF
	shrxl	%r9d, %edx, %edx
	cmoval	%ecx, %edx
;}}
; Function &; {
; Location: int.jl:277
	andl	%edi, %edx
	cmpq	$31, %rax
;}
; Location: char.jl:35
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	shrxl	%eax, %edx, %edi
	cmoval	%ecx, %edi
;}}
; Location: char.jl:36
; Function &; {
; Location: int.jl:277
	movl	%edi, %eax
	andl	$127, %eax
;}
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	movl	%edi, %ecx
	shrl	$2, %ecx
	andl	$8128, %ecx             ## imm = 0x1FC0
;}}
; Function |; {
; Location: int.jl:293
	orl	%eax, %ecx
;}
; Function >>; {
; Location: int.jl:415
; Function >>; {
; Location: int.jl:409
	movl	%edi, %eax
	shrl	$4, %eax
	andl	$520192, %eax           ## imm = 0x7F000
	shrl	$6, %edi
	andl	$33292288, %edi         ## imm = 0x1FC0000
;}}
; Function |; {
; Location: int.jl:293
	orl	%eax, %edi
	orl	%ecx, %edi
;}}}
L167:
	movl	%edi, %eax
	popq	%rcx
	retq
; Function Type; {
; Location: sysimg.jl:124
; Function convert; {
; Location: char.jl:31
L171:
	movabsq	$malformed_char, %rax
	callq	*%rax
	ud2
	nopl	(%rax)
;}}}

julia> @code_native g(0x20)
ERROR: no unique matching method found for the specified argument types
Stacktrace:
 [1] error at ./error.jl:33 [inlined]
 [2] which(::Any, ::Any) at ./reflection.jl:934
 [3] _dump_function(::Any, ::Any, ::Bool, ::Bool, ::Bool, ::Bool, ::Symbol, ::Bool, ::Base.CodegenParams) at ./reflection.jl:804
 [4] _dump_function at ./reflection.jl:798 [inlined] (repeats 2 times)
 [5] code_native(::Base.TTY, ::Any, ::Any, ::Symbol) at ./reflection.jl:862
 [6] code_native(::Any, ::Any, ::Symbol) at ./reflection.jl:864 (repeats 2 times)
 [7] top-level scope

julia> @code_native g(0x00020)
	.section	__TEXT,__text,regular,pure_instructions
; Function g {
; Location: REPL[2]:1
; Function Type; {
; Location: sysimg.jl:124
; Function convert; {
; Location: REPL[2]:1
	pushq	%rax
	cmpl	$127, %edi
	jbe	L78
;}
; Function convert; {
; Location: char.jl:42
	cmpl	$2097151, %edi          ## imm = 0x1FFFFF
	ja	L109
; Location: char.jl:43
; Function &; {
; Location: int.jl:277
	movl	%edi, %eax
	andl	$63, %eax
	movl	%edi, %ecx
	andl	$4032, %ecx             ## imm = 0xFC0
;}
; Function |; {
; Location: int.jl:293
	leal	(%rax,%rcx,4), %eax
;}
; Location: char.jl:45
	cmpl	$2047, %edi             ## imm = 0x7FF
	jbe	L83
; Location: char.jl:43
; Function <<; {
; Location: int.jl:417
; Function <<; {
; Location: int.jl:410
	movl	%edi, %ecx
	shll	$4, %ecx
;}}
; Function &; {
; Location: int.jl:277
	andl	$4128768, %ecx          ## imm = 0x3F0000
;}
; Function |; {
; Location: int.jl:293
	orl	%ecx, %eax
;}
; Location: char.jl:45
	cmpl	$65535, %edi            ## imm = 0xFFFF
	jbe	L95
; Location: char.jl:43
; Function <<; {
; Location: int.jl:417
; Function <<; {
; Location: int.jl:410
	shll	$6, %edi
;}}
; Function &; {
; Location: int.jl:277
	andl	$251658240, %edi        ## imm = 0xF000000
;}
; Location: char.jl:45
; Function |; {
; Location: int.jl:293
	orl	%eax, %edi
	orl	$4034953344, %edi       ## imm = 0xF0808080
;}
	jmp	L105
; Location: char.jl:41
; Function <<; {
; Location: int.jl:417
; Function <<; {
; Location: int.jl:410
L78:
	shll	$24, %edi
	jmp	L105
;}}
; Location: char.jl:45
; Function <<; {
; Location: int.jl:417
; Function <<; {
; Location: int.jl:410
L83:
	shll	$16, %eax
;}}
; Function |; {
; Location: int.jl:293
	orl	$3229614080, %eax       ## imm = 0xC0800000
	movl	%eax, %edi
;}
	jmp	L105
; Function <<; {
; Location: int.jl:417
; Function <<; {
; Location: int.jl:410
L95:
	shll	$8, %eax
;}}
; Function |; {
; Location: int.jl:293
	orl	$3766517760, %eax       ## imm = 0xE0808000
	movl	%eax, %edi
;}}}
L105:
	movl	%edi, %eax
	popq	%rcx
	retq
; Function Type; {
; Location: sysimg.jl:124
; Function convert; {
; Location: char.jl:42
L109:
	movabsq	$code_point_err, %rax
	callq	*%rax
	ud2
	nopl	(%rax,%rax)
;}}}

Here are the same functions just before #24999 was merged:

julia> @code_native f(' ')
	.section	__TEXT,__text,regular,pure_instructions
; Function f {
; Location: REPL[1]:1
	movl	%edi, %eax
	retq
	nopw	%cs:(%rax,%rax)
;}

julia> @code_native g(0x00020)
	.section	__TEXT,__text,regular,pure_instructions
; Function g {
; Location: REPL[2]:1
	movl	%edi, %eax
	retq
	nopw	%cs:(%rax,%rax)
;}

That is, they were no-ops: they get totally compiled away when used in a function.
These are used heavily when doing processing on strings… I don’t see how you didn’t see how greatly this affects performance, even back two years ago, when I first evaluated your branch on changing the representation of Char.
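
For anyone following along, a small sketch of where the extra instructions come from, assuming Julia 0.7 or later: a `Char` now holds its UTF-8 bytes left-aligned in 32 bits, so getting the raw bits is free, while getting the code point means decoding those bytes.

```julia
# Assumes Julia 0.7 or later, where a Char stores its UTF-8 bytes left-aligned in 32 bits.
for c in ('a', 'é', '€')
    raw = reinterpret(UInt32, c)   # just reinterprets the stored bits
    cp  = UInt32(c)                # decodes the UTF-8 bytes into a code point
    println(repr(c), "  raw = ", repr(raw), "  code point = ", repr(cp))
end
# 'a'  raw = 0x61000000  code point = 0x00000061
# 'é'  raw = 0xc3a90000  code point = 0x000000e9
# '€'  raw = 0xe282ac00  code point = 0x000020ac
```

The ASCII case matches the `shll $24` branch in the listing for `g` above.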

Yes. Up until #24999, we were able to avoid the problems caused by the switch to String in v0.5 by using LegacyStrings. Now that is broken, along with everything else in our code that was written generically using Char (code that does a lot of string processing to analyze unstructured data).

I’ve repeatedly pointed out that it is almost never actually necessary to convert a character to an integer code point value. You can do both equality and inequality comparisons in the new character representation without converting to code points:

julia> isdigit(c::Char) = '0' <= c <= '9'
isdigit (generic function with 1 method)

julia> isdigit('5')
true

julia> @code_native isdigit('5')
	.section	__TEXT,__text,regular,pure_instructions
; Function isdigit {
; Location: REPL[1]:1
	cmpl	$805306368, %edi        ## imm = 0x30000000
	jae	L11
	xorl	%eax, %eax
	retq
; Function <=; {
; Location: operators.jl:273
; Function |; {
; Location: bool.jl:43
L11:
	cmpl	$956301313, %edi        ## imm = 0x39000001
	setb	%al
;}}
	retq
	nopw	%cs:(%rax,%rax)
;}

That’s the exact same amount of work that doing the same character comparisons took in previous Julia versions: the machine code is literally identical, with slightly different constant values. If you’re computing string predicates by converting to code point values, stop doing that. Use character comparisons instead.
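
A quick way to convince yourself that comparisons need no decoding is to check that ordering on the raw bits agrees with ordering on code points. This is only a sketch over a handful of sample characters, assuming Julia 0.7 or later; the helper names `raw` and `isasciiletter` are made up for illustration.

```julia
raw(c::Char) = reinterpret(UInt32, c)   # illustrative helper, not a Base function

samples = ['0', '9', 'A', 'z', 'é', '€', '😀']
for a in samples, b in samples
    # Comparing raw bits agrees with comparing decoded code points.
    @assert (raw(a) < raw(b)) == (UInt32(a) < UInt32(b))
    @assert (a < b) == (UInt32(a) < UInt32(b))
end

# A predicate written purely with character comparisons (hypothetical name).
isasciiletter(c::Char) = ('a' <= c <= 'z') || ('A' <= c <= 'Z')
```

This works because UTF-8 byte order matches code point order, and left-aligning the bytes with zero padding keeps that order intact.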

Converting a Char value to an integer code point value is more work than it used to be. But when processing String values, you had to do that work on 0.6 anyway – you just did it before you could get a Char value in the first place. Even if you are iterating a String and then converting each Char to a code point (which you almost certainly shouldn’t be doing), you are still doing no more work in all than you were on 0.6. In fact the total time to do this operation has gotten faster on 0.7:

words = readlines("/usr/share/dict/words");

function sum_code_points(words::Vector{String})
    t = 0
    for word in words, c in word
        t += Int(c)
    end
    return t
end

# 0.6.1
julia> @time sum_code_points(words)
  0.010826 seconds (5 allocations: 176 bytes)
242397669

# 0.7-DEV
julia> @time sum_code_points(words)
  0.007645 seconds (129 allocations: 9.912 KiB)
242397669

The only situation where any of this is a problem is when you’re working with non-UTF-8 string encodings. That is an increasingly uncommon situation, and we can introduce an AbstractChar type so that you can represent characters however you want to and avoid converting to the new representation if it’s a bottleneck. But as I’ve already pointed out elsewhere, when implementing fast operations on a particular string type, you generally don’t do it in terms of characters; you manually specialize the operation for a specific encoding (as you yourself have done in many places).
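
As a rough illustration of specializing for the encoding rather than going through characters, here is a sketch that counts ASCII digits by scanning the UTF-8 code units of a `String` directly. The function name is hypothetical, and it assumes Julia 0.7 or later for `codeunits`.

```julia
# Count ASCII digits by looking at UTF-8 code units, never building a Char.
# This is safe for UTF-8 because bytes below 0x80 never occur inside a
# multi-byte sequence.
function count_ascii_digits(s::String)
    n = 0
    for b in codeunits(s)
        n += (UInt8('0') <= b <= UInt8('9'))
    end
    return n
end

println(count_ascii_digits("a1b22€3"))   # prints: 4
```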

And that’s a total fallacy.
We use lots of table lookups, masking operations, etc.

Only if you were using String type, which we never used, because the performance is so much worse for doing a lot of string processing than using Latin1, UCS2, and UTF32 (like Python does!).

Perhaps this is the crux of where you differ from the rest of the community: ultimately we have made a decision to make the primary string type in Julia be UTF-8, and that performance should be optimised for this use case. Despite what it may appear, this was not a unilateral decision by Stefan; he just has a stronger stomach for arguing its merits.

And that was a very unfortunate decision - leading to unnecessarily poor performance, which is rather surprising in a language that is trying to make a name for itself for having great performance.

Within a week, I should have a first version of my string package ready, which I think will convince people that the current string architecture of Julia is not the way to go in the future.

Lookup tables and masking operations are still entirely possible and only slightly more complicated.
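
For example, a table lookup keyed on the new representation might look roughly like this. It is a sketch assuming Julia 0.7 or later; `VOWEL_TABLE` and `isvowel` are illustrative names, and the table only covers ASCII.

```julia
# 128-entry table indexed by ASCII byte value + 1 (names are illustrative).
const VOWEL_TABLE = let t = falses(128)
    for c in "aeiouAEIOU"
        t[Int(c) + 1] = true
    end
    t
end

function isvowel(c::Char)
    u = reinterpret(UInt32, c)
    # ASCII: top byte below 0x80 and the three lower bytes all zero.
    (u & 0x80ffffff) == 0 || return false
    return VOWEL_TABLE[Int(u >> 24) + 1]
end
```

The only extra step compared with indexing by code point is one mask and one shift.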

Only if you were using String type, which we never used, because the performance is so much worse for doing a lot of string processing than using Latin1, UCS2, and UTF32 (like Python does!).

The main problem here seems to be your insistence on not using Julia’s default String type. Your obsession with antiquated string encodings is almost as well known as your 30+ years of industry experience. In the real world today, however, people use UTF-8 everywhere. 90% of the web is UTF-8. (And growing. UCS-2 and UTF-32 are not even a significant part of the other 10%.) All UNIX operating systems use UTF-8. Windows is UTF-8 these days aside from legacy APIs. UTF-8 won. Insisting on using other encodings is going to cause you some pain.

The way Python represents strings internally is designed around one requirement: O(1) character indexing. This was necessary because Python has a huge legacy code base that assumes it. Otherwise, this design is not ideal for performance: you potentially need to transcode every string that passes through the language twice – once on the way in and then again on the way out. Even if you don’t have to actually transcode strings, you still have to at least look at every byte to decide if you have to transcode it or not – which is already a performance problem. I don’t mean to speak ill of Python’s string design – because they had very valid reasons to do it that way – but it is not the best example of modern, high-performance string handling. For that, you’ll want to look at Go, Rust or Swift, all of which are heavily UTF-8 by default.
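
The cost described here can be sketched as follows: to pick the narrowest fixed-width unit, every character has to be inspected once before anything is stored. This is a loose illustration of the Latin1/UCS2/UTF32 selection discussed in this thread, not Python's actual implementation; `narrowest_units` is a made-up name, and the code assumes valid input (`UInt32(c)` throws on malformed characters).

```julia
function narrowest_units(s::AbstractString)
    maxcp = UInt32(0)
    for c in s                       # first pass: every character must be inspected
        maxcp = max(maxcp, UInt32(c))
    end
    T = maxcp <= 0xff ? UInt8 : maxcp <= 0xffff ? UInt16 : UInt32
    return T[UInt32(c) for c in s]   # second pass: transcode into fixed-width units
end

println(eltype(narrowest_units("abc")))   # prints: UInt8
println(eltype(narrowest_units("a€")))    # prints: UInt16
println(eltype(narrowest_units("a😀")))   # prints: UInt32
```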
