Case-insensitive prefix/suffix stripping

What’s not fun is that it’s kind of broken… I would want chopsuffix and chopprefix to work according to my definition, i.e. to be able to do e.g.:

chopprefix(s, "http://", "https://")  # here meaning drop/rename http to https

chopprefix(s, ["http://", "https://"])  # in case you want to just drop either

Neither works today, but both could be added without a breaking change. My issue is that a straightforward extension is problematic.

As is, I can drop “http”, but it will NOT drop “HTTP”; I checked, and neither will Python. But you can expect either in real input. Things get worse.

chopprefix(s, "http://", "https://")
chopprefix(s, "HTTP://", "HTTPS://")

But this would miss some cases such as “hTtP”. I.e. there are exponentially many cases, here 16 to check for if chop functions aren’t made case-insensitive.
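
To make the count concrete, here is a small sketch that enumerates the case variants of “http” and checks how many of them the current, case-sensitive chopprefix strips. The helper name case_variants is made up for illustration, and the sketch assumes Julia 1.8 or later, where chopprefix exists.

```julia
# Hypothetical helper (not in Base): all upper/lower-case spellings of an ASCII word.
function case_variants(word::AbstractString)
    chars = collect(word)
    n = length(chars)
    # Bit k-1 of `mask` decides whether character k is upper- or lower-case.
    return [join(isodd(mask >> (k - 1)) ? uppercase(c) : lowercase(c)
                 for (k, c) in enumerate(chars))
            for mask in 0:(2^n - 1)]
end

variants = case_variants("http")
n_variants = length(variants)   # 2^4 spellings for a 4-letter word

# Count the spellings that the case-sensitive chopprefix actually strips.
n_stripped = count(v -> chopprefix(v * "://x", "http://") == "x", variants)
```

Only the exact spelling “http” is stripped, so n_stripped is 1 out of the 16 variants.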

That, and under the status quo, path[begin:end-3] (and the chop functions) would be a code smell. Because it’s a very non-trivial example, I believe it should be in the standard library, with my definition.

If endswith were case-insensitive, it would match “.JL” but would then not strip it, unless / were also case-insensitive, so would you want that? And since it’s an operator you can’t have a keyword argument. A keyword should instead be added to the chop functions to get the old case-sensitive behavior.

On Linux/Unix the ending would most often be .jl (because people are used to the file system preserving case), but you can actually have a .JL file ending there too, even two different files with only the .jl vs. .JL distinction… When you run either, Julia doesn’t care. The shebang controls that Julia is invoked, if not done directly.

That’s unlike Windows, where you could not have both files, only one of them. Most likely you would do that on Linux, but you must support .JL too, because on Windows the file ending actually controls what program is run, and both work there. That’s mainly why I would want chopprefix to be case-insensitive.

One “problem”, if we change the definition to be wider, is that while that catches more endings and prefixes (good IMHO, so it should be done soon, i.e. before the 1.11 code freeze), it would not do so on 1.10, which likely becomes the next LTS. So you have a documented inconsistency. It’s not the end of the world, programs would work as they currently do, imperfectly… and while I would say it could be backported, I know that’s not wanted…

Another thing, maybe arguably chop should also work across Unicode normal forms. [And maybe do a strip first, if you have a file name and it ends with a space for some reason… this is likely to be most controversial, so maybe not, or at least not be default?]

[One objection I could see is Julia strings are not just for text, they also work for binary, arbitrary illegal UTF-8, I’m just unsure the chop* or startswith functions would be used then. Some files start with a magic cookie “string”, such as:

SQLite format 3

Then you want to match only it exactly, not sqlite… or SQLITE (though it might not be the end of the world?). I just think people wouldn’t use chopprefix (for such file content), or even startswith, and, for now at least, I’m not suggesting changing the latter, just that arguably it should(?) be consistent.]

For that sort of thing you could use a regex: replace(s, r"^https?://"i => "") instead of chopprefix, for example.

Or chopprefix("hTtPs://abc", r"https?://"i).
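
Putting the two regex suggestions side by side, as a short sketch: the example strings are made up, and the Regex method of chopprefix assumes Julia 1.8 or later. The last line shows one way the “rename http to https” case from the first post could be written with replace.

```julia
s = "hTtPs://abc"

# Drop either scheme, ignoring case (the `i` flag on the regex).
dropped_replace = replace(s, r"^https?://"i => "")
dropped_chop = chopprefix(s, r"https?://"i)   # matched at the start of `s` only

# "Rename" http to https; strings without the prefix are left untouched.
upgraded = replace("HTTP://example.org", r"^http://"i => "https://")
unchanged = replace("ftp://example.org", r"^http://"i => "https://")
```

Note that replace returns a new String, while chopprefix returns a SubString that shares memory with s.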

I know that’s an alternative (I’m also not sure your example handles Unicode normal forms). I just know that most will not use it: it’s complex, people may be scared of (full complete/correct alternative) regexes (and a full PCRE regex engine is overkill, I want it gone eventually… as a stdlib, Base at least shouldn’t/doesn’t require it), and it doesn’t mean that the chop functions can’t be changed. You think people would rather want case-sensitive for those? I think most would be ignorant, not knowing that case-sensitive might be preferred, e.g. for file endings, and those who know and don’t want it could use a keyword argument for the old behavior (well, except that would then make code that no longer works on 1.10…).

If your argument is that the status quo is fast, then with the change it can also be O(n), basically O(1) because most prefixes and suffixes are small. You could also first check for an exact match on the fast path. But ruling out all the combinations could start with just the first letter, case-insensitively (i.e. 2 cases… presumably, 4 I think max with Unicode normalization), etc., most likely rejecting quickly.*

* What I’m NOT proposing, at least for a default, is insensitivity to e.g. the Letterlike Symbols (Wikipedia):

K  Kelvin sign 212A
Å  Ångström sign 212B
ℬ  Script capital B 212C
ℭ  Black-letter capital C 212D
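
The fast-path idea above could look roughly like the following sketch. chopprefix_ci is a hypothetical name, not something in Base, and “case-insensitive” is assumed here to mean comparing lowercase(c) character by character. That is neither full Unicode case folding nor normalization-aware, so it is only meant to show that the work stays proportional to the prefix length.

```julia
# Hypothetical helper, not part of Base: case-insensitive chopprefix sketch.
function chopprefix_ci(s::AbstractString, prefix::AbstractString)
    # Fast path: an exact match needs no per-character case conversion.
    startswith(s, prefix) && return chopprefix(s, prefix)
    i = firstindex(s)
    for p in prefix
        # Reject as soon as `s` runs out or a single character differs.
        i > lastindex(s) && return SubString(s)
        lowercase(s[i]) == lowercase(p) || return SubString(s)
        i = nextind(s, i)
    end
    # Every prefix character matched; keep the remainder of `s`.
    return SubString(s, i)
end

rest = chopprefix_ci("hTtP://abc", "http://")
```

A mismatch on the first character returns immediately, which is the “rejecting quickly” case; only a full match walks the whole prefix.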

Changing chopsuffix to be case-insensitive by default would be a breaking change, and so is not going to happen. It would also be inconsistent with all other string comparison functions (==, endswith, replace, …), which are case sensitive by default. (Nor do similar functions in any other mainstream language default to case-insensitivity, as far as I know.)

It would be possible to add an ignorecase=true keyword option to endswith, chopsuffix, etcetera. But for case-insensitive file-extension stripping and comparison, it might be more useful to just call lowercase on the result of splitext. (We could also add a lowercase=true option to splitext to convert the extension to lowercase before returning it. There’s a reasonable argument for making this the default, but again this would be a breaking change so it’s not feasible in Julia 1.x.)
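
As a rough sketch of the splitext approach (has_ext and strip_jl are hypothetical helper names, and the paths are made up):

```julia
# Hypothetical helper: compare a path's extension case-insensitively.
has_ext(path::AbstractString, ext::AbstractString) =
    lowercase(last(splitext(path))) == lowercase(ext)

# Hypothetical helper: strip ".jl" in any letter case; other paths are returned unchanged.
function strip_jl(path::AbstractString)
    root, ext = splitext(path)
    return lowercase(ext) == ".jl" ? root : path
end

stripped = strip_jl("script.JL")
```

Only the extension is lowercased for the comparison; the rest of the path keeps its original case.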

That is an argument (though I feel status quo more “buggy” than code with the change might be). I’m thinking we can get around that.

A problem with that is that if you opt into it the code will not work on 1.10. With it implied it would (sort of).

What you could do is NOT change the default unless you opt into that with:

using Julia2

I did not know of that. It’s good to know of that function if you just want to split off any extension; if it’s only for e.g. .jl, then it’s more complex than people will likely opt into. It’s also no help for a prefix like HTTP.

[Another way, is a new string type is on my TODO list… and then all functions would follow its default, and it will preserve case of course, but will support case-insensitive comparison. We want both at the string level, and I would argue it should be the default (so e.g. in Dict’s), it can at least be in MY string type, but I want it adopted later into Julia. The plan is to also support the European ordering rules and more. ASCIIbetical/UTF-8 ordering comparison considered harmful…]

I, for one, find that this discussion is entirely too focused on chopprefix/chopsuffix being used for the specific case of filenames and trying to be too clever. I hope these functions stay the way they are: simple to reason about and fast. The cost of dealing with edge cases should be in the hands of the programmer, or in an external package that gives a case-insensitive string (or another solution).

The status quo should certainly be available if you want that. But when do you want it? And when is the preferred default?

I also had an example for a prefix (not a file ending), http. And as in most editors, searching should generally be case-INsensitive; it’s only that the other option is slightly simpler, which is why it was done historically (and it can be argued for when searching long strings, as opposed to small prefixes or suffixes).

I disagree (how difficult is “allows upper and lower case” to reason about? And more generally, treat letters that look and mean the same in the same way). You assume there’s some runtime cost (also, those functions are rarely speed-critical); I care more about the mental cost and the lost opportunity: the examples of ways to strip a file ending were ALL problematic. If we want to support that (e.g. for Windows), well, the default should be sane for that and easy to get, not needing a package, at worst a keyword argument.

Note that e.g. SQL is not always case-sensitive, e.g. in SQL Server; it’s a different culture, since there is case-insensitivity (for filenames) there.

Many languages do not even have letter case, so all Arabic etc. speakers must learn it and get used to it, and it’s not at all clear they will think of it much; they will more likely just write non-ideal code, as even English speakers do…

On macOS Unicode normalization is different in the file-system from all others, so if I were to make up my own ending (.Páll, I would not, since it’s a trap), the code would likely not be portable. It’s best if the stdlib hides platform differences, it’s one of its jobs. If the functions do the same (non-ideal) thing on all platforms, they might as well not be in the stdlib. I’m more surprised Python and others didn’t choose a good default.

Knowing that a case-insensitive search for "string" is spelled r"string"i is just one of those things that people need to learn. Case isn’t a property of strings, it’s a character property, and invisibly conflating these things is not great.

The generalized rules around case-folding and titlecasing are rather complex, and doing it properly is locale sensitive. Unicode provides one standard for it, which PCRE follows, but that’s all it is, one standard. The Turkish language has another one, and it isn’t the only example, just the easiest to bring to mind.

Julia strings aren’t even guaranteed to be encoded correctly, they’re just a byte array which is hopefully UTF-8. Operations on strings should respect that, and be explicit about what they’re doing. Silently introducing case-insensitivity to string matching is the opposite of that.

Yes, one of those (which is why it’s bad to have to learn multiple things for a “simple” task). I thought of making a doc PR fix (maybe the least bad/simplest option) for (just the most general example; does anyone know how to make a PR?):

and such does not work for the example from Precomposed character (Wikipedia):

common Swedish surname Åström written in the two alternative methods […]

  1. Åström (U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D)
  2. Åström (U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D)

[…] especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for “dog”):

  1. ḱṷṓn (U+1E31 U+1E77 U+1E53 U+006E)
  2. ḱṷṓn (U+006B U+0301 U+0075 U+032D U+006F U+0304 U+0301 U+006E)

[One more problem is a potential byte-order mark at the beginning of strings, or at least files… Arguably it could be checked for and always stripped.]

I agree that it’s under-documented, but Unicode.normalize should be used on any string which might have composed characters in it, that’s the only hope of getting a consistently-correct result. Unicode is inherently difficult to work with, no avoiding that.
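
A minimal sketch of that advice, using the two Åström spellings quoted above written out as escapes (the variable names are illustrative only):

```julia
using Unicode

precomposed = "\u00c5str\u00f6m"     # Å and ö as single code points
decomposed  = "A\u030astro\u0308m"   # base letters followed by combining marks

# The raw strings differ code point by code point, so == is false.
same_raw = precomposed == decomposed

# After normalizing both to the same form (NFC here), they compare equal.
same_nfc = Unicode.normalize(precomposed, :NFC) == Unicode.normalize(decomposed, :NFC)

# A non-ASCII suffix test only behaves consistently on normalized input.
suffix_raw = endswith(decomposed, "\u00f6m")
suffix_nfc = endswith(Unicode.normalize(decomposed, :NFC), "\u00f6m")
```

Which normal form to pick (NFC vs. NFD) matters less than using the same one for both the text and the pattern.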

This code works fine for chopping off http://, even if the subsequent string contains composed characters. (There are no precomposed characters based on /, so that separator is normalization-independent.) You don’t need to worry about Unicode normalization so much if you are only searching for ASCII substrings in typical data parsing tasks (URL splitting, filename extensions, JSON parsing, normal XML parsing with ASCII tags, etcetera).
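
To illustrate the point about ASCII prefixes, a small sketch with a made-up URL: the prefix is stripped in either normal form, and the remainders differ only by normalization. It assumes Julia 1.8 or later for chopprefix.

```julia
using Unicode

nfc_url = "http://\u00c5str\u00f6m.example"     # precomposed Å and ö
nfd_url = Unicode.normalize(nfc_url, :NFD)     # same text, decomposed

# The ASCII prefix is unaffected by normalization, so both calls strip it.
rest_nfc = chopprefix(nfc_url, "http://")
rest_nfd = chopprefix(nfd_url, "http://")

# The remainders are different code-point sequences for the same text.
rests_equal_raw = rest_nfc == rest_nfd
rests_equal_nfc = Unicode.normalize(String(rest_nfd), :NFC) == rest_nfc
```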

I agree that it’s a good idea to have some awareness of Unicode normalization if you are working with non-ASCII strings!

But normalizing everything that might have composed characters could be overkill. A lot of code working with Unicode strings deals with it as a sequence of opaque data blobs (perhaps separated by ASCII as in XML text) — a classic example of this is filenames (where you might look at an ASCII suffix, or split at an ASCII path separator, but otherwise normally treat path components as opaque user-specified blobs). In fact, on Linux ext4 filesystems no encoding is specified — the filename is just a string of bytes — so normalizing a filename could change it to refer to a different file.

Agreed, I elided “when you’re searching for characters which might be present in one of several forms” there, assumed it in context. I’m glad Julia doesn’t prevalidate Strings, let alone pre-normalize them. I don’t like paying tax when I don’t get services.

Not done, since it was already possible with a regex all along (I didn’t notice/think of it at first), and adding a keyword, while possible, is worse in one way: for code that has to work on 1.10 (LTS).

My PR, first meant for documenting only, then adding 2-3 optimizations, seems done (I can’t figure out why there is a docstring error, it seems to be legal syntax, see the PR);

[I believe I’ve addressed all objections on the PR, including those from @stevengj, after posting here. I’m still confused about the invalid escaping in line 292, if someone can clarify that. False alarm, the code seems to work, so why is it objecting to a doc comment?

It can be a bit of a drag to read CI, e.g. for this PR, for non-x86:

>Creating usr/etc/julia/startup.jl
[..]
>/cache/build/builder-armageddon-1/julialang/julia-master/src/codegen.cpp:9974:109: note: #pragma message: JIT profiling support (JL_USE_*_JITEVENTS) not yet available on platforms that use JITLink
 9974 | #pragma message("JIT profiling support (JL_USE_*_JITEVENTS) not yet available on platforms that use JITLink")
      |                                                                                                             ^
In file included from /cache/build/builder-armageddon-1/julialang/julia-master/src/codegen.cpp:2404:
/cache/build/builder-armageddon-1/julialang/julia-master/src/cgutils.cpp: In function 'llvm::StructType* get_memoryref_type(llvm::LLVMContext&, llvm::Type*, const jl_datatype_layout_t*, unsigned int)':
/cache/build/builder-armageddon-1/julialang/julia-master/src/cgutils.cpp:718:20: note: parameter passing for argument of type 'const jl_datatype_layout_t::<unnamed struct>' changed in GCC 9.1
  718 | static StructType *get_memoryref_type(LLVMContext &ctxt, Type *T_size, const jl_datatype_layout_t *layout, unsigned AS)
      |                    ^~~~~~~~~~~~~~~~~~

It is failing on that platform and is unrelated, or a false alarm, so it doesn’t give (newer) developers good confidence that other errors are real.]