Please provide a link.
As I’ve said before, I haven’t pushed it yet. I only started writing it on Dec. 15th (after I saw how #24999 adversely affected our code, which I was just starting to test on master, having only recently [this month!] moved from v0.5 to v0.6.1 for our deployments).
It will be at https://www.github.com/JuliaString/Strs.jl. I may be able to push a very WIP version tonight, if I can finish figuring out how to deal with a recent deprecation that causes an infinite loop on line #1590 of deprecate.jl in Julia.
I am also trying to enjoy the Xmas break with my family!
We should take care to distinguish a function and its methods. I’m talking about the functions, which are defined more or less by their help strings. If you’re saying that the methods as-implemented don’t satisfy the function’s documented semantics, then that’s a separate issue that doesn’t need to side-track this thread (but could be a new thread if you want).
Again, I think that these functions’ location is consistent with their documented semantics. If you think the meaning of these functions should be different, please propose a different help string.
I thought I was being very clear, I am talking about the generic functions.
To me, it is clear that Julia should not have separate functions in separate namespaces, but rather should have methods that act specifically on AbstractStrings, based on whatever their character set is and on whether they are designed to use a run-time locale specification. (Thinking about it more last night, I think that if you want to pick up locale-specific mappings, there should be a wrapper for strings, just as SubString is a wrapper, that indicates a particular locale, or whether a settable default locale should be used.) That way, things like String, ASCIIStr or LatinStr, etc. can continue to produce code that in-lines all or part of some of the tests, without having to look up the current locale every time, as the C and C++ libraries (and most other languages) have to do.
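The wrapper idea can be sketched roughly as follows. This is only an illustration under stated assumptions: LocaleStr and upper are hypothetical names invented here (they are not from Base, Strs.jl, or any package), and the only locale rule shown is the Turkish dotted i.

```julia
# Hypothetical sketch: LocaleStr and upper are illustrative names only.
# A wrapper that tags a string with a locale, in the same spirit as SubString.
struct LocaleStr{S<:AbstractString}
    str::S
    locale::Symbol   # e.g. :tr for Turkish
end

# Plain strings take the locale-independent path, with no locale lookup at all.
upper(s::AbstractString) = uppercase(s)

# Wrapped strings pay for the locale check only when the wrapper is used.
function upper(s::LocaleStr)
    if s.locale === :tr
        # Turkish: dotted 'i' uppercases to 'İ' (U+0130) instead of 'I'.
        return uppercase(replace(s.str, 'i' => '\u0130'))
    end
    return uppercase(s.str)
end

plain   = upper("istanbul")
turkish = upper(LocaleStr("istanbul", :tr))
```

Dispatch picks the method from the argument type, so code working on unwrapped strings never touches the locale branch.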
Also, those functions come from the ANSI/ISO C, C++, and Posix standards (and before that, from K&R C, in the 70’s), not from Unicode.
Again, my point has been that having separate non-generic functions in different namespaces, instead of generic functions, extended with methods when loading different packages, is not Julian at all.
(Seems to me like people have been spending too much time programming in C++, instead of Julia, when I see things like this!)
I am pained to see this sort of statement. I realize that you disagree with some actions the stewards and moderators took or did not take, and that you feel strongly about this. But please consider that in this way you are in fact abandoning the community as a whole. The stewards and moderators are not the community, we are!
I wish you would re-engage and that you would constructively seek opportunities to make the community interactions better. I believe that goal is very much worth it: we all see something worthwhile in Julia, and if we can keep the community together, we are advancing a common good.
Best wishes into the New Year.
P
Normal vectors have a huge amount of overhead (40 bytes on 64-bit machines, vs. 16 bytes for StringVector).
That doesn’t work well for good string performance either.
From looking at @code_native output, using a plain Vector{UInt8} instead of a StringVector doesn’t seem to change that extra indirection.
Why would they be encoded as Unicode strings, or encoded as UTF-16 somehow?
In any case, it’s not a good idea to try to do case-insensitive comparisons by simply lowercasing (or uppercasing) and comparing (besides the poor performance); please see the W3C and Unicode Org recommendations on how that should be handled.
I’d love to be able to sit down with a number of the Julia core team, and go over all the technical issues, and answer any questions people have about the not so simple area of dealing with text, instead of all this back and forth, quite frankly.
Dealing with linear algebra is not so simple, and dealing with text is not very simple either.
Yes, we all know about Unicode case-folding (supported by Unicode.normalize_string, though it would be nice to have an allocation-free case-folded string comparison function too at some point).
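To make the difference between lowercasing and case-folding concrete, here is a minimal sketch. It assumes Julia 1.x, where the function is spelled Unicode.normalize; caseless_equal is a hypothetical helper name, and it allocates two new strings, so it is not the allocation-free comparison wished for above.

```julia
using Unicode

# Hypothetical helper: compare two strings after Unicode case folding.
# Both folded copies are allocated, so this is the simple version, not the fast one.
caseless_equal(a::AbstractString, b::AbstractString) =
    Unicode.normalize(a, casefold=true) == Unicode.normalize(b, casefold=true)

a = "stra\u00dfe"   # "straße", with U+00DF LATIN SMALL LETTER SHARP S
b = "STRASSE"

naive  = lowercase(a) == lowercase(b)   # 'ß' has no one-to-one lowercase match for "SS"
folded = caseless_equal(a, b)           # case folding maps 'ß' to "ss"
```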
Again, please don’t assume that people who disagree with your design preferences don’t care about text processing or do so out of a lack of understanding of the tradeoffs of your proposals. That’s not a good starting point for discussion.
I don’t think using Unicode is a big imposition for someone who wants to use a function like lowercase. I agree that there is a tradeoff here, and my personal inclination would have been to keep lowercase in Base, but I can see the arguments in both directions and I don’t think the decision is likely to change. Other major languages have also put this functionality into a package or require an include statement without causing major problems. How a function is named and which module it goes into are basically bikeshedding, anyway.
At some point, you need to take “no” for an answer.
We mostly agreed over a month ago that an AbstractChar type was desirable to have at some point, but it’s a little late for 1.0 at this point.
Locale-dependent mapping is already available in the UnicodeExtras module and has been for some time, so it’s not as if any changes are required to enable this functionality. (It would be great to have a pure-Julia version of this functionality too, but re-implementing ICU is a huge amount of work and no one has been willing to take it on yet.)
UnicodeExtras’s last commit is from 18 Apr 2014 and nalimilan’s fork has last commit (“Fix deprecation warnings on Julia 0.4”) from 13 Oct 2015…
I am afraid that I don’t understand something.
And sorry, I am still a newbie! If I would like to do something like
encode("Ålborg", "iso-8859-1")
do I need to install a package that is not in the JuliaLang organization, or code this functionality myself?
Things like locales and iso-8859-1 will probably always be separate packages (i.e. not in Base). Whether those packages are in the “JuliaLang” organization is kind of a secondary point — there is no special advantage to a package being in a particular github organization except in terms of who has commit access to it.
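For the specific case of ISO-8859-1, coding it by hand is short, because the first 256 Unicode code points coincide with ISO-8859-1. A minimal sketch follows, assuming Julia 1.x; encode_latin1 is a hypothetical helper name, not a function from Base or from any encoding package, and other encodings would need real mapping tables.

```julia
# Hypothetical helper: encode a string as ISO-8859-1 bytes.
# Works only because code points 0x00-0xff are identical in Unicode and ISO-8859-1.
function encode_latin1(s::AbstractString)
    out = Vector{UInt8}(undef, length(s))
    for (i, c) in enumerate(s)
        cp = UInt32(c)
        cp <= 0xff || throw(ArgumentError("character $(repr(c)) is not representable in ISO-8859-1"))
        out[i] = cp % UInt8
    end
    return out
end

# "Ålborg", written with an escape so the precomposed U+00C5 is guaranteed.
bytes = encode_latin1("\u00c5lborg")
```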
Yes, the ICU-based packages have not been updated in ages and will need work to get up and running on 1.0 … there hasn’t been anyone who needs this functionality enough to take over maintenance of that package. But my point was they need no new functionality in the Julia language or standard library.
That is true but I think we can leave the performance issue aside now. It is possible to implement string types in libraries that have the same performance as all pre-0.6 strings. Plus Jameson and I have both declared our intent to make the representation used by String generally available, so please just wait for that.
This is explicitly required by the API for Win32 environment variables. Of course it’s wrong. But it’s a bit late to change…
The 16 byte StringVector overhead is in addition to the normal Vector overhead, not instead of.
I have been waiting since the idea of “memory buffers” to allow implementing arrays and strings totally in Julia was first discussed, at least a year and a half or two years ago.
We have customers lined up to use our product, and while they are doing testing deployed on Julia v0.6.1, they will want v1.0 after it comes out, so I can’t really wait any longer, we need to be able to do performant handling of strings and characters.
The latest changes to Char have been a real performance killer, at least to our product, so without the ability to bypass those changes completely, we would be stuck on v0.6.2.
Really? I had thought from @jeff.bezanson’s comments on the PR that it was just the pointer to the type, the length, with the bytes following, and a \0 at the end, rounded up to some larger size.
If it really is extra, then I’ll see if just a plain Vector{UInt8} would be faster / take less space.
From testing just now, it looks like a single 0 to 15 byte string (random characters from 0x0:0x7f) stored in a Vector{String} takes 96 bytes per string, which is a lot more overhead than I expected.
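One possible way to get a rough per-string figure is sketched below. It is not necessarily how the number above was obtained. It assumes Julia 1.x and uses the unexported Base.summarysize; the result varies by Julia version and platform, reflects the sizes summarysize counts rather than allocator rounding, and identical short strings may be counted only once.

```julia
# Rough measurement sketch: build many short random ASCII strings
# and divide the total reported size by the number of strings.
n = 10_000
strs = [String(rand(UInt8(0):UInt8(0x7f), rand(0:15))) for _ in 1:n]

# Base.summarysize is unexported; it walks the container and everything reachable from it.
total_bytes = Base.summarysize(strs)
per_string  = total_bytes / n
println("approx. bytes per string: ", per_string)
```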
Not sure what that has to do with anything, except that with what I’m doing, you’ll have a fast UCS2Str and UTF16Str, and can use either of them directly with the Win32 APIs, and conversions from UniStr to UTF16Str will be very fast: a no-op in the case of a UniStr that is stored as _UCS2Str, a simple widening operation (which is trivial to optimize with SIMD instructions) if stored as ASCIIStr or _LatinStr, and still much faster than a UTF-8 to UTF-16 conversion if stored as _UTF32Str.
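The widening case can be sketched as below. This is only an illustration, not the Strs.jl implementation: widen_latin1 is a hypothetical name, and the input is assumed to be raw ASCII or Latin-1 bytes, where every byte is already its own code point.

```julia
# Hypothetical helper: widen ASCII/Latin-1 bytes to UTF-16 code units.
# Each byte maps to exactly one 16-bit unit with the same numeric value,
# so the loop is a plain zero-extension that a compiler can vectorize.
function widen_latin1(bytes::Vector{UInt8})
    out = Vector{UInt16}(undef, length(bytes))
    @inbounds for i in eachindex(bytes)
        out[i] = bytes[i]
    end
    return out
end

units = widen_latin1(UInt8[0x41, 0xe9])   # 'A' and 'é' as Latin-1 bytes
back  = transcode(String, units)          # decode the UTF-16 units to a String
```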
I’ll show the numbers shortly.
All those types may make it seem complicated, but users will only have to worry about one type, UniStr, unless they specifically need to convert to/from UTF-8 or UTF-16, and they also won’t have to worry about non-direct indexing of characters.
Yes, but we nevertheless still intend to get to it eventually. It is not possible to give everybody everything they want, when they want it. I don’t know what else to say.
Let me clarify that StringVector allocates a Vector whose storage is backed by a String. So it allocates two objects: a String, plus a Vector wrapper pointing to the String. The String can then be extracted with no allocation.
A String is indeed just a type tag, length, bytes, and \0. String is actually very close to a memory buffer type; it is worth experimenting with using it instead of Vector{UInt8} to try to shave off some space overhead. My PR https://github.com/JuliaLang/julia/pull/25241 also provides a StringBytes type that exposes the contents of a String as an immutable UInt8 vector, making it even easier to use as a byte vector.
If you need mutation, well, of course officially you shouldn’t mutate a String, but it can be done with unsafe operations, and _string_n(n) in base/strings/string.jl is guaranteed to allocate a fresh String that will be “safe” to mutate.
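A small sketch of the StringVector pattern described above, assuming Julia 1.x. Base.StringVector is an unexported internal, so its availability and exact behavior are not guaranteed across versions; the sketch only fills the buffer and hands it over to String, and does not mutate an existing String.

```julia
# Allocate a byte vector whose storage is backed by a String (internal API).
v = Base.StringVector(5)

# Fill it like any other Vector{UInt8}.
copyto!(v, codeunits("hello"))

# String(v) takes ownership of the buffer; v should not be used afterwards.
s = String(v)
```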
Ah, will using _string_n(n) just give me a string that I can initialize as desired, without the overhead of a full Vector?
That is great information!
Thanks very much!!!
Edit: no, I haven’t implemented mutable strings in my package (yet). I only initialize them once, and am very careful always to allocate my own space when passed a Vector from outside, and to make a copy when returning a Vector of any sort (via convert, etc.), to make sure that strings don’t have a hole that lets them be mutated after they are created (that would totally wreak havoc with my optimizations based on knowing validity, and optionally caching hash values and/or other encodings, UTF-8 and/or UTF-16).
Great advice! Overhead dropped from 96 bytes (for up to 15 byte string), to 32 bytes, which is what I had been expecting.
Does the n take into account the trailing \0, or is that added to the allocation?
I noticed that for Base.StringVector, 15 didn’t cause extra allocation, but 16 did, and since I won’t be using trailing \0 bytes, I should be able to use all 16 allocated bytes.
There’s an asymmetry here: if we keep uppercase et al. in the Unicode stdlib package now and decide later that it belongs in Base, we can always add it. We cannot go in the other direction. Note that I started this thread thinking that these functions probably did belong in Base, but the points presented actually make the case that it should not be in Base since the behavior of Unicode.uppercase depends heavily on the Unicode standard. Certainly there may be other concepts of the English verb “uppercase” which one might want, but there is very clearly one, well-defined, precise meaning which should exist: uppercasing as defined in the Unicode standard. That meaning is what the Unicode.uppercase function is defined to implement.
The notion that because a function is generic it can do arbitrary things on different input types is a fundamental misunderstanding of what generic functions and programming are about. Generic functions are not a license to make subtle, semantically significant changes of behavior based on input type. Quite the opposite: in order for generic programming to work in a sane way, one must be strict about each generic function implementing exactly one meaning with coherent, consistent behavior across all different input types. Someone may feel that there is another, less rigid meaning of the term “uppercase” which they prefer. That’s fine, that is why we have namespaces and another package can define its own uppercase function.
Please try a thought experiment: You have a generic function, +. For numbers it gives one result if you add 0xff and 0x1, and the result is 0x00. If you add 255 and 1, you get 256, not 0.
Arithmetic on a Float16, Float64, BigFloat, Dec128 will all give results that are different.
Why do you understand that case as being generic, but that uppercase, which is simply a mapping (just as the + operation maps from 2 values to a third value) is not generic?
I’ve never heard of anybody using Julia complain about those cases with numbers.
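The numeric part of the thought experiment can be checked directly. This short sketch uses only Base; Dec128 is left out because it comes from a separate package.

```julia
# Same generic function, different result depending on the argument types.
a = 0xff + 0x01                      # UInt8 + UInt8 wraps around
b = 255 + 1                          # Int + Int does not wrap here

# The "same" sum computed at two floating-point precisions.
c = Float16(0.1) + Float16(0.2)
d = 0.1 + 0.2

println((a, b, c == d))
```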
Julia has great facilities with the type system, and multiple dispatch, to be able to use the old C function names in a very generic fashion, with the ability to have the results depend on 1) the character set of the string type, 2) the language/locale, and 3) possibly even a custom mapping table (via a keyword argument, or an argument with a specific type, maybe Locale or MappingTable).
It’s important to note that the encoding is totally irrelevant to all of those functions.
Also note that the Unicode standard only provides default mappings, but then goes on to say that locale/language specific mappings should really be used.
The Unicode standard is very explicit that things like uppercase transformations should be able to handle language specific issues such as the Turkish dotted and dotless i, and that “ß” should be uppercased to “SS” in German.
See:
Q: Is all of the Unicode case mapping information in UnicodeData.txt?
A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard.
and
A: The Unicode Standard defines the default case mapping for each individual character, with each character considered in isolation. This mapping does not provide for the context in which the character appears, nor for the language-specific rules that must be applied when working in natural language text.
What about using * and ^ on strings? That would seem to be a rather clear example of the above.
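For readers less familiar with Julia, this is the behavior being referred to: in Base, * and ^ mean multiplication and exponentiation for numbers, and concatenation and repetition for strings.

```julia
n = 2 * 3          # numeric multiplication
s = "ab" * "cd"    # string concatenation
p = 2 ^ 3          # numeric exponentiation
r = "ab" ^ 3       # string repetition

println((n, s, p, r))
```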
Please check it out. It’s still rather rough: a lot of code hasn’t been optimized, more needs to be made to use traits instead of having things hard coded, and many of my ideas for this I haven’t even started to implement (optional substrings, cached hash values, cached UTF-8 or UTF-16 or raw bytes).
https://github.com/JuliaString/Strs.jl
(I still need to make the logo I want for the JuliaString org: 3 concentric circles in the “Julia” colors, with ASCII/Latin1 text in the outer one, BMP (probably math, Japanese, maybe Hindi stuff) in the middle, and finally non-BMP (a few emojis) in the center.)
I’ll also make an announcement on the community channel. Remember, this is very WIP still; I only started it on the 15th!
Just relax – I wish everybody a
HAPPY NEW YEAR !!!