StefanKarpinski:
Like I said, we tried it @ScottPJones’s way for two years and it has truly awful ergonomics for real-world data processing – this is not hypothetical or just my opinion, the data ecosystem has been struggling with it badly.
Please stop totally misrepresenting things. Many people here may not be aware of the facts of the situation (for which I have ample evidence).
I didn’t change anything AT ALL about the way strings were handled before I started contributing to Julia back in April 2015, apart from fixing some of the many bugs and greatly improving the performance of conversions.
In v0.3.x, you had ASCIIString, UTF8String, UTF16String, and UTF32String.
See the following definition: https://github.com/JuliaLang/julia/blob/release-0.3/base/utf8.jl#L163, i.e.
convert(::Type{UTF8String}, a::Array{Uint8,1}) = is_valid_utf8(a) ? UTF8String(a) : error("invalid UTF-8 sequence")
The philosophy then was that if you converted something to a UTF8String, it was checked for validity.
I did not change that one bit.
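For anyone who never used v0.3, here is a minimal sketch of that check-on-convert behaviour in current Julia. It assumes Julia 1.x, where `isvalid(String, bytes)` does the validity check. `checked_utf8` is a hypothetical name used only for illustration; it is not a Base function and not the v0.3 code.

```julia
# Hypothetical helper (not part of Base): validate bytes at conversion time,
# in the spirit of the v0.3 `convert(::Type{UTF8String}, ...)` method quoted above.
# Assumes Julia 1.x.
function checked_utf8(bytes::Vector{UInt8})
    isvalid(String, bytes) || error("invalid UTF-8 sequence")
    return String(copy(bytes))  # copy so the caller's vector is left intact
end

checked_utf8(UInt8[0x68, 0x69])        # valid ASCII bytes are accepted

try
    checked_utf8(UInt8[0xff, 0x68])    # 0xff can never appear in UTF-8
catch err
    println("rejected: ", err)
end
```

The point is only that invalid input is reported at the conversion boundary, rather than surfacing later in some function that assumed valid data.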
I did fix bugs, such as #10919 (my very first Julia PR), and I also found a very serious problem in #10958, all within a few weeks of first seeing Julia.
@stevengj said at the time, about #10958:
Whether we should accept (and silently convert) modified UTF-8 to standard UTF-8 is a separate issue; I tend to agree, but let’s keep that out of this discussion. After reading the RFCs, I agree that we shouldn’t produce the overlong NUL encoding ourselves
which Jeff also agreed with.
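For context, "modified UTF-8" (as used by Java, for example) encodes U+0000 as the two bytes 0xC0 0x80 instead of a single 0x00 byte, and standard UTF-8 forbids such overlong forms. A rough sketch of detecting that sequence in a byte vector follows. It assumes Julia 1.x, and `has_overlong_nul` is a hypothetical helper, not something in Base or in the PRs mentioned here.

```julia
# Hypothetical helper: true if the overlong two-byte NUL (0xC0 0x80) used by
# "modified UTF-8" occurs anywhere in `bytes`. Assumes Julia 1.x.
has_overlong_nul(bytes::Vector{UInt8}) =
    any(i -> bytes[i] == 0xc0 && bytes[i + 1] == 0x80, 1:length(bytes)-1)

println(has_overlong_nul(UInt8[0x61, 0xc0, 0x80]))  # 'a' followed by overlong NUL
println(has_overlong_nul(UInt8[0x61, 0x00]))        # 'a' followed by a plain NUL byte
```

Whether to accept such input and silently convert it is the separate question raised in the quote; this only shows what the byte pattern looks like.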
Also: Steven brought up the following back then, which may still be a problem:
Some of the functions in utf8.c seem to assume valid UTF-8, which may not be produced e.g. by bytestring(ptr, len).
Other string-related things I fixed that were included in the v0.4 release:
JuliaLang:master
← ScottPJones:spj/moretests
opened 11:25PM - 15 Aug 15 UTC
The code had `x.data` instead of `x.string.data` in a few places.
Since there were no unit tests for those functions, the bugs had not been discovered previously.
JuliaLang:master
← ScottPJones:spj/u8reverse
opened 02:42PM - 16 Aug 15 UTC
`reverse` on a `UTF8String` used the C function `u8_reverse`, which I discovered in testing has several bugs.
1. It doesn't detect running off the end of the string when there is a char > 0x80
2. It picks up garbage bytes depending on the lead character
3. It is not portable to any machine that requires alignment.
I have rewritten it in Julia, and added tests that fully cover the function.
I wanted to remove `u8_reverse` from `src/support/utf8.c`, however that function is used by `flisp` for the `string.reverse` function, even though that function is apparently never used anywhere in any of the .scm code I have found in Base.
I wonder if the unused string functions in flisp, that are depending on broken C code, can simply be removed and save some space.
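To illustrate the first bug in that list, here is a rough sketch of a character-wise reverse over UTF-8 bytes that bounds-checks while scanning continuation bytes. This is not the code from the PR. `reverse_utf8` is a hypothetical name, and the sketch assumes Julia 1.x and input that does not start with a stray continuation byte.

```julia
# Hypothetical sketch (not the PR's implementation): reverse a UTF-8 byte
# vector one character at a time. Assumes Julia 1.x.
function reverse_utf8(bytes::Vector{UInt8})
    n = length(bytes)
    out = similar(bytes)
    i = 1      # start of the current character in `bytes`
    o = n      # last free position in `out`
    while i <= n
        j = i + 1
        # Skip continuation bytes (10xxxxxx), never reading past the end.
        while j <= n && (bytes[j] & 0xc0) == 0x80
            j += 1
        end
        len = j - i
        out[o-len+1:o] = bytes[i:j-1]   # copy the whole character as a unit
        o -= len
        i = j
    end
    return out
end

s = "aé"
println(String(reverse_utf8(Vector{UInt8}(s))) == reverse(s))
```

The `j <= n` test in the inner loop is what keeps a truncated multi-byte character at the end of the data from causing a read past the buffer.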
JuliaLang:master
← ScottPJones:spj/remu8reverse
opened 03:09PM - 21 Aug 15 UTC
The flisp `string.reverse` does not appear to be used anywhere (at least, not in JuliaLang/julia), and depends on the function `u8_reverse` that has the potential of access violations if there is invalid data at the end of a string.
Removing it will eliminate the problem, and save a small amount of space.
JuliaLang:master
← ScottPJones:spj/remstring
opened 11:01PM - 22 Aug 15 UTC
Removed from flisp interpreter: `string.width`, `string.encode`, `string.decode`, `string.split`,
`char.upcase`, `char.downcase`
Removed support functions in `utf8.c` no longer used after removal:
`u8_codingsize`, `u8_unescape`
JuliaLang:master
← ScottPJones:spj/deprecateindexreal
opened 07:11PM - 01 Sep 15 UTC
These functions depended on a version of `to_index`, which has been deprecated.
I tried to add tests for these methods, because they showed up as not being covered,
however I was told not to, because they give a deprecation warning.
This now gives a better error to the user, giving a work-around, and also giving the method that they called that doesn't work any longer, and eliminates the coverage holes in `strings/basic.jl` and `char.jl`
I also added a lot of unit tests (char and string functions had been very poorly covered previously):
JuliaLang:master
← ScottPJones:spj/testutf8
opened 12:51AM - 16 Aug 15 UTC
This should give almost 100% coverage of `unicode/utf8.jl`; there is still one line with `reverse` where the C code `u8_reverse` looks broken, which I'll need to address in a subsequent PR.
JuliaLang:master
← ScottPJones:spj/utf8sizeof
opened 03:09AM - 17 Aug 15 UTC
These functions were only used in packages, and not in Base.
I have already made a PR https://github.com/jakebolewski/JuliaParser.jl/pull/21 to remove utf8sizeof.
I will make another PR to fix MutableStrings, but it hasn't been updated in 2 years.
JuliaLang:master
← ScottPJones:spj/u32test
opened 04:34PM - 23 Aug 15 UTC
Added test for uncovered line in utf32.jl
Changed Uint to UInt in types.jl
Removed a duplicate convert method in utf16.jl
Add test to cover converting a long utf8 sequence in utf8.jl
Changed a method in checkstring.jl for better coverage
Added more tests for checkstring.jl, test accept_long_char option
JuliaLang:master
← ScottPJones:spj/testtypes
opened 06:51PM - 30 Aug 15 UTC
JuliaLang:master
← ScottPJones:spj/testchar
opened 12:30PM - 31 Aug 15 UTC
Add tests for getindex, bswap, ndims, size and typemin
Note: I noticed a number of inconsistencies that should probably be dealt with in a post-0.4 PR.
`getindex('c',1,1,1)` is allowed, and returns `'c'`, but `getindex("c",1,1,1)` gets an error.
`bswap` on a `Char` should probably not be allowed, the operation only makes sense on the underlying codeunit, i.e. `UInt32`, not on `Char`.
JuliaLang:master
← ScottPJones:spj/testbasic
opened 02:54AM - 01 Sep 15 UTC
These tests should hopefully bring the coverage of strings/basic.jl to 100%