Indeed, the help entry for String is explicit in suggesting the copy, with some reasons given.
help?> String
search: String string StringIndexError Cstring Cwstring bitstring SubString include_string setrounding unsafe_string AbstractString
String(v::AbstractVector{UInt8})
Create a new String object from a byte vector v containing UTF-8 encoded characters. If v is Vector{UInt8} it will be truncated to
zero length and future modification of v cannot affect the contents of the resulting string. To avoid truncation use
String(copy(v)).
When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the case
for byte vectors returned by take! on a writable IOBuffer and by calls to read(io, nb). This allows zero-copy conversion of I/O
data to strings. In other cases, Vector{UInt8} data may be copied, but v is truncated anyway to guarantee consistent behavior.
────────────────────────────────────────────────────────────────────────────────
String(s::AbstractString)
Convert a string to a contiguous byte array representation encoded as UTF-8 bytes. This representation is often appropriate for
passing strings to C.
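To make the docstring's warning concrete, here is a minimal sketch of the two behaviours it describes. The sample bytes and the regex are only illustrative assumptions, not taken from the original question.

```julia
# A byte vector holding the UTF-8 encoding of "key=value" (illustrative data).
v = Vector{UInt8}("key=value")

# Copy first: the String owns the copy, and v keeps its contents.
s = String(copy(v))
@assert s == "key=value"
@assert length(v) == 9

# Regex methods then work on the resulting String as usual.
m = match(r"(\w+)=(\w+)", s)
@assert m.captures == ["key", "value"]

# Without the copy, v is truncated to zero length, as the docstring warns.
t = String(v)
@assert t == "key=value"
@assert isempty(v)
```

The copy costs one allocation per conversion, which is the overhead a regex interface working directly on byte vectors would avoid.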
Certainly the low-level PCRE library supports this, so it would be possible to implement by replicating or generalizing some of the code in Base.
(This has come up a few times; I've often thought that it might be useful to put together a StringView type that wraps around any a::AbstractVector{UInt8} with stride(a,1) == 1 and exposes the regex methods etc.)
Sure, I'll try out this approach. If I get anywhere with it I'll open a PR, since I imagine it's generally useful functionality that more people could benefit from.
I think this feature request is a symptom of a much broader problem that should be solved instead: the Julia stdlib still lacks a version of String suitable for processing binary and other non-UTF-8 string data, with all the facilities that String offers, including, but by no means limited to, regular expressions. I would therefore suggest that feature request #37979 is a more generic solution to the same problem, namely making the entire AbstractString API (including regex) easily available for processing byte sequences where Unicode is not of interest, by adding a binary/byte/basic-latin sibling of String, which could be called BString. Basically a 1 byte = 1 character version of String, without a UTF-8 decoder running behind the scenes all the time. Vector{UInt8} (mutable) seems more like a workaround for the lack of an immutable binary string type.
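As a rough illustration of the "1 byte = 1 character" idea, the sketch below defines a hypothetical BStringSketch type that implements the minimal AbstractString interface over a byte vector. The type name, and the choice to map each byte to the code point of the same value (i.e. Latin-1), are my assumptions for illustration; this is not the design proposed in #37979.

```julia
# Hypothetical type for illustration only; not part of Base or any stdlib.
struct BStringSketch <: AbstractString
    data::Vector{UInt8}
end

# Minimal AbstractString interface: every byte is one code unit and one character.
Base.ncodeunits(s::BStringSketch) = length(s.data)
Base.codeunit(::BStringSketch) = UInt8
Base.codeunit(s::BStringSketch, i::Integer) = s.data[i]
Base.isvalid(s::BStringSketch, i::Integer) = 1 <= i <= length(s.data)

# Assumption: byte b is interpreted as the code point U+00b (Latin-1), no UTF-8 decoding.
function Base.iterate(s::BStringSketch, i::Int=1)
    i > length(s.data) && return nothing
    return (Char(s.data[i]), i + 1)
end

s = BStringSketch(UInt8[0x61, 0xe9, 0x00])
@assert length(s) == 3                  # one character per byte
@assert s[2] == 'é'                     # 0xe9 read as U+00E9
@assert collect(s) == ['a', 'é', '\0']
```

With only these methods, the generic AbstractString fallbacks (indexing, length, iteration) already work. Two gaps remain in the sketch: the wrapped vector is still mutable, so this is not the immutable type asked for above, and as far as I know Base's regex methods accept only String and SubString{String}, so regex support would still need separate work.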