Why does stringifying an array of UInt8s empty it?

Is this result documented?
What is the reason why v is emptied?

julia> v=[0x41,0x42]
2-element Vector{UInt8}:
 0x41
 0x42

julia> String(v)
"AB"

julia> v
UInt8[]

PS
Is there a more efficient and side-effect-free way to transform an array of code units into a string?

I found some of the answers

String(v::AbstractVector{UInt8})

  Create a new String object using the data buffer from byte vector v. If v is a Vector{UInt8} it will be truncated to zero        
  length and future modification of v cannot affect the contents of the resulting string. To avoid truncation of Vector{UInt8}     
  data, use String(copy(v)); for other AbstractVector types, String(v) already makes a copy.

  When possible, the memory of v will be used without copying when the String object is created. This is guaranteed to be the      
  case for byte vectors returned by take! on a writable IOBuffer and by calls to read(io, nb). This allows zero-copy conversion    
  of I/O data to strings. In other cases, Vector{UInt8} data may be copied, but v is truncated anyway to guarantee consistent      
  behavior.
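The behavior described in the docstring can be checked with a minimal sketch like the one below. The variable names (w, t, u) are only illustrative, and the view case relies on the docstring's statement that non-Vector inputs are copied.

```julia
# String(v) takes ownership of a Vector{UInt8} and truncates it.
w = [0x41, 0x42]
t = String(w)          # w is emptied by this call

# Copying first leaves the original vector intact.
v = [0x41, 0x42]
s = String(copy(v))    # only the temporary copy is consumed

# Per the docstring, other AbstractVector types are copied automatically,
# so passing a view of v should not truncate v either.
u = String(view(v, 1:2))

println((t, w, s, v, u))
```

The copy costs one extra allocation per string, which is the price of keeping v usable afterwards.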

Perhaps a ‘!’ should be added to the function name, since it mutates its argument.

There is an open issue for exactly this: "String constructor truncates data · Issue #32528 · JuliaLang/julia · GitHub". You can read there about why it was made like this.


For me the problem arose while I was rewriting the various proposed solutions to this problem in different ways and comparing their performance.
It didn’t take me long to find where the problem was, but in much more complex code this “hidden” behavior can be much harder to spot.


function multisbs3(str,sstr, s)
    cstr,csstr,cs=codeunits.([str,sstr, s])
    lcstr,lcsstr,lcs = length.([cstr,csstr,cs])
    l=lcs-(lcstr-lcsstr)
    fp=findfirst(cstr, cs)
    tmp=Vector{UInt8}(undef,l)
    res=Vector{String}(undef,4)
    j=1
    while !isnothing(fp)
        fs,ls=first(fp),last(fp)
        copyto!(tmp,1, cs,1,fs-1)
        copyto!(tmp,fs,csstr)
        copyto!(tmp,ls,cs,ls+1,lcs-ls)
        res[j]=String(copy(tmp))
        j+=1
        fp=findnext(cstr, cs,first(fp)+1)
    end
    res
end

Yes, you can use StringViews.jl.

using StringViews
julia> function multisbsSV(str,sstr, s)
           cstr,csstr,cs=codeunits.([str,sstr, s])
           lcstr,lcsstr,lcs = length.([cstr,csstr,cs])
           l=lcs-(lcstr-lcsstr)
           fp=findfirst(cstr, cs)
           tmp=Vector{UInt8}(undef,l)
           res=Vector{StringView}(undef,lcs-lcsstr)
           j=1
           while !isnothing(fp)
               fs,ls=first(fp),last(fp)
               copyto!(tmp,1, cs,1,fs-1)
               copyto!(tmp,fs,csstr)
               copyto!(tmp,ls,cs,ls+1,lcs-ls)
               res[j]=StringView(tmp)
               j+=1
               fp=findnext(cstr, cs,first(fp)+1)
           end
           @view res[1:j-1]
       end
multisbsSV (generic function with 1 method)

julia> function multisbsSC(str,sstr, s)
           cstr,csstr,cs=codeunits.([str,sstr, s])
           lcstr,lcsstr,lcs = length.([cstr,csstr,cs])
           l=lcs-(lcstr-lcsstr)
           fp=findfirst(cstr, cs)
           tmp=Vector{UInt8}(undef,l)
           res=Vector{String}(undef,lcs-lcsstr)
           j=1
           while !isnothing(fp)
               fs,ls=first(fp),last(fp)
               copyto!(tmp,1, cs,1,fs-1)
               copyto!(tmp,fs,csstr)
               copyto!(tmp,ls,cs,ls+1,lcs-ls)
               res[j]=String(copy(tmp))
               j+=1
               fp=findnext(cstr, cs,first(fp)+1)
           end
           @view res[1:j-1]
       end
multisbsSC (generic function with 1 method)

julia> @btime multisbsSV("BB","A","BBBBBaBBCCDDDDDDDAA")
  497.423 ns (11 allocations: 672 bytes)
5-element view(::Vector{StringView}, 1:5) with eltype StringView:
 "BBBBBaACCDDDDDDDAA"
 "BBBBBaACCDDDDDDDAA"
 "BBBBBaACCDDDDDDDAA"
 "BBBBBaACCDDDDDDDAA"
 "BBBBBaACCDDDDDDDAA"

julia> @btime multisbsSC("BB","A","BBBBBaBBCCDDDDDDDAA")
  637.126 ns (16 allocations: 1.16 KiB)
5-element view(::Vector{String}, 1:5) with eltype String:
 "ABBBaBBCCDDDDDDDAA"
 "BABBaBBCCDDDDDDDAA"
 "BBABaBBCCDDDDDDDAA"
 "BBBAaBBCCDDDDDDDAA"
 "BBBBBaACCDDDDDDDAA"

Vector{StringView} is an abstractly typed container. You probably want

res = StringView{Vector{UInt8}}(undef,lcs-lcsstr)

But I don’t think res[j]=StringView(tmp) does what you want:

It uses the same (mutable) array tmp as the storage for every element res[j], so the next j iteration overwrites the data for the previous StringViews in res.

Note that codeunits.([str,sstr, s]) allocates an array just to assign it to 3 variables. If you want to use this style I would use a tuple, e.g.

cstr,csstr,cs=codeunits.((str,sstr, s))

which is allocation-free.
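As a small sketch of the difference, broadcasting over a Vector produces a new Vector, while broadcasting over a Tuple produces a Tuple that can be destructured directly. The names as_vector and as_tuple and the sample strings are only illustrative.

```julia
str, sstr, s = "BB", "A", "BBBBBaBBCC"

# Broadcasting over an array literal builds a Vector holding the three results.
as_vector = codeunits.([str, sstr, s])

# Broadcasting over a tuple returns a tuple, so no intermediate Vector is built.
as_tuple = codeunits.((str, sstr, s))

# Destructuring works the same way in both cases.
cstr, csstr, cs = as_tuple
lcstr, lcsstr, lcs = length.((cstr, csstr, cs))

println(typeof(as_vector))
println(typeof(as_tuple))
```

Whether the tuple form ends up with zero allocations in a given function still depends on the compiler, so it is worth confirming with @btime or @allocated on the real code.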

I tried, but …

julia>     res = StringView{Vector{UInt8}}(undef,lcs-lcsstr)
ERROR: MethodError: no method matching StringView{Vector{UInt8}}(::UndefInitializer, ::Int64)

I need it as a temporary buffer to put the codeunits to be transformed into output strings.
It gets completely overwritten every cycle, but that’s okay

I could avoid this by writing the various UInt8 vectors into a Vector{Vector{UInt8}} and then broadcasting String over it, but that seems less efficient.

Sorry, I meant:

res = Vector{StringView{Vector{UInt8}}}(undef,lcs-lcsstr)

It’s not okay because you’re not using it as a temporary buffer — you are also using it as the permanent storage for the strings in res. (I’m not sure you understand the “view” concept here?) That’s why all of the strings returned by multisbsSV in your example above were the same.
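The aliasing can be reproduced without StringViews. The sketch below stores either a reference to one reused buffer or a snapshot of it; the names shared and copied are only illustrative, and plain vectors stand in for StringView(tmp), which wraps the same buffer in the same way.

```julia
tmp = Vector{UInt8}(undef, 2)          # one buffer reused on every iteration
shared = Vector{Vector{UInt8}}()       # will hold references to tmp itself
copied = String[]                      # will hold independent snapshots

for byte in (0x41, 0x42)
    fill!(tmp, byte)                   # overwrite the buffer in place
    push!(shared, tmp)                 # stores the same array object again
    push!(copied, String(copy(tmp)))   # copies the current contents
end

println(shared)   # both entries show the last contents written to tmp
println(copied)   # each entry keeps the contents from its own iteration
```

Every element of shared is the same object, so only the last write is visible, which matches the five identical strings returned by multisbsSV above.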

This was my alternative to String(copy()). Because I was unable to define a vector of vectors of type StringView, I ended up with a hybrid that ran into the typical problem where the last “view” written overwrites all previous elements.

It is similar to the problem I would have if I defined
res= fill(fill(0x00,18),19)
I use 18 and 19 because I get bored looking for suitable generic formulas :grinning:

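using StringViews
function multisbsSV(str,sstr, s)
    cstr,csstr,cs=codeunits.((str,sstr, s))
    lcs=length(cs)  # length of the input, used by the last copyto! below
    fp=findfirst(cstr, cs)
    #res = Vector{StringView{Vector{UInt8}}}(undef,lcs-lcsstr)
    # one independent buffer per result; 18 and 19 are hard-coded for the example input
    res=[fill(0x00,18) for _ in 1:19]
    j=1
    while !isnothing(fp)
        fs,ls=first(fp),last(fp)
        copyto!(res[j],1, cs,1,fs-1)       # bytes before the match
        copyto!(res[j],fs,csstr)           # the replacement
        copyto!(res[j],ls,cs,ls+1,lcs-ls)  # bytes after the match
        j+=1
        fp=findnext(cstr, cs,first(fp)+1)
    end
    StringView.(res[1:j-1])
end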

PS
I tried to use this definition, but it gives me problems when I have to use res[j] inside copyto!().

res = Vector{StringView{Vector{UInt8}}}(undef,lcs-lcsstr)

julia> multisbsSV("BB","A","BBBBBaBBCCDDDDDDDAA")
ERROR: UndefRefError: access to undefined reference

In fact I have some doubts about the general validity of preallocating an array of arrays with undefined elements.
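A minimal sketch of what happens with an undef array of arrays: the slots hold no object until something is assigned, so reading res[j] fails, but once each slot gets its own buffer it can be used with copyto! as usual. The sizes and names here are only illustrative.

```julia
n = 3
res = Vector{Vector{UInt8}}(undef, n)   # n slots, none of them assigned yet

slot_ready = isassigned(res, 1)         # false: there is no inner vector yet

# Reading an unassigned slot throws, which is what copyto!(res[j], ...) runs into.
err = try
    res[1]
    nothing
catch e
    e
end

# Assign a fresh buffer to each slot before writing into it.
for j in 1:n
    res[j] = zeros(UInt8, 2)
end

copyto!(res[1], 1, codeunits("AB"), 1, 2)
first_string = String(copy(res[1]))

println((slot_ready, typeof(err), first_string))
```

So preallocating the outer array with undef is valid, as long as each element is assigned before it is read. The same applies to a Vector{StringView{Vector{UInt8}}}: res[j] has to be set to a StringView wrapping its own buffer before anything can be copied through it.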