String type that wraps an AbstractVector?

Do we have a string type that wraps a (sub-)vector of bytes in some package? Something a bit safer than WeakRefString (so not pointer based)?

julia> String(rand(UInt8, 16))
"9#N\xab>x\xe2VD\xa2\x84-B\r\x15\xbd"

From the question, I would guess that the following behavior is not desired:

julia> a = rand(UInt8, 16);

julia> b = String(a)
"-\xfb\xe2d\x922]\xbc\xac\xc2\xf1\xa6\xa7F3;"

julia> a
0-element Array{UInt8,1}

but rather something like a mutable String:

mutable struct MutableString{A <: AbstractVector{<:UInt8}}
    v::A
end

But the point is that AbstractString must not be mutable.

So you should rather write:

a = rand(UInt8, 16)
b = String(copy(a))

I think.
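To make the trade-off concrete, here is a small sketch using fixed bytes instead of random ones (the variable names are only illustrative). It checks that copying first leaves the original vector intact, and that calling String on a view also leaves the parent alone, since for a non-Vector argument String copies the bytes into a new String (so it still allocates).

```julia
a = UInt8[0x61, 0x62, 0x63, 0x64]   # bytes of "abcd"

# Copy first: String takes ownership of the copy, not of `a`.
b = String(copy(a))

# A view is not a Vector{UInt8}, so String copies its bytes into a new String.
c = String(view(a, 2:3))

println(b, " ", c, " ", length(a))
```

Both variants keep `a` usable, but each one allocates a new String.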

Perhaps the original poster is looking for codeunits.
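For reference, a minimal sketch of what codeunits does: it wraps an existing string in a read-only byte vector without copying, so it goes from string to bytes and not the other way round.

```julia
s = "héllo"
cu = codeunits(s)            # wraps `s`, no copy of the bytes

# 'é' takes two bytes in UTF-8, so there are more code units than characters.
println(length(s), " characters, ", length(cu), " code units")
println(cu[1] == UInt8('h'))
```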

Correct - this is for decoding some binary data with embedded strings (representing C++ type information, it’s ROOT files with custom streamers). I want to check what’s in these strings with minimum memory allocation and copies, so I’d like to just wrap a SubArray into something that implements an AbstractString. But the string may be passed around a bit (in a very limited fashion), so I’d like to avoid WeakRefString.

I’m kinda looking for the reverse of codeunits: A copy-free way to wrap a sub-array into something string-like, without invalidating the original array.

If I understand it correctly, this is deliberately disallowed because strings are treated as immutable. It would not be hard to create an AbstractVector with the behavior you are talking about.

Perhaps you could describe a bit about what you are trying to achieve? It seems to me that you could just go ahead and do whatever you like with a Vector{UInt8} and later call String on it. This is only inadequate if you need to repeatedly manipulate the string. Unicode gets a bit tricky, but again, it depends on what you are trying to do.

It should be possible to create a StringView type that takes a region of a byte array and wraps it to act like a string. One would want to reuse much of the code for String and SubString{String}, which is non-trivial, so the ideal way to do this might require a little refactoring of the code to make it possible to reuse or it might be easier to add StringView to Base and add it to the dispatch on the relevant methods.

Sure, normally we definitely want Strings to be immutable. However, this is for a binary parsing application, and I’m checking the contents of strings embedded in a buffer that will be destroyed afterwards. In the interest of performance, I don’t want to turn these embedded strings into actual Strings, as that would result in unnecessary memory allocation.

WeakRefString has been created for scenarios like this, but in my case I don’t want something pointer based, at least not explicitly (I may use UnsafeArrays at a higher level). Of course it’s not hard to create something similar that is backed by an AbstractVector{UInt8} instead of a Ptr{UInt8} - I just wondered if someone had already done so in some package, to avoid duplication of work.

Kinda like that, yes. I just wondered if anybody had already done something like it. Basically something like

struct StringView{BV<:AbstractVector{<:UInt8}}
    data::BV
end

When used with an Array or SubArray, it would be completely safe (GC-wise), but immutability of the string’s content would not be guaranteed. When used with an UnsafeArray, it would be an allocation-free bitstype like WeakRefString, but still guarded by the automatic GC.@preserve of UnsafeArrays.@uviews.
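A rough sketch of how such a wrapper could implement the AbstractString interface. Everything here is hypothetical: the name AsciiView is made up for illustration, the code assumes 1-based byte vectors and ASCII-only content (every byte below 0x80), and a real implementation would need proper UTF-8 decoding in iterate and isvalid.

```julia
# Hypothetical ASCII-only string view over any byte vector.
struct AsciiView{BV<:AbstractVector{UInt8}} <: AbstractString
    data::BV
end

# Minimal AbstractString interface, assuming each byte is one ASCII character.
Base.ncodeunits(s::AsciiView) = length(s.data)
Base.codeunit(::AsciiView) = UInt8
Base.codeunit(s::AsciiView, i::Integer) = s.data[i]
Base.isvalid(s::AsciiView, i::Integer) = 1 <= i <= ncodeunits(s)
function Base.iterate(s::AsciiView, i::Int=1)
    i > ncodeunits(s) && return nothing
    return (Char(s.data[i]), i + 1)
end

buf = Vector{UInt8}("xxTObjectyy")   # stand-in for a binary buffer
name = AsciiView(view(buf, 3:9))     # wraps bytes 3..9 without copying them
matches_before = name == "TObject"   # generic string comparison via iterate

# The view is not immutable: writes to the buffer show through.
buf[3] = UInt8('X')
matches_after = name == "XObject"
println(matches_before, " ", matches_after)
```

The last two lines show the caveat mentioned above: the wrapper is GC-safe, but nothing stops the underlying bytes from changing while the view is in use.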

Update: There is now a package for this: StringViews.jl (JuliaStrings/StringViews.jl on GitHub), which provides string-like views of arbitrary Julia byte arrays.

It has some code duplication with Base, but it would be possible to reduce this significantly by a tiny amount of refactoring in a future Julia version.

Thanks a lot, @stevengj!