Conversion of Vector{UInt8} to String without copy

Hi,

use case: large files with a binary structure and embedded “text” elements.

  • Depending on the file size, we either read them fully into mem, or mmap them. Hence in both cases, the data are fully available in memory.
  • Our files potentially have millions of records
  • Because we know the binary format, we know the offset and length of the embedded text element in each record.
  • We are leveraging the embedded text elements as dictionary keys (or index) to quickly access records within the file.

We seek an approach to access the embedded text in each record without copying it and without resetting the underlying vector{uint8} (please see here). We don’t need to modify the String (embedded text), and we don’t want to reset the underlying vector.

Since we leverage the embedded text elements as dictionary keys, copying them, consumes a lot of unnessary (and in our case) precious memory.

Is there an approach - maybe even unsafe - that allows us to create a String-like object (like SubString) on top of an underlying Vector{UInt8} without copying it and without resetting the underlying uint8? Our input is an index or pointer, and length (number of bytes), which should be sufficient to create a substring-like object.

many thanks for your help

2 Likes

You can always define your own AbstractString subtype for this. What operations do you need to perform on the string? Just dictionary-key lookup?

2 Likes

Yes, only dictionary lookup.

I found meanwhile the WeakRefStrings package and I’m currently experimenting with WeakRefString(pointer(line, pos(this).start), length(this))

They’ve your idea, a subtype of AbstractString.

Then that’s easy, you only need to implement the Base.hash and Base.isequal functions, copying the hash(::String) method so that you get the same hash as for String (if desired). For example, to wrap a Vector{UInt8} in this way:

struct VectorString{T} <: AbstractString where {T<:AbstractVector{UInt8}}
    data::T
end
function Base.hash(s::VectorString, h::UInt)
    h += Base.memhash_seed
    ccall(Base.memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), s.data, length(s.data), h % UInt32) + h
end
Base.isequal(s1::VectorString, s2::VectorString) = s1.data == s2.data
Base.isequal(s1::String, s2::VectorString) = codeunits(s1) == s2.data
Base.isequal(s1::VectorString, s2::String) = isequal(s2, s1) 

I would also tend to overload Base.show for pretty-printing:

Base.show(io::IO, s::VectorString) = show(io, String(copy(s.data)))

and of course you could add other string methods as needed. This gives:

julia> sv = VectorString([0x66, 0x6f, 0x6f, 0x62, 0x61, 0x72])
"foobar"

julia> typeof(sv)
VectorString{Array{UInt8,1}}

julia> hash(sv)
0x54fc7dff7f029834

julia> hash("foobar") # same hash!
0x54fc7dff7f029834

julia> d = Dict{AbstractString,Int}(sv => 3, "blah" => 4, "foo" => 5)
Dict{AbstractString,Int64} with 3 entries:
  "blah"   => 4
  "foobar" => 3
  "foo"    => 5

julia> d[sv]
3

julia> d["foobar"] # can lookup by String even if key is StringVector
3

julia> sv2 = VectorString(codeunits("blah"))
"blah"

julia> typeof(sv2) # codeunits returns a type of AbstractVector{UInt8}
VectorString{Base.CodeUnits{UInt8,String}}

julia> d[sv2] # can look up by StringVector even if key is String
4

julia> sv3 = VectorString(@view sv.data[1:3]) # also works with subarrays/views
"foo"

julia> typeof(sv3)
VectorString{SubArray{UInt8,1,Array{UInt8,1},Tuple{UnitRange{Int64}},true}}

julia> d[sv3]
5

You could even do the same thing with raw pointers if needed, but this is easy to get wrong (you’ll want to read about garbage collection and GC.@preserve if you are ever working with raw pointers). This is why the WeakRefStrings.jl is explicitly documented as potentially unsafe and “is discouraged for general users”.

6 Likes

I’ve often felt there should be package (VectorStrings.jl?) for this: a safe (unlike WeakRefStrings), full-featured string type that wraps an arbitrary AbstractVector{UInt8}. This also came up recently in discussions of mmapping strings.

It wouldn’t be too hard to write; I’d be willing to help out if someone wanted to start such a package (under https://github.com/JuliaStrings).

There would be a fair amount of copy-and-paste code from Base.String, but that’s not so terrible — the String code has been stable for a long while. (Alternatively, one could add an AbstractByteString type to Base and generalize a bunch of String methods, but that’s not going to happen any time soon; it would be better to try it out in a package first and then potentially merge into Base later if it proves very useful.

1 Like

Update: I created a draft package for this:

2 Likes

Thanks Steven, I’ll give it a try. Anything I can help with, despite that I’m new to Julia?

Hi Steven, we’ve updated our project and StringView is working great. We did not have any issues and the performance is en par with our previous code. Thanks a lot for your help.

Juergen

1 Like