How to mmap a string

I have a really large text file with one line in it. For slightly smaller files I just read it in with:

line = readline(filename)

I then process line from left to right. As a toy example:

for i in 1:10
    println(line[i])
end

I would like to do the same thing but with mmap this time instead of readline. How can I do that? I tried playing with variants of:

using Mmap
length = Base.stat(filename).size
s = open(filename)
L = Mmap.mmap(s, String, length)

but none of my efforts work and I am not sure what the right thing to do is. This attempt gives:

ERROR: MethodError: no method matching mmap(::IOStream, ::Type{String}, ::Int64)

It seems the String type is not permitted as it is not a “bits-type”. Let me show the code that I use to process the string (written by Jakob Nissen).

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

machine = (function ()
    # Primitives
    start = re"\["
    stop = re"\]"
    sep = re"," * re.opt(re.space())
    number = re"[0-9]+"
    numelem = number * (sep | stop)
    elems = re"[^\[]+" * re.rep(start | stop | sep | numelem)
    start.actions[:enter] = [:start]
    stop.actions[:enter] = [:stop]
    number.actions[:enter] = [:mark]
    number.actions[:exit] = [:number]
    return Automa.compile(elems)
end)()
actions = Dict(
    :start => quote
        # level > 1 && error("X")
        level == 1 && (inner = Int32[])
        level += 1
    end,
    :stop => quote
        # level == 0 && error("")
        level == 2 && push!(outer, inner)
        level -= 1
        level == 0 && (done = true)
    end,
    :mark => :(mark = p),
    :number => quote
        n = Int32(0)
        @inbounds for i in mark:p-1
            n = n * 10 + Int32(data[i] - 0x30)
        end
        push!(inner, n)
    end
)
context = Automa.CodeGenContext()
@eval function parsestring(data::Union{String,Vector{UInt8}})
    mark = 0
    level = 0
    done = false
    inner = Int32[]
    outer = Vector{Int32}[]
    $(Automa.generate_init_code(context, machine))
    p_end = p_eof = lastindex(data)
    $(Automa.generate_exec_code(context, machine, actions))
    if (cs != 0) & (!done)
        error("failed to parse on byte ", p)
    end
    return outer
end

How can I process the file without first reading the whole thing in? Is it impossible to use mmap for this?

Looking at the manual:

https://docs.julialang.org/en/v1/stdlib/Mmap/

The second parameter, the type, must be an Array or a BitArray type.

I’m not 100% sure about this, but you might try:

using Mmap
size = Base.stat(filename).size
s = open(filename)
L = Mmap.mmap(s, Vector{UInt8}, size)
str = String(L)

String(L) “takes” the buffer for the string. I’m not sure what happens under the covers but I think there is a good chance str now points to the memory mapped region. You would probably have to watch the memory usage with a large file to be sure String(L) didn’t just copy the buffer.

I also worry that this is an unexpected use of mmap, so even if it works, it might stop working in a future release…

No, that won’t happen; it’s impossible. Strings are immutable. If you need a string, you either need to copy the content or find a string type that references an underlying array.

Okay, I will say that after doing that String(L), length(L) does equal 0, so I guess it’s copying the buffer and then shrinking L for consistency’s sake?

Thank you. Although I am definitely confused now. I just tried it and:

s = open(filename)
size = Base.stat(filename).size
L = Mmap.mmap(s, Vector{UInt8}, size)
str = String(L)

works. But it seems that I have now explicitly read the entire file into the variable str, which was exactly what I was trying to avoid. Has L also read in the entire file?

I was hoping mmapping would avoid my having to read in the whole file in one go.

Ah this sounds bad. Is it hard to make a string type that references an underlying array?

Yeah, String(L) appears to copy the buffer if the buffer is mmapped… If it were just a normal array, it would take the buffer from the Vector without the copy, I believe.

Maybe you can access small ranges of L and reuse the memory?

For example, I created a file with the text This is a simple test, and I can do:

julia> String(L[10:13])
" sim"

Of course, if the string is anything but ASCII you’ll have to be more careful with the indexing.
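To make the "small ranges" idea concrete, here is a minimal sketch that walks the mapped bytes in fixed-size windows and only turns one window at a time into a String. The demo file, its path, and the window size of 8 are assumptions made up for illustration, and it assumes ASCII content, since a window boundary could otherwise fall in the middle of a multi-byte character.

```julia
using Mmap

# Hypothetical demo file, created only for this example.
path = joinpath(mktempdir(), "demo.txt")
write(path, "This is a simple test")

chunks = open(path) do io
    L = Mmap.mmap(io)   # Vector{UInt8} covering the whole file
    # L[r] copies just that window into a fresh Vector, which String then takes over.
    [String(L[r]) for r in Iterators.partition(1:length(L), 8)]
end
```

Each String here holds at most 8 bytes, so the whole file never has to be materialized as one String.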

Since all of the patterns you are looking for are based on ASCII characters such as [, you can easily scan the raw bytes (an mmapped Vector{UInt8} array, for example) for these.
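As a rough sketch of scanning the raw bytes, something like the following counts how often a given ASCII byte occurs in a memory-mapped file without ever constructing a String. The helper name count_byte and the demo file are hypothetical, made up for this example.

```julia
using Mmap

# Hypothetical demo file, created only for this example.
path = joinpath(mktempdir(), "brackets.txt")
write(path, "header [[1, 2], [30, 4]]")

# Count occurrences of one byte in a memory-mapped file.
function count_byte(path::AbstractString, target::UInt8)
    open(path) do io
        bytes = Mmap.mmap(io)   # defaults to a Vector{UInt8} over the whole file
        n = 0
        for b in bytes
            n += (b == target)
        end
        return n
    end
end

opens = count_byte(path, UInt8('['))
closes = count_byte(path, UInt8(']'))
```

The same loop shape should work for any single-byte delimiter, because ASCII bytes never appear inside a multi-byte UTF-8 sequence.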

It should not be hard. In fact, that is how String used to be implemented. You can also just use the UInt8 vector.

One issue is that a lot of the regex functionality in Base currently seems to require a String (and makes a String copy for other string types). (We don’t have an abstract type for UTF-8 encoded strings backed by a byte array.)

I am not exactly sure how to do that efficiently where I have to construct integers, for example, by reading in one digit at a time (see the code I included in my question).

You just read in one digit at a time and compute num = num * 10 + digit — this is exactly what the “built-in” parsing code does anyway.
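A minimal sketch of that accumulation over raw bytes might look like this. The function name parse_numbers is made up for illustration, it treats every non-digit byte as a separator, and it does no overflow checking, so it is a simplification rather than a replacement for the Automa parser above.

```julia
# Parse every run of ASCII digits in a byte vector into an Int32,
# using the num = num * 10 + digit accumulation.
function parse_numbers(data::AbstractVector{UInt8})
    numbers = Int32[]
    num = Int32(0)
    in_number = false
    for b in data
        if UInt8('0') <= b <= UInt8('9')
            num = num * Int32(10) + Int32(b - UInt8('0'))
            in_number = true
        elseif in_number
            # A non-digit byte ends the current number.
            push!(numbers, num)
            num = Int32(0)
            in_number = false
        end
    end
    in_number && push!(numbers, num)   # number running up to the end of the data
    return numbers
end

result = parse_numbers(codeunits("[[1, 2], [30, 4]]"))
```

Since the argument is only required to be an AbstractVector{UInt8}, an mmapped Vector{UInt8} could be passed in the same way as the codeunits of a String.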

I would really like an enhancement where, for an mmapped file that is read-only, String does not make a memcpy, viz.:

import Mmap
s=String(Mmap.mmap(open(somefile)))

should just use the backing read-only Vector{UInt8} as is. After all, the OS is guaranteeing that the Vector{UInt8} is immutable (at least within the program).

It already guarantees no copy for read(io, String) and the like…

Here’s a small test to show the copy going on:

import Mmap
# test if String copies memory
open("test.txt", "w") do f
    write(f, "abcdefghijk\n")
end
# open readonly
f = open("test.txt")
size = filesize(f)
v = Mmap.mmap(f)
@assert length(v) === size
s = String(v)
# length of v is now zero
@assert length(v) === 0
# length is 0 but the data is still there!
for i in 1:size
    vs = unsafe_load(pointer(s), i)
    vp = unsafe_load(pointer(v), i)
    @assert vs == vp
end
println("string start...")
print(s)
# you can write to a string if you want ;)
unsafe_store!(pointer(s), 66, 1)
vp = unsafe_load(pointer(s), 1)
@assert vp == 66
@assert s[1] == Char(66)
println("string now...")
print(s)
# can't write to the vector though: ReadOnlyMemoryError
# must be different address from String
unsafe_store!(pointer(v), 66, 1)

What string operations do you need to perform that you couldn’t easily do with a Vector{UInt8}? Probably it wouldn’t be hard to define a new AbstractString type that supports the operations you need and wraps an mmapped array.

(Doing so with the built-in String type would require more low-level plumbing changes.)
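For what it's worth, a wrapper along those lines could be sketched as follows. The type name ByteString is hypothetical, and the sketch assumes ASCII-only data (every byte below 0x80) so that each byte is exactly one character; real UTF-8 decoding would need a more careful iterate and isvalid.

```julia
# Hypothetical wrapper type; ASCII-only by assumption (every byte < 0x80).
struct ByteString{V<:AbstractVector{UInt8}} <: AbstractString
    data::V
end

# The methods the AbstractString docstring lists as required.
Base.ncodeunits(s::ByteString) = length(s.data)
Base.codeunit(::ByteString) = UInt8
Base.codeunit(s::ByteString, i::Integer) = s.data[i]
Base.isvalid(s::ByteString, i::Integer) = 1 <= i <= ncodeunits(s)

function Base.iterate(s::ByteString, i::Integer=1)
    i > ncodeunits(s) && return nothing
    return (Char(s.data[i]), i + 1)
end

# Wrapping an ordinary byte vector here; an mmapped Vector{UInt8} would be wrapped the same way.
s = ByteString(Vector{UInt8}("[1, 2]"))
```

With just these methods, generic fallbacks such as length(s), s[2], and String(s) should work, although String(s) of course makes a copy.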

I’ve done the AbstractString thing (it works). But all the horrible UTF-8 code is attached to String, not Vector{UInt8} or AbstractString, so I had to copy and paste.

There’s C code in there that will take ownership of Vector{UInt8} buffers for zero-copy take!(io) functionality. It just seems it should be extended to read-only mmapped buffers too.

No, there isn’t. The code is simply extracting a string that was hidden in the array; it doesn’t work zero-copy for anything else.

Is the required interface for AbstractString well-defined? I was looking for it the other day, and it at least doesn’t seem to be documented in the manual.

See the docstring ?AbstractString. It is one of the better-documented interfaces.