How to mmap a string

I have a really large text file with one line in it. For slightly smaller files I just read it in with:

line = readline(filename)

I then process line from left to right. As a toy example:

for i in 1:10
    println(line[i])
end

I would like to do the same thing but with mmap this time instead of readline. How can I do that? I tried playing with variants of:

using Mmap
length = Base.stat(filename).size
s = open(filename)
L = Mmap.mmap(s, String, length)

but none of my efforts work and I am not sure what the right thing to do is. This attempt gives:

ERROR: MethodError: no method matching mmap(::IOStream, ::Type{String}, ::Int64)

It seems the String type is not permitted as it is not a “bits-type”. Let me show the code that I use to process the string (written by Jakob Nissen).

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

machine = (function ()
    # Primitives
    start = re"\["
    stop = re"\]"
    sep = re"," * re.opt(re.space())
    number = re"[0-9]+"
    numelem = number * (sep | stop)
    elems = re"[^\[]+" * re.rep(start | stop | sep | numelem)
    start.actions[:enter] = [:start]
    stop.actions[:enter] = [:stop]
    number.actions[:enter] = [:mark]
    number.actions[:exit] = [:number]
    return Automa.compile(elems)
end)()
actions = Dict(
    :start => quote
        # level > 1 && error("X")
        level == 1 && (inner = Int32[])
        level += 1
    end,
    :stop => quote
        # level == 0 && error("")
        level == 2 && push!(outer, inner)
        level -= 1
        level == 0 && (done = true)
    end,
    :mark => :(mark = p),
    :number => quote
        n = Int32(0)
        @inbounds for i in mark:p-1
            n = n * 10 + Int32(data[i] - 0x30)
        end
        push!(inner, n)
    end
)
context = Automa.CodeGenContext()
@eval function parsestring(data::Union{String,Vector{UInt8}})
    mark = 0
    level = 0
    done = false
    inner = Int32[]
    outer = Vector{Int32}[]
    $(Automa.generate_init_code(context, machine))
    p_end = p_eof = lastindex(data)
    $(Automa.generate_exec_code(context, machine, actions))
    if (cs != 0) & (!done)
        error("failed to parse on byte ", p)
    end
    return outer
end

How can I process the file without first reading the whole thing in? Is it impossible to use mmap for this?

Looking at the manual:

https://docs.julialang.org/en/v1/stdlib/Mmap/

The second parameter, the type, must be an Array or a BitArray.

So I’m not 100% on this but you might try:

using Mmap
size = Base.stat(filename).size
s = open(filename)
L = Mmap.mmap(s, Vector{UInt8}, size)
str = String(L)

String(L) “takes” the buffer for the string. I’m not sure what happens under the covers but I think there is a good chance str now points to the memory mapped region. You would probably have to watch the memory usage with a large file to be sure String(L) didn’t just copy the buffer.

I also worry that this is an unexpected usage of mmap so I would worry (if it works) that it will stop working in a future release…

No, that won’t happen; it’s impossible. Strings are immutable. If you need a string, you either need to copy the content or find a string type that references an underlying array.

Okay, I will say that after doing that String(L), length(L) does equal 0, so I guess it’s copying the buffer and then shrinking L for consistency’s sake?

Thank you. Although I am definitely confused now. I just tried it and:

s = open(filename)
size = Base.stat(filename).size
L = Mmap.mmap(s, Vector{UInt8}, size)
str = String(L)

works. But it seems that I have now explicitly read the entire file into the variable str, which was exactly what I was trying to avoid. Has L also read in the entire file?

I was hoping mmapping would avoid my having to read in the whole file in one go.

Ah this sounds bad. Is it hard to make a string type that references an underlying array?

Yeah the String(L) appears to copy the buffer if the buffer is mmapped…if it was just a normal array it takes the buffer from the Vector without the copy…I believe.

Maybe you can access small ranges of L and reuse the memory?

For example, I created a file with the text This is a simple test, and I can do:

julia> String(L[10:13])
" sim"

Of course, if the string is anything but ASCII you’ll have to be more careful with the indexing.
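To make the idea concrete, here is a minimal sketch that redoes the toy loop from the question on a mmapped byte vector and then copies only a small window into a String. The temporary file and the variable names path and piece are just for illustration, and treating each byte as a character is only valid for ASCII text.

```julia
using Mmap

# Write the sample text to a temporary file (stand-in for the large file).
path, io = mktemp()
write(io, "This is a simple test")
close(io)

# Map the file as bytes; nothing is copied into a String here.
L = open(path) do f
    Mmap.mmap(f, Vector{UInt8}, filesize(f))
end

# Left-to-right processing, one byte at a time (ASCII assumed).
for i in 1:4
    println(Char(L[i]))
end

# L[10:13] copies just four bytes, and String takes that small copy,
# so L itself is left untouched.
piece = String(L[10:13])
```

Because only the slice is copied, the extra memory used is proportional to the window size rather than to the file size.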

Since all of the patterns you are looking for are based on ASCII characters like [ etcetera, you can easily scan the raw bytes (a mmapped Vector{UInt8} array, for example) for these.
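As a rough sketch of scanning the raw bytes, the following walks a mmapped Vector{UInt8} and tracks bracket nesting without ever building a String. The helper name max_depth and the sample file contents are made up for illustration.

```julia
using Mmap

# Hypothetical helper: deepest level of [ ... ] nesting in a byte vector.
function max_depth(data::AbstractVector{UInt8})
    depth = 0
    deepest = 0
    for b in data
        if b == UInt8('[')
            depth += 1
            deepest = max(deepest, depth)
        elseif b == UInt8(']')
            depth -= 1
        end
    end
    return deepest
end

# Small sample file standing in for the real input.
path, io = mktemp()
write(io, "[[1, 2], [30, 4]]")
close(io)

data = open(path) do f
    Mmap.mmap(f, Vector{UInt8}, filesize(f))
end

println(max_depth(data))
```

This works on the bytes directly because in UTF-8 the byte values of ASCII characters such as [ and ] never occur inside a multi-byte character.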

It should not be hard. In fact, it is how String used to be implemented. You can also just use the UInt8 vector.

One issue is that a lot of the regex functionality in Base currently seems to require a String (and makes a String copy for other string types). (We don’t have an abstract type for UTF-8 encoded strings backed by a byte array.)

I am not exactly sure how to do that efficiently where I have to construct integers, for example, by reading in one digit at a time (see the code I included in my question).

You just read in one digit at a time and compute num = num * 10 + digit — this is exactly what the “built-in” parsing code does anyway.
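A minimal sketch of that digit-by-digit accumulation on raw bytes might look like this. The helper name parse_digits is hypothetical, and the sketch does no overflow checking.

```julia
# Hypothetical helper: parse the run of ASCII digits starting at index i.
# Returns the parsed value and the index of the first byte after the digits.
function parse_digits(data::AbstractVector{UInt8}, i::Int)
    n = 0
    while i <= lastindex(data) && UInt8('0') <= data[i] <= UInt8('9')
        n = n * 10 + Int(data[i] - UInt8('0'))
        i += 1
    end
    return n, i
end

bytes = Vector{UInt8}("123, 45]")
n, next = parse_digits(bytes, 1)   # digits "123", then the comma stops the loop
m, after = parse_digits(bytes, 6)  # digits "45", then the bracket stops the loop
```

The same function accepts a mmapped Vector{UInt8}, since it only needs indexing and lastindex.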

I would really like an enhancement where, for a mmapped file that is read-only, String does not make a memcpy, viz.:

import Mmap
s=String(Mmap.mmap(open(somefile)))

should just use the backing read-only Vector{UInt8} as is. After all, the OS is guaranteeing that the Vector{UInt8} is immutable (at least within the program).

It already guarantees no copy for read(io, String) and the like…

Here’s a small test to show the copy going on:

import Mmap
# test if String copies memory
open("test.txt", "w") do f
    write(f, "abcdefghijk\n")
end
# open readonly
f = open("test.txt")
size = filesize(f)
v = Mmap.mmap(f)
@assert length(v) === size
s = String(v)
# length of v is now zero
@assert length(v) === 0
# length is 0 but the data is still there!
for i in 1:size
    vs = unsafe_load(pointer(s), i)
    vp = unsafe_load(pointer(v), i)
    @assert vs == vp
end
println("string start...")
print(s)
# you can write to a string if you want ;)
unsafe_store!(pointer(s), 66, 1)
vp = unsafe_load(pointer(s), 1)
@assert vp == 66
@assert s[1] == Char(66)
println("string now...")
print(s)
# can't write to the vector though: ReadOnlyMemoryError
# must be different address from String
unsafe_store!(pointer(v), 66, 1)

What string operations do you need to perform that you couldn’t easily do with a Vector{UInt8}? Probably it wouldn’t be hard to define a new AbstractString type that supports the operations you need and wraps an mmapped array.

(Doing so with the built-in String type would require more low-level plumbing changes.)
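For what it’s worth, here is a minimal sketch of such a wrapper, restricted to ASCII data so that every byte is one character. The type name ByteString is made up, 1-based indexing of the wrapped vector is assumed, and only the methods listed in the AbstractString docstring are defined; whether every Base string function behaves well on it is not something this sketch checks.

```julia
# Hypothetical type: an ASCII-only string view over a byte vector
# (for example a mmapped Vector{UInt8}). No bytes are copied.
struct ByteString{V<:AbstractVector{UInt8}} <: AbstractString
    data::V
end

# The methods the AbstractString interface asks for.
Base.ncodeunits(s::ByteString) = length(s.data)
Base.codeunit(::ByteString) = UInt8
Base.codeunit(s::ByteString, i::Integer) = s.data[i]
# ASCII assumption: every in-bounds index starts a character.
Base.isvalid(s::ByteString, i::Int) = 1 <= i <= length(s.data)

function Base.iterate(s::ByteString, i::Int=1)
    i > length(s.data) && return nothing
    return Char(s.data[i]), i + 1
end

s = ByteString(Vector{UInt8}("[1, 22]"))
println(s[2])
println(length(s))
```

Handling general UTF-8 would need isvalid and iterate to decode multi-byte sequences, which is the part that is tied to String in Base.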

I’ve done the AbstractString thing (it works). But all the horrible UTF-8 code is attached to String, not Vector{UInt8} or AbstractString, so I had to copy and paste.

There’s C code in there that will take ownership of Vector{UInt8} buffers for zero-copy take!(io) functionality. It just seems it should be extended to read-only mmapped buffers too.

No, there isn’t. The code is simply extracting a string that was hidden in the array; it doesn’t work zero-copy for anything else.

Is the required interface for AbstractString well-defined? I was looking for it the other day, and it at least doesn’t seem to be documented in the manual.

See the docstring ?AbstractString. It is one of the better-documented interfaces.
