I have a really large text file with one line in it. For slightly smaller files I just read it in with:
line = readline(filename)
I then process line from left to right. As a toy example:
for i in 1:10
    println(line[i])
end
I would like to do the same thing but with mmap this time instead of readline. How can I do that? I tried playing with variants of:
using Mmap
nbytes = Base.stat(filename).size  # named nbytes so it does not shadow Base.length
s = open(filename)
L = Mmap.mmap(s, String, nbytes)
but none of my efforts work and I am not sure what the right thing to do is. This attempt gives:
ERROR: MethodError: no method matching mmap(::IOStream, ::Type{String}, ::Int64)
It seems the String type is not permitted as it is not a “bits-type”. Let me show the code that I use to process the string (written by Jakob Nissen).
The second parameter, the type, must be an Array, or a BitArray.
I’m not 100% sure about this, but you might try:
using Mmap
nbytes = Base.stat(filename).size  # file size in bytes; avoids shadowing Base.size
s = open(filename)
L = Mmap.mmap(s, Vector{UInt8}, nbytes)
str = String(L)
String(L) “takes” the buffer for the string. I’m not sure what happens under the covers but I think there is a good chance str now points to the memory mapped region. You would probably have to watch the memory usage with a large file to be sure String(L) didn’t just copy the buffer.
I also worry that this is an unexpected usage of mmap, so even if it works it might stop working in a future release…
No, that won’t happen; it’s impossible. Strings are immutable. If you need a string, you either need to copy the content or find a string type that references an underlying array.
Thank you. Although I am definitely confused now. I just tried it and:
s = open(filename)
nbytes = Base.stat(filename).size  # file size in bytes; avoids shadowing Base.size
L = Mmap.mmap(s, Vector{UInt8}, nbytes)
str = String(L)
works. But it seems that I have now explicitly read the entire file into the variable str, which was exactly what I was trying to avoid. Has L also read in the entire file?
I was hoping mmapping would let me avoid reading in the whole file in one go.
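For what it's worth, the toy loop from the question can be run directly on the mmapped Vector{UInt8} without ever building a String; the OS pages the file in on demand as bytes are touched. The following is only a sketch: `first_chars` is a hypothetical helper name, the temp file stands in for the large file, and converting each byte with `Char` assumes the content is ASCII.

```julia
using Mmap

# Hypothetical helper: return the first `n` bytes of `filename` as Chars.
# Assumes ASCII content, so one byte corresponds to one character.
function first_chars(filename::AbstractString, n::Integer)
    open(filename) do io
        L = Mmap.mmap(io)   # Vector{UInt8} backed by the file
        # The comprehension copies only the bytes that are actually indexed.
        return [Char(L[i]) for i in 1:min(n, length(L))]
    end
end

# Small demo file standing in for the really large one.
path, io = mktemp()
write(io, "[12,[3,45]]")
close(io)

for c in first_chars(path, 10)
    println(c)
end
```

Only the indexed bytes are copied here; the rest of the file is never read explicitly.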
Yeah, String(L) appears to copy the buffer if the buffer is mmapped. If it were just a normal array, it would take the buffer from the Vector without the copy, I believe.
Since all of the patterns you are looking for are based on ASCII characters like [ etcetera, you can easily scan the raw bytes (a mmapped Vector{UInt8} array, for example) for these.
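As a rough illustration of scanning the raw bytes, here is a sketch that counts occurrences of an ASCII delimiter in a mmapped file. `count_byte` is a made-up helper name and the temp file content is invented for the demo; the real patterns would come from the processing code mentioned in the question.

```julia
using Mmap

# Hypothetical helper: count how often the byte `target` occurs in `bytes`.
function count_byte(bytes::AbstractVector{UInt8}, target::UInt8)
    n = 0
    for b in bytes
        n += (b == target)   # Bool adds as 0 or 1
    end
    return n
end

# Demo file standing in for the large input.
path, io = mktemp()
write(io, "[1,[2,3],[4]]")
close(io)

# Map the file as a Vector{UInt8}; no String is created.
bytes = open(path) do f
    Mmap.mmap(f)
end

n_open = count_byte(bytes, UInt8('['))
println(n_open)
```

Because UTF-8 never uses bytes below 0x80 inside a multi-byte sequence, comparing against an ASCII byte like this is safe even if the file contains non-ASCII text elsewhere.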
One issue is that a lot of the regex functionality in Base currently seems to require a String (and makes a String copy for other string types). (We don’t have an abstract type for UTF-8 encoded strings backed by a byte array.)
I am not exactly sure how to do that efficiently where I have to construct integers, for example, by reading in one digit at a time (see the code I included in my question).
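One possible way to build integers a digit at a time from bytes is sketched below. `parse_uint` is a hypothetical helper, not something from Base or from the original processing code; it assumes non-negative decimal numbers and does no overflow checking. `codeunits` is used here only to get a byte vector for the demo; a mmapped Vector{UInt8} would be passed the same way.

```julia
# Hypothetical helper: parse a non-negative decimal integer starting at index `i`.
# Returns (value, index of the first byte after the digits).
# If bytes[i] is not a digit, it returns (0, i).
function parse_uint(bytes::AbstractVector{UInt8}, i::Int)
    n = 0
    while i <= length(bytes) && UInt8('0') <= bytes[i] <= UInt8('9')
        n = 10n + Int(bytes[i] - UInt8('0'))   # shift left one decimal place, add digit
        i += 1
    end
    return n, i
end

bytes = codeunits("[123,45]")        # stands in for a mmapped Vector{UInt8}
value, stop = parse_uint(bytes, 2)   # start just after the '['
println((value, stop))
```

Returning the stop index lets the caller continue scanning from where the number ended, which keeps the whole pass left to right with no intermediate strings.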
I would really like an enhancement where constructing a String from a read-only mmapped file does not make a memcpy, viz.:
import Mmap
s = String(Mmap.mmap(open(somefile)))
should just use the backing read-only Vector{UInt8} as is. After all, the OS guarantees that the Vector{UInt8} is immutable (at least within the program).
It already guarantees no copy for read(io, String) and the like…
Here’s a small test to show the copy going on:
import Mmap
# test if String copies memory
open("test.txt", "w") do f
    write(f, "abcdefghijk\n")
end
# open readonly
f = open("test.txt")
size = filesize(f)
v = Mmap.mmap(f)
@assert length(v) === size
s = String(v)
# length of v is now zero
@assert length(v) === 0
# length is 0 but the data is still there!
for i in 1:size
    vs = unsafe_load(pointer(s), i)
    vp = unsafe_load(pointer(v), i)
    @assert vs == vp
end
println("string start...")
print(s)
# you can write to a string if you want ;)
unsafe_store!(pointer(s), 66, 1)
vp = unsafe_load(pointer(s), 1)
@assert vp == 66
@assert s[1] == Char(66)
println("string now...")
print(s)
# can't write to the vector though: ReadOnlyMemoryError
# must be different address from String
unsafe_store!(pointer(v), 66, 1)
What string operations do you need to perform that you couldn’t easily do with a Vector{UInt8}? Probably it wouldn’t be hard to define a new AbstractString type that supports the operations you need and wraps an mmapped array.
(Doing so with the built-in String type would require more low-level plumbing changes.)
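To make the wrapper idea concrete, here is a minimal sketch of an AbstractString subtype over a byte vector. `ASCIIView` is a hypothetical name, and the big assumption is that the data is pure ASCII, so every byte is a valid character boundary and none of the UTF-8 decoding logic is needed. The four methods defined are the ones the AbstractString docstring names (ncodeunits, codeunit, isvalid, iterate); anything beyond that falls back to Base's generic implementations.

```julia
# Hypothetical ASCII-only string view over any byte vector (e.g. a mmapped one).
struct ASCIIView{V<:AbstractVector{UInt8}} <: AbstractString
    data::V
end

Base.ncodeunits(s::ASCIIView) = length(s.data)
Base.codeunit(s::ASCIIView) = UInt8
Base.codeunit(s::ASCIIView, i::Integer) = s.data[i]
# Assumption: ASCII only, so every in-bounds index starts a character.
Base.isvalid(s::ASCIIView, i::Integer) = 1 <= i <= length(s.data)

function Base.iterate(s::ASCIIView, i::Int=1)
    i > length(s.data) && return nothing
    return (Char(s.data[i]), i + 1)
end

# Demo with an ordinary Vector{UInt8}; a mmapped array would be wrapped the same way.
s = ASCIIView(Vector{UInt8}("abc[42]"))
println(s[4])
println(length(s))
```

Indexing, `length`, and comparison with a String then go through Base's generic AbstractString methods without copying the underlying bytes. As noted above, functions that insist on a String (such as much of the regex machinery) would still make a copy.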
I’ve done the AbstractString thing (it works). But all the horrible UTF-8 code is attached to String, not Vector{UInt8} or AbstractString, so I had to copy and paste.
There’s C code in there that will take ownership of Vector{UInt8} buffers for zero-copy take!(io) functionality. It just seems it should be extended to read-only mmapped buffers too.
Is the required interface for AbstractString well-defined? I was looking for it the other day, and it at least doesn’t seem to be documented in the manual.