I need to read in an Arrow file, combine some data, and then overwrite that file. That means I need to release the file lock that is taken when the file is read. How can I guarantee that the lock gets released? I’ve read that clearing out references and gc’ing should work, but I think the test below indicates it doesn’t: it gives the error “opening file “data/test.arrow”: Invalid argument”, which points to the file lock. The lock never seems to go away until I restart the REPL.
Windows 11, Julia 1.9.0, Arrow 2.6.2, DataFrames 1.6.1
using Arrow, DataFrames
function make_file()
    v = rand(10)
    df = DataFrame([v], [:a])
    Arrow.write("data/test.arrow", df)
end

function read_file()
    Arrow.Table("data/test.arrow")
end

function update_file()
    tbl = read_file()
    df = DataFrame(tbl; copycols=true)
    tbl = nothing
    df = nothing
    GC.gc()
    df2 = DataFrame([rand[5]], [:a])
    Arrow.write("data/test.arrow", df2)
end

function test()
    make_file()
    update_file()
end
I even changed read_file to the following and still had the same problem:
function read_file()
    tbl = nothing
    open(filename) do io
        tbl = Arrow.Table(io)
    end
    return tbl
end
This one should work, although I would just write it as Arrow.Table(read(filename)). The issue with Arrow.Table(filename) is that Arrow mmaps the file.
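Roughly, the two ways of opening the file look like this (a minimal sketch using the path from your example; the first keeps a memory map of the file alive, the second only touches the file for the duration of the read call):

using Arrow

# mmaps the file: the Table holds a memory map, which keeps the file
# locked on Windows until that memory is finalized
tbl_mmap = Arrow.Table("data/test.arrow")

# reads the whole file into a byte vector first: no mapping is created,
# so nothing holds the file open afterwards
tbl_mem = Arrow.Table(read("data/test.arrow"))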
Are you sure you ran that code? Because when I copy-paste it, it errors in two spots (rand[5] should be rand(5) and filename is not defined in the second read_file definition).
Right, sorry. I pasted it piecemeal, so it didn’t come out right. Here’s the whole test, along with read_file2 based on your suggestion. Your suggestion is a bit different: it uses a byte array instead of an IOStream. So passing an IOStream doesn’t work either, but reading the file into a byte array first and then handing that to Arrow does.
Is this a bug (or a needed enhancement), though? It’s odd that it seems impossible to release the lock in the mmapped case.
using Arrow, DataFrames
filename = "data/test5.arrow"
function make_file()
    v = rand(10)
    df = DataFrame([v], [:a])
    Arrow.write(filename, df)
end

function read_file()
    tbl = nothing
    open(filename) do io
        tbl = Arrow.Table(io)
    end
    return tbl
end

function read_file2()
    tbl = Arrow.Table(read(filename))
    return tbl
end

function update_file()
    tbl = read_file() # change to read_file2 and then it works
    df = DataFrame(tbl; copycols=true)
    tbl = nothing
    df = nothing
    GC.gc()
    df2 = DataFrame([rand(5)], [:a])
    Arrow.write(filename, df2)
end

function test()
    make_file()
    update_file()
end
It does seem a bit strange; I think this gotcha should be documented at least. There could also be an mmap=false keyword argument, or maybe mmapping just shouldn’t happen by default. On a technical level, the problem is that the mmap isn’t released until the memory associated with it is finalized (e.g. it goes out of scope and gets GC’d, the Julia process exits, or the like).
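You can see the same behaviour with Mmap directly, independent of Arrow (a minimal sketch; the file name is just the one from your example):

using Mmap

io = open("data/test5.arrow")
buf = Mmap.mmap(io)   # map the whole file as a Vector{UInt8}
close(io)             # closing the stream does not undo the mapping
# on Windows the file stays locked here: it can't be deleted or overwritten
buf = nothing
GC.gc()               # the mapping is released only once the array is finalized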
That wouldn’t necessarily be a problem if it worked, and it appears that it is supposed to: this issue mentions exactly that as the expected way to release the mapping. So this does appear to be a bug. I’ll open an issue on Arrow.jl for it.
So, right now, I can’t mmap an Arrow file at all if I ever expect to want to modify that file, which is a problem because I need fast access to the files (and some are tens of MB).
Not being able to release mmapped files on Windows is a known issue. I can’t find the source right now because I’m on mobile, but I believe if you search this forum for “Parquet mmap file release”, you’ll find discussion of the same thing!
I found a workaround: rename the file and then write a new one. I tried renaming to Filesystem.tempname so the renamed files would be cleaned up automatically when the REPL closes, but that didn’t work; I think the locks were still there, so they couldn’t be deleted (I briefly saw an exception as the REPL was closing). So it leaves a mess behind, but at least it “works”.
function update_file()
    tbl = read_file()
    df = DataFrame(tbl; copycols=true)
    df.a[1] = 1.0
    mv(filename, tempname(dirname(filename)))
    Arrow.write(filename, df)
end
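If the leftover files bother you, a later session could sweep them up, since the locks die with the old process. A hypothetical helper along these lines (cleanup_stale is my own name; it assumes tempname produces “jl_”-prefixed names in that directory, so adjust the pattern if that assumption doesn’t hold):

# hypothetical cleanup: delete renamed leftovers from earlier sessions
function cleanup_stale(dir)
    for p in readdir(dir; join=true)
        startswith(basename(p), "jl_") || continue
        try
            rm(p)      # succeeds once the process that held the mmap has exited
        catch
            # still locked by a live process; leave it for next time
        end
    end
end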