How to release lock on Arrow table

I need to read in an Arrow file, combine some data, and then overwrite that file. That means I need to release the file lock that is taken when you read. How can I guarantee that the lock is released? I’ve read that clearing out references and gc’ing should work, but I think the test below indicates it doesn’t. It gives the error “opening file "data/test.arrow": Invalid argument”, which indicates the file-lock issue. The lock never seems to go away until I restart the REPL.

Windows 11, Julia 1.9.0, Arrow 2.6.2, DataFrames 1.6.1

using Arrow, DataFrames
function make_file()
    v = rand(10)
    df = DataFrame([v], [:a])
    Arrow.write("data/test.arrow", df)
end
function read_file()
    Arrow.Table("data/test.arrow")
end
function update_file()
    tbl = read_file()
    df = DataFrame(tbl; copycols=true)
    tbl = nothing
    df = nothing
    GC.gc()
    df2 = DataFrame([rand[5]], [:a])
    Arrow.write("data/test.arrow", df2)
end
function test()
    make_file()
    update_file()
end

I even changed read_file to the following and still had the same problem:

function read_file()
    tbl = nothing
    open(filename) do io
        tbl = Arrow.Table(io)
    end
    return tbl
end

This one should work, although I would just write it as Arrow.Table(read(filename)). The issue with Arrow.Table(filename) is that Arrow mmaps the file.

Are you sure you ran that code? Because when I copy-paste it, it errors in two spots (rand[5] should be rand(5) and filename is not defined in the second read_file definition).

Right, sorry. I pasted it piecemeal and so it didn’t come out right. Here’s the whole test, along with read_file2 based on your suggestion. Your suggestion is different, though: it uses a byte array instead of an IOStream. So… reading through an IOStream doesn’t work either, but reading the file into bytes first before giving it to Arrow does.

Is this a bug (or needed enhancement), though? It’s odd that it seems impossible to release the lock for the mmap’ed case.

using Arrow, DataFrames
filename = "data/test5.arrow"
function make_file()
    v = rand(10)
    df = DataFrame([v], [:a])
    Arrow.write(filename, df)
end
function read_file()
    tbl = nothing
    open(filename) do io
        tbl = Arrow.Table(io)
    end
    return tbl
end
function read_file2()
    tbl = Arrow.Table(read(filename))
    return tbl
end
function update_file()
    tbl = read_file() # change to read_file2 and then it works
    df = DataFrame(tbl; copycols=true)
    tbl = nothing
    df = nothing
    GC.gc()
    df2 = DataFrame([rand(5)], [:a])
    Arrow.write(filename, df2)
end
function test()
    make_file()
    update_file()
end

It does seem a bit strange; I think this gotcha should at least be documented. There could also be an mmap=false keyword argument, or maybe mmapping just shouldn’t happen by default. On a technical level, the problem is that the mmap isn’t released until the memory associated with it is finalized (e.g. it goes out of scope and gets GC’d, or the Julia process exits, or such).
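To illustrate that finalization mechanism with plain Mmap (no Arrow involved), here’s a hedged sketch; note that on Windows, per the issue linked further down in this thread, even explicit finalization may not fully release the file:

```julia
using Mmap

# Sketch of the finalization mechanism (not Arrow-specific).
path = tempname()
write(path, collect(UInt8(1):UInt8(16)))

io = open(path)
buf = Mmap.mmap(io, Vector{UInt8})  # the mapping keeps the file contents accessible
close(io)                           # closing the stream alone does not unmap

# The mapping is released when `buf` is finalized: either by the GC once
# nothing references it, or explicitly (after which `buf` must not be used):
finalize(buf)

# In principle the file can now be replaced or deleted; on Windows a
# duplicated handle may still be held, which is the bug in question.
rm(path)
```

On Linux and macOS this runs cleanly; the thread’s whole point is that the last two steps don’t behave this way on Windows.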

That wouldn’t necessarily be a problem if it worked, and it appears that it is supposed to. This issue mentions that as the expected way to release it, so… this does appear to be a bug. I’ll open an issue on Arrow.jl for it.

So, right now, I can’t mmap an Arrow file anywhere at all if I ever expect to modify that file, which is a problem because I need fast access to the files (and some are tens of MB).

Not being able to release mmapped files on Windows is a known issue. I can’t find the source right now because I’m on mobile, but I believe if you search this forum for “Parquet mmap file release”, you’ll find discussion of the same thing!

I think I found the issue: close() on an mmapped io / iostream does not actually fully close the file on Windows · Issue #49961 · JuliaLang/julia · GitHub
It’s Mmap that is the problem.

I found a workaround: rename the file and then write a new one. I tried renaming to a Filesystem.tempname path so it would be cleaned up automatically when closing the REPL, but that didn’t work. I think the locks were still held, so the files couldn’t be deleted (I saw an exception flash by as the REPL was closing). So it leaves a mess behind, but at least it “works”.

function update_file()
    tbl = read_file()
    df = DataFrame(tbl; copycols=true)
    df.a[1] = 1.0
    mv(filename, tempname(dirname(filename)))
    Arrow.write(filename, df)
end

Just a note: it might be obvious, but that mv(...) call must stay on the same drive. mv to a different drive copies the file and then tries to delete the original, which fails while the lock is held.
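One way to contain the mess: since the leftover files come from tempname, which (on current Julia versions, as far as I know) generates names with a jl_ prefix, you could sweep them on startup before anything is mmapped. A hedged sketch; cleanup_stale_temps is a hypothetical helper and the jl_ naming is an assumption:

```julia
# Hypothetical helper: delete leftover rename targets from earlier sessions.
# Safe to run at startup, before any file in `dir` has been mmapped,
# since no locks are held yet at that point.
function cleanup_stale_temps(dir::AbstractString)
    for f in readdir(dir; join=true)
        # Assumption: tempname() generates names starting with "jl_".
        if startswith(basename(f), "jl_") && isfile(f)
            rm(f; force=true)
        end
    end
end
```

Calling this once when the app starts should keep the data directory from accumulating stale temp files across sessions.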