Best way to find the last saved file in a folder?

phantom · June 1, 2023, 5:39pm

Hi!

I have a folder where new files are constantly being added and I would like to know the best way to find the name of the last file saved. Thus far I am doing something like

files = readdir(folder, join = true)
lastsvdfile = files[argmax(mtime.(files))]

But there are millions of files in the folder, and each call takes a while. Is there a better way to go about it? I thought that if I set sort=false, readdir might return the files in chronological order, but the result is still alphabetical. Thanks so much!

stevengj · June 1, 2023, 5:49pm

You might find specific filesystems that support this kind of query more efficiently, but I doubt there is any portable way to do it besides querying the mtime of all the files individually as you are doing now.

You could avoid allocating so many temporary arrays. e.g. you could simply do

lastsvdfile = argmax(mtime, readdir(folder, join=true, sort=false))

which only allocates 1 array for the result of readdir. (See also julia#27450 for discussion of avoiding the allocation from readdir.) But I’m guessing that the time here is dominated by the filesystem accesses and not by the array allocations.
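
For reference, the same scan can be written as an explicit loop. This is only a sketch: newest_file is a hypothetical helper name, not something from Base, and it still makes one mtime (stat) call per entry, so it should not be expected to run faster than the one-liner. What it adds is that it returns nothing for an empty folder instead of throwing, and it does not rely on the two-argument form of argmax.

```julia
# Hypothetical helper (not part of Base): return the full path of the most
# recently modified entry in `folder`, or `nothing` if the folder is empty.
function newest_file(folder::AbstractString)
    newest_path = nothing
    newest_time = -Inf
    # sort=false skips the alphabetical sort, which is not needed here
    for name in readdir(folder; sort=false)
        path = joinpath(folder, name)
        t = mtime(path)              # one filesystem query per entry
        if t > newest_time
            newest_time = t
            newest_path = path
        end
    end
    return newest_path
end
```

Usage would be something like lastsvdfile = newest_file(folder), followed by a check for nothing before using the result.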

If you have millions of files in a directory, then maybe you should be storing the data in one big file in some format rather than using so many individual files.

phantom · June 1, 2023, 6:02pm

Thanks! I originally had a few larger files but I couldn’t figure out how to append incoming data to a partition of an existing arrow table without bringing the entire file into RAM. I’ll try rethinking the format a little better.

stevengj · June 1, 2023, 6:05pm

HDF5 has more support for this kind of use-case. (In the low-level HDF5 API, you can make a dataset appendable. Presumably this is possible from the Julia HDF5.jl package too but I don’t see explicit mention of it in the documentation? You might have to do a little digging.)

phantom · June 1, 2023, 6:10pm

awesome I’ll look into it. Thanks so much!
