Hi!
I have a folder where new files are constantly being added and I would like to know the best way to find the name of the last file saved. Thus far I am doing something like
files = readdir(folder, join = true)
lastsvdfile = files[argmax(mtime.(files))]
But there are millions of files in the folder and it takes a bit of time each time it is called. Just wondering if there is a better way to go about it? I thought maybe if I set sort=false, readdir might return the files in a chronological list, but the result is still alphabetical. Thanks so much!
You might find specific filesystems that support this kind of query more efficiently, but I doubt there is any portable way to do it besides querying the mtime of all the files individually as you are doing now.
You could avoid allocating so many temporary arrays. For example, you could simply do
lastsvdfile = argmax(mtime, readdir(folder, join=true, sort=false))
which allocates only one array, for the result of readdir. (See also julia#27450 for discussion of avoiding the allocation from readdir.) But I’m guessing that the time here is dominated by the filesystem accesses and not by the array allocations.
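For reference, here is a minimal sketch of that single-pass version wrapped in a function, assuming Julia 1.7 or later (needed for the two-argument form of argmax). The function name latest_file and the demo file names are made up for illustration. It still calls mtime once per entry, so it should not be expected to be fundamentally faster on millions of files.

```julia
# Return the path of the most recently modified entry in `folder`,
# or `nothing` if the folder is empty (argmax throws on an empty collection).
function latest_file(folder::AbstractString)
    files = readdir(folder; join=true, sort=false)  # the only array allocated
    isempty(files) && return nothing
    return argmax(mtime, files)  # two-argument argmax needs Julia 1.7+
end

# Small demo in a temporary directory (hypothetical file names).
demo = mktempdir()
write(joinpath(demo, "b_first.dat"), "old")
sleep(1.1)  # keep the mtimes distinct even on filesystems with 1 s resolution
write(joinpath(demo, "a_second.dat"), "new")

newest = latest_file(demo)
println(basename(newest))
```

The newest file here sorts first alphabetically, so the result does not depend on the order in which readdir returns the entries.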
If you have millions of files in a directory, then maybe you should store the data in one big file in some format rather than using so many individual files.
Thanks! I originally had a few larger files but I couldn’t figure out how to append incoming data to a partition of an existing arrow table without bringing the entire file into RAM. I’ll try rethinking the format a little better.
HDF5 has more support for this kind of use-case. (In the low-level HDF5 API, you can make a dataset appendable. Presumably this is possible from the Julia HDF5.jl package too but I don’t see explicit mention of it in the documentation? You might have to do a little digging.)
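As a rough sketch of what an appendable dataset might look like from Julia, assuming the HDF5.jl package is installed and that its create_dataset and HDF5.set_extent_dims functions behave as in the package's examples of extendible datasets. The file name, dataset name, chunk size, and the helper append_batch! are all hypothetical, so check the details against the HDF5.jl documentation before relying on this.

```julia
using HDF5  # third-party package, assumed to be installed

path = joinpath(mktempdir(), "stream.h5")  # hypothetical file name

# Create a 1-D dataset with 4 elements and an unlimited maximum length (-1).
# Extendible datasets must be chunked; the chunk size here is arbitrary.
h5open(path, "w") do f
    dset = create_dataset(f, "samples", Float64, ((4,), (-1,)), chunk=(1024,))
    dset[1:4] = [1.0, 2.0, 3.0, 4.0]
end

# Hypothetical helper: grow the dataset, then write only the new elements.
function append_batch!(dset, batch)
    old = size(dset)[1]
    new = old + length(batch)
    HDF5.set_extent_dims(dset, (new,))
    dset[old+1:new] = batch
    return new
end

# Reopen in read/write mode and append without reading the existing data.
h5open(path, "r+") do f
    append_batch!(f["samples"], [5.0, 6.0])
end

samples = h5open(path, "r") do f
    read(f["samples"])
end
```

Only the final read brings the full dataset into memory; the append step writes just the new slice.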
Awesome, I’ll look into it. Thanks so much!