I need to list a large number of files and parse their names. My function works, but it is about 67 times slower on Windows, even though the Windows machine is the more powerful computer. The number of files is unknown and the user can select the recursion depth, so I have to separate files from folders. The bottleneck appears to be isdir. The files are on a shared network drive and both computers are accessing the same data.
Is there a way to speed this up?
Linux
julia> using BenchmarkTools
julia> @btime isdir.(readdir("/home/user/data/2022/Q4",join=true))
103.151 ms (17677 allocations: 2.06 MiB)
4416-element BitVector:
Windows
julia> using BenchmarkTools
julia> @btime isdir.(readdir("D:\\data\\2022\\Q4",join=true))
6.993 s (79501 allocations: 5.36 MiB)
4416-element BitVector:
Linux Computer Info
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 4 on 4 virtual cores
Windows Computer Info
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
Threads: 7 on 16 virtual cores
Environment:
JULIA_NUM_THREADS = 7
When I profiled my function, the caller that took most of the time was isdir, but the lowest-level function underneath it was indeed stat. The overhead was enormous on Windows but minor on Linux.
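To illustrate why the profile bottoms out in stat: as far as I can tell, in Julia 1.8 isdir(path) simply forwards to isdir(stat(path)), so broadcasting isdir over N paths costs N separate stat calls. The sketch below uses a hypothetical helper, path_info (not part of Base), to show that a single stat result can answer several questions about a path at once.

```julia
# Hypothetical helper (not part of Base): query the file system once
# and reuse the resulting StatStruct for several checks.
function path_info(path::AbstractString)
    st = stat(path)  # the only file-system query made for this path
    return (isdir = isdir(st), isfile = isfile(st), size = filesize(st))
end

# Small demonstration on a temporary directory.
mktempdir() do dir
    file = joinpath(dir, "data.txt")
    write(file, "abc")
    @show path_info(file)
    @show path_info(dir).isdir
end
```

This does not make the individual stat call any cheaper on Windows; it only avoids paying for it more than once per path.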
Somewhat old comment but potentially still relevant:
From there:
Linux has a top-level directory entry cache that means that certain queries (most notably stat calls) can be serviced without calling into the file system at all once an item is in the cache. Windows has no such cache, and leaves much more up to the file systems. A Win32 path like C:\dir\file gets translated to an NT path like \??\C:\dir\file, where \??\C: is a symlink in Object Manager to a device object like \Device\HarddiskVolume4. Once such a device object is encountered, the entire remainder of the path is just passed to the file system, which is very different to the centralized path parsing that VFS does in Linux.
Performance impacts in Julia have previously been discussed here:
Or there might be a way to pre-filter the files before passing them to isdir? Maybe some files have extensions that you could use to exclude immediately without calling isdir?
I thought WSL2 needs administrator access to install. The speed improvements were also only OK. Other people have to use my code as well so it’s not something I could easily scale.
@barucden, I have considered that as well. I wanted to check the forum for a more elegant solution first though. It will probably be the most practical solution.
It may be the case for someone else but not for me. The network connection is otherwise very fast. There are no noticeable delays when opening a file via Explorer.
The data folders in my case are on a RAID 5 Samba share on a Linux server (only Linux sysadmin in IT), reached via a 1 Gbit/s fiber-optic connection.
In the end I used the workaround suggested by barucden: I limited the isdir checks to paths without extensions. Performance is now nearly identical to Linux. There may be the rare edge case where a folder looks like it has an extension, but our file organization isn’t that bad, so it should never be a problem.
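For anyone finding this later, the workaround looks roughly like the sketch below. It is not my actual function: list_files and its maxdepth keyword are made-up names for illustration, and it relies on the assumption stated above that any name with an extension is a file, so a folder named something like "archive.old" would be misclassified.

```julia
# Hypothetical helper: collect file paths under `root`, descending at most
# `maxdepth` directory levels (0 = only the top-level folder).
# ASSUMPTION: a name with an extension is always a file, never a folder.
function list_files(root::AbstractString; maxdepth::Integer = 0)
    files = String[]
    for name in readdir(root)
        path = joinpath(root, name)
        if !isempty(splitext(name)[2])
            # Has an extension: treat as a file without calling isdir/stat.
            push!(files, path)
        elseif isdir(path)
            # No extension and really a folder: recurse if depth allows.
            if maxdepth > 0
                append!(files, list_files(path; maxdepth = maxdepth - 1))
            end
        else
            # No extension but not a folder (e.g. "README"): keep as a file.
            push!(files, path)
        end
    end
    return files
end

# Small demonstration on a temporary directory tree.
mktempdir() do root
    mkdir(joinpath(root, "sub"))
    touch(joinpath(root, "a.csv"))
    touch(joinpath(root, "sub", "b.csv"))
    @show basename.(list_files(root))
    @show basename.(list_files(root; maxdepth = 1))
end
```

Only names without an extension still pay for an isdir call, which in a folder of a few thousand data files is usually a handful of entries.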
I now feel obligated to suggest ScanDir.jl too. I did not know that package, but it looks like it avoids repeated calls to stat, so it should solve the problem without any compromises.