Isdir in Windows 67 times slower than Linux

Jeremy · March 22, 2023, 9:26am

I need to list a large number of files and parse their names. My function works but the performance in Windows is 67 times slower despite running on a more powerful computer. The number of files is unknown and the user can select the recursion depth so I have to separate files from folders. The slow point appears to be with isdir. The files are on a shared network drive and both computers are accessing the same data.

Is there a way to speed this up?

Linux

julia> using BenchmarkTools

julia> @btime isdir.(readdir("/home/user/data/2022/Q4",join=true))
  103.151 ms (17677 allocations: 2.06 MiB)
4416-element BitVector:

Windows

julia> using BenchmarkTools

julia> @btime isdir.(readdir("D:\\data\\2022\\Q4",join=true))
  6.993 s (79501 allocations: 5.36 MiB)
4416-element BitVector:

Linux Computer Info

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 4 on 4 virtual cores

Windows Computer Info

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 7 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = 7

barucden · March 22, 2023, 9:37am

Could you time the stat function on both platforms, please?

using BenchmarkTools
@btime stat("/some/path/to/your/directory")

Jeremy · March 22, 2023, 9:50am

When I profiled my function, the caller that took most of the time was isdir, and the lowest-level function beneath it was indeed stat. The overhead was enormous on Windows but minor on Linux.

Linux

214╎    ╎    ╎    ╎    ╎    ╎ 214  @Base/stat.jl:150; stat(path::String)

Windows

74712╎    ╎    ╎    ╎    ╎    ╎ 74717 @Base\stat.jl:150; stat(path::String)

Linux

julia> @btime stat("/home/user/data/2022/Q4")
  2.318 μs (1 allocation: 224 bytes)
StatStruct for "/home/user/data/2022/Q4"
   size: 0 bytes
 device: 73
  inode: 258212007
   mode: 0o040755 (drwxr-xr-x)
  nlink: 2
    uid: 1000 (user)
    gid: 1000 (user)
   rdev: 0
  blksz: 1048576
 blocks: 0
  mtime: 2023-01-09T14:32:08+0100 (71 days ago)
  ctime: 2023-01-09T14:32:08+0100 (71 days ago)

julia> @btime stat.(readdir("/home/user/data/2022/Q4",join=true))
  99.829 ms (17676 allocations: 2.49 MiB)
4416-element Vector{Base.Filesystem.StatStruct}:

Windows

julia> @btime stat("D:\\data\\2022\\Q4")
  1.187 ms (1 allocation: 192 bytes)
StatStruct for "D:\\data\\2022\\Q4"
   size: 0 bytes
 device: 1866537516
  inode: 258212007
   mode: 0o040666 (drw-rw-rw-)
  nlink: 1
    uid: 0
    gid: 0
   rdev: 0
  blksz: 4096
 blocks: 0
  mtime:  (71 days ago)
  ctime:  (71 days ago)

julia> @btime stat.(readdir("D:\\data\\2022\\Q4",join=true))
  7.018 s (79500 allocations: 5.80 MiB)
4416-element Vector{Base.Filesystem.StatStruct}:
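
For reference, the numbers above work out to roughly 7.018 s / 4416 ≈ 1.59 ms per entry on Windows (close to the 1.187 ms measured for a single stat) versus 99.829 ms / 4416 ≈ 22.6 μs per entry on Linux. A minimal sketch for separating the cost of the directory listing from the cost of the per-entry stat calls is given below; `time_listing` is a hypothetical helper name, and the timings are single-shot rather than a proper benchmark.

```julia
# Hypothetical helper: time the directory listing and the per-entry
# isdir (stat) calls separately, using only Base functions.
function time_listing(dir::AbstractString)
    t0 = time_ns()
    paths = readdir(dir; join = true)   # one directory listing
    t1 = time_ns()
    flags = isdir.(paths)               # one stat call per entry
    t2 = time_ns()

    n = length(paths)
    stat_s = (t2 - t1) / 1e9
    return (n = n,
            ndirs = count(flags),
            readdir_s = (t1 - t0) / 1e9,
            stat_s = stat_s,
            per_entry_ms = n == 0 ? 0.0 : 1000 * stat_s / n)
end
```

Calling `time_listing("D:\\data\\2022\\Q4")` on both machines should show whether `readdir` itself is slow or whether nearly all of the time goes into the per-entry calls.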

rafael.guerra · March 22, 2023, 10:03am

How large?

Jeremy · March 22, 2023, 10:04am

75,331 at the moment

nilshg · March 22, 2023, 10:11am

Somewhat old comment but potentially still relevant:

github.com/Microsoft/WSL

Major performance (I/O?) issue in /mnt/* and in ~ (home)

opened 11:31PM - 11 Aug 16 UTC

Mika56

file system

# A brief description

As a Symfony developer, it's always been hard to get a stable/fast development environment. My current setup is a Ubuntu running under VirtualBox (using vagrant). While page generation is fast, my IDE accesses my PHP files through SMB, which is really (sometimes horribly) slow. I'm now trying to use WSL to improve all of this. However, I'm having a major performance issue when using `/mnt/*` folders. If I set up a Symfony project under `/mnt/c`, it is really slow. If I set it up under `/home/mikael`, it is very fast.

# Expected results

Drives mounted under /mnt should be as fast as any other folder.

# Actual results

With a new Symfony 3.1.3 project, under `/home/mikael` takes between 100ms and 130ms to generate the home page. The same project under `/mnt/c/` takes between 1200ms and 1500ms.

# Your Windows build number

10.0.14393.51

# Steps / commands required to reproduce the error

```
# Install PHP5
$ sudo apt-get install -y php5 php5-json
# Download Symfony installer
$ sudo curl -LsS https://symfony.com/installer -o /usr/local/bin/symfony
$ sudo chmod a+x /usr/local/bin/symfony
# Download Symfony
cd
symfony new symfony_test
# Start Symfony
cd symfony_test
php bin/console server:run
```

Open your browser and go to http://127.0.0.1:8000/. Once the page is loaded, refresh it (on first request, Symfony had to generate its cache). Generation time is displayed on the bottom left

![Image](http://glados.aperture.fr.nf/up/load/2016/08/2016-08-12_01-18-38.png)

You can then do the same under `/mnt/c/`

```
cd /mnt/c/
symfony new symfony_test
cd symfony_test
php bin/console server:run
```

# Additional information

I've added my dev folders as excluded folders in Windows Defender, as well as %LOCALAPPDATA%\lxss. I've tried having my project in `~` and pointing my IDE to %LOCALAPPDATA%\lxss\home\mikael\ but as I've later read, there is no supported way of editing WSL files.
WSL is installed in its default location under C (no strange junction or symlink), which is a healthy SSD. My computer is attached to a domain, if this might have any influence.

From there:

Linux has a top-level directory entry cache that means that certain queries (most notably stat calls) can be serviced without calling into the file system at all once an item is in the cache. Windows has no such cache, and leaves much more up to the file systems. A Win32 path like C:\dir\file gets translated to an NT path like \??\C:\dir\file, where \??\C: is a symlink in Object Manager to a device object like \Device\HarddiskVolume4. Once such a device object is encountered, the entire remainder of the path is just passed to the file system, which is very different to the centralized path parsing that VFS does in Linux.

Performance impacts in Julia have previously been discussed here:

Jeremy · March 22, 2023, 10:18am

Thanks for the info. I guess I’ll just have to live with it. Two minutes is still tolerable. Sadly, I won’t be able to convince IT to switch to Linux.

nilshg · March 22, 2023, 10:19am

WSL2 might be your friend in that case.

barucden · March 22, 2023, 10:22am

Or there might be a way to pre-filter the files before passing them to isdir? Maybe some files have extensions that you could use to exclude immediately without calling isdir?

Jeremy · March 22, 2023, 10:32am

I thought WSL2 needs administrator access to install. The speed improvements were also only OK. Other people have to use my code as well so it’s not something I could easily scale.

@barucden, I have considered that as well. I wanted to check the forum for a more elegant solution first though. It will probably be the most practical solution.

rafael.guerra · March 22, 2023, 10:37am

Check also this package: ScanDir.jl

On my small Windows 11 laptop, it ran 20x faster on a folder with 75,331 dummy files.

NB:
I have benchmarked the command:
isdir.(scandir(path))
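
A minimal sketch of how the ScanDir.jl approach might be used to separate folders from files. It assumes the third-party package is installed and that each entry returned by `scandir` exposes a `path` field (my reading of the package README, not something shown in this thread); `split_entries` is a hypothetical helper name.

```julia
using ScanDir  # third-party package, assumed installed via Pkg.add("ScanDir")

# Hypothetical helper: split one directory level into folders and files.
# The entry type comes from the directory listing itself, so in the common
# case no separate stat call per path should be needed.
function split_entries(dir::AbstractString)
    entries = scandir(dir)
    dirs  = [e.path for e in entries if isdir(e)]    # `path` field is an assumption
    files = [e.path for e in entries if !isdir(e)]
    return dirs, files
end
```

Whether this helps on a Samba share depends on the server returning the entry type with the listing, so it is worth benchmarking on the actual network drive.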

blackeneth · March 22, 2023, 10:07pm

Network drives with Windows are always a problem.

Web search “windows network drives slow” for various tips to speed them up.

I like this blog:

… because he goes over the various solutions that have worked for him over the past 5 years.

Jeremy · March 23, 2023, 7:48am

It may be the case for someone else but not for me. The network connection is otherwise very fast. There are no noticeable delays when opening a file via Explorer.

The data folders in my case are on a RAID 5 Samba share on a Linux server (only Linux sysadmin in IT) via a 1 Gbit/s fiber-optic connection.

mkitti · March 23, 2023, 8:38am

Note that the actual call for Windows probably goes through libuv here:

github.com

JuliaLang/libuv/blob/fa7058b865e3c4a5a9c9ff511ed3e589ce817a85/src/win/fs.c#L1565


      
size_t len;
const WCHAR* fmt;
WCHAR* find_path;
uv_dir_t* dir;

pathw = req->file.pathw;
dir = NULL;
find_path = NULL;

/* Figure out whether path is a file or a directory. */
if (!(GetFileAttributesW(pathw) & FILE_ATTRIBUTE_DIRECTORY)) {
  SET_REQ_UV_ERROR(req, UV_ENOTDIR, ERROR_DIRECTORY);
  goto error;
}

dir = uv__malloc(sizeof(*dir));
if (dir == NULL) {
  SET_REQ_UV_ERROR(req, UV_ENOMEM, ERROR_OUTOFMEMORY);
  goto error;
}

Needs more research but I think Julia is using stat or its equivalent on Windows.

isdir(path) calls isdir(stat(path))

github.com

JuliaLang/julia/blob/489d076452130c718c7d77b157b0d503bfc31602/base/stat.jl#L461


      
    :issetgid,
    :issticky,
    :uperm,
    :gperm,
    :operm,
    :filemode,
    :filesize,
    :mtime,
    :ctime,
]
    @eval ($f)(path...)  = ($f)(stat(path...))
end

islink(path...) = islink(lstat(path...))

# samefile can be used for files and directories: #11145#issuecomment-99511194
function samefile(a::StatStruct, b::StatStruct)
    ispath(a) && ispath(b) && a.device == b.device && a.inode == b.inode
end

"""

isdir(stat) is defined here:

github.com

JuliaLang/julia/blob/489d076452130c718c7d77b157b0d503bfc31602/base/stat.jl#L371


      
true

julia> rm(filename);

julia> isfile(filename)
false
```

See also [`isdir`](@ref) and [`ispath`](@ref).
"""
isfile(st::StatStruct) = filemode(st) & 0xf000 == 0x8000

"""
    islink(path) -> Bool

Return `true` if `path` is a symbolic link, `false` otherwise.
"""
islink(st::StatStruct) = filemode(st) & 0xf000 == 0xa000

"""
    issocket(path) -> Bool
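
Since the path-based predicates each call stat and then forward to the StatStruct methods, asking several questions about the same path costs one stat per question. A minimal sketch of calling stat once and reusing the result; `classify` is a hypothetical helper name, not part of Base.

```julia
# Hypothetical helper: answer several questions about a path with a
# single stat call instead of one stat call per predicate.
function classify(path::AbstractString)
    st = stat(path)   # the only filesystem query in this function
    return (isdir = isdir(st), isfile = isfile(st), size = filesize(st))
end
```

This does not make an individual stat call any cheaper on Windows; it only avoids repeating it when more than one property of the same path is needed.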

stat calls jl_stat

github.com

JuliaLang/julia/blob/489d076452130c718c7d77b157b0d503bfc31602/base/stat.jl#L163


      
        end
        st = StatStruct($(esc(arg)), stat_buf)
        if ispath(st) != (r == 0)
            error("stat returned zero type for a valid path")
        end
        return st
    end
end

stat(fd::OS_HANDLE)         = @stat_call jl_fstat OS_HANDLE fd
stat(path::AbstractString)  = @stat_call jl_stat  Cstring path
lstat(path::AbstractString) = @stat_call jl_lstat Cstring path
if RawFD !== OS_HANDLE
    global stat(fd::RawFD)  = stat(Libc._get_osfhandle(fd))
end
stat(fd::Integer)           = stat(RawFD(fd))

"""
    stat(file)

Return a structure whose fields contain information about the file.

jl_stat calls uv_fs_stat

github.com

JuliaLang/julia/blob/489d076452130c718c7d77b157b0d503bfc31602/src/sys.c#L119


      
// --- stat ---
JL_DLLEXPORT int jl_sizeof_stat(void) { return sizeof(uv_stat_t); }

JL_DLLEXPORT int32_t jl_stat(const char *path, char *statbuf) JL_NOTSAFEPOINT
{
    uv_fs_t req;
    int ret;

    // Ideally one would use the statbuf for the storage in req, but
    // it's not clear that this is possible using libuv
    ret = uv_fs_stat(unused_uv_loop_arg, &req, path, NULL);
    if (ret == 0)
        memcpy(statbuf, req.ptr, sizeof(uv_stat_t));
    uv_fs_req_cleanup(&req);
    return ret;
}

JL_DLLEXPORT int32_t jl_lstat(const char *path, char *statbuf)
{
    uv_fs_t req;
    int ret;

uv_fs_stat calls fs__stat_handle

github.com

JuliaLang/libuv/blob/fa7058b865e3c4a5a9c9ff511ed3e589ce817a85/src/win/fs.c#L3631


      
  err = fs__capture_path(req, path, NULL, cb != NULL);
  if (err) {
    SET_REQ_WIN32_ERROR(req, err);
    return req->result;
  }

  POST;
}


int uv_fs_stat(uv_loop_t* loop, uv_fs_t* req, const char* path, uv_fs_cb cb) {
  int err;

  INIT(UV_FS_STAT);
  err = fs__capture_path(req, path, NULL, cb != NULL);
  if (err) {
    SET_REQ_WIN32_ERROR(req, err);
    return req->result;
  }

  POST;

fs__stat_handle

github.com

JuliaLang/libuv/blob/fa7058b865e3c4a5a9c9ff511ed3e589ce817a85/src/win/fs.c#L1685


      
          
          
void fs__closedir(uv_fs_t* req) {
  uv_dir_t* dir;

  dir = req->ptr;
  FindClose(dir->dir_handle);
  uv__free(req->ptr);
  SET_REQ_RESULT(req, 0);
}

INLINE static int fs__stat_handle(HANDLE handle, uv_stat_t* statbuf,
    int do_lstat) {
  FILE_ALL_INFORMATION file_info;
  FILE_FS_VOLUME_INFORMATION volume_info;
  NTSTATUS nt_status;
  IO_STATUS_BLOCK io_status;

  nt_status = pNtQueryInformationFile(handle,
                                      &io_status,
                                      &file_info,
                                      sizeof file_info,

mkitti · March 23, 2023, 9:21am

Have you considered using walkdir for this? ScanDir.walkdir does sound promising.

Jeremy · March 23, 2023, 11:43am

In the end I did the workaround suggested by barucden. I limited the isdir checks to paths without extensions. Performance is nearly identical with Linux now even though there may be the rare edge case where a folder looks like it has an extension. Our file organization isn’t that bad so it should never be a problem.
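
A minimal sketch of what such a workaround could look like, using only Base functions. The name `list_files` and the `maxdepth` keyword are hypothetical (the actual function is not shown in this thread); `maxdepth = 1` is taken to mean "only the top-level folder". It assumes, as described above, that any name with an extension is a file.

```julia
# Hypothetical helper: collect file paths under `root`, descending at most
# `maxdepth` levels, and calling isdir only for names without an extension.
function list_files(root::AbstractString; maxdepth::Integer = 1)
    files = String[]
    for name in readdir(root)
        path = joinpath(root, name)
        _, ext = splitext(name)
        if !isempty(ext)
            # Assumption: a name with an extension is a file, so skip the stat call.
            push!(files, path)
        elseif isdir(path)
            # Extensionless name that is a folder: recurse if depth allows.
            if maxdepth > 1
                append!(files, list_files(path; maxdepth = maxdepth - 1))
            end
        else
            # Extensionless name that is not a folder: treat it as a file.
            push!(files, path)
        end
    end
    return files
end
```

The trade-off is the edge case already mentioned: a folder whose name contains a dot would be reported as a file and never descended into.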

barucden · March 23, 2023, 12:31pm

I now feel obligated to suggest ScanDir.jl too. I did not know that package, but it looks like it avoids repeated calls to stat, so it should solve the problem without any compromises.
