Designing a Paths Julep

My question is: why have such an involved hierarchy of types (from your diagram), versus just two types (AbstractPath and Path{AbstractString})?
The latter would already enable all the virtual/in-memory/etc filesystems to be implemented in a clean way.
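
To make the two-type alternative concrete, here is a minimal sketch. Everything in it is an assumption for illustration: neither AbstractPath nor Path exists in Base today, Path{AbstractString} is read here as Path{S<:AbstractString}, and MemPath and total_bytes are hypothetical names standing in for a package-defined virtual filesystem and some generic user code.

```julia
# Hypothetical sketch of the "1 abstract + 1 concrete" design; not an existing Base API.
abstract type AbstractPath end

# The single concrete path type: a thin wrapper around a string.
struct Path{S<:AbstractString} <: AbstractPath
    str::S
end

# Filesystem functions keep their current semantics and just forward to the string.
Base.isfile(p::Path) = isfile(p.str)
Base.read(p::Path) = read(p.str)

# A package could add an in-memory filesystem with nothing more than a subtype.
struct MemPath <: AbstractPath
    fs::Dict{String,Vector{UInt8}}   # hypothetical backing store
    key::String
end

Base.isfile(p::MemPath) = haskey(p.fs, p.key)
Base.read(p::MemPath) = copy(p.fs[p.key])

# Generic code only needs to know about AbstractPath.
total_bytes(paths) = sum(p -> length(read(p)), paths)
```

With this, the same generic function works on local and in-memory paths, which is the dispatch benefit being argued for.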

What about handles? They’re not paths.

A large part of your hierarchy is paths, not handles. I guess we can just focus on paths for now: why this hierarchy instead of 1 abstract + 1 concrete path? What specific code examples does it enable?

If you want to leave the entire question of handles aside, sure. Having Windows/Posix/Local paths allows you to reason about paths on a different platform.
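
As a rough illustration of what reasoning about another platform's paths could look like, here is a minimal sketch. The field layouts and the helpers parse_posix, parse_windows and basename_of are assumptions made up for this example; they are not the prototype's actual definitions.

```julia
# Hypothetical sketch: platform-specific path types that behave the same on any host OS.
abstract type AbstractPath end

struct PosixPath <: AbstractPath
    absolute::Bool
    segments::Vector{String}
end

struct WindowsPath <: AbstractPath
    drive::String                # e.g. "C:", or "" when there is none
    segments::Vector{String}
end

# POSIX: only '/' separates segments; a backslash is an ordinary character.
parse_posix(s::AbstractString) =
    PosixPath(startswith(s, "/"), String.(split(s, '/'; keepempty=false)))

# Windows: optional drive prefix, and both '\' and '/' separate segments.
function parse_windows(s::AbstractString)
    m = match(r"^([A-Za-z]:)?(.*)$", s)
    drive = something(m.captures[1], "")
    rest = something(m.captures[2], "")
    return WindowsPath(String(drive), String.(split(rest, ['\\', '/']; keepempty=false)))
end

basename_of(p::Union{PosixPath,WindowsPath}) = isempty(p.segments) ? "" : last(p.segments)
```

The same string gives different answers depending on which platform's rules are applied, regardless of the machine the code runs on: parsed as a Windows path, "C:\\Users\\me\\notes.txt" has the drive "C:" and three segments, while parsed as a POSIX path it is a single relative segment.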

The design you are proposing: is it for Base, or is the scope Base + packages? If the latter, then clearly scoping what is intended to go into Base would be useful…

  • Is it potentially useful to have platform-specific paths for a few platforms (like Windows and POSIX) that can be used independently of the current platform? Yes, there are definitely some scenarios for that. That can be handled by a package, say PlatformSpecificPaths.jl, that defines POSIXPath <: AbstractPath etc.
  • Is it useful to have them in Base, with 6 path types (your diagram) instead of 1 (Path{AbstractString})? Idk, it seems like a vast overcomplication for the by-far-the-most-common use case.

Base, with the faded elements of the diagram indicating what I think packages might implement.

PosixPath and WindowsPath are used in the implementation of LocalFilepath, and I do think that they’re worth exposing.

PlatformPath could potentially be dropped, but I don’t really see any other opportunities (PlainPath is used to simplify the implementation, as it introduces a bunch of simplifying assumptions + generic functions that wouldn’t be appropriate for the abstract type: it’s an implementation detail more than anything else).

My point is that it’s worth having at least a couple of complete code examples that become possible/much easier with the proposed approach vs something truly minimal: Path{AbstractString} with the exact same semantics for all FS functions as currently in Julia, just typed for dispatch.

That’s what PEPs typically do well, they compare to alternative less/more involved designs.

That’s a good comment, and something I’ll bear in mind as I get to the stage where I’m adding examples to the proposal.

For now, you’re getting the in-flight design+proposal+prototype though.

Not to put too blunt a cap on this, but this doesn’t allow for handles at all.

Sounds like a great time to discuss specific motivation for this specific design :slight_smile:

As someone who implemented multiple “virtual filesystems” in Julia over the years, really the only major gripe I noticed is that Base uses strings in many places. This is trivially solvable by introducing AbstractPath with reasonably defined semantics and all Base functions using it.

But I just couldn’t understand the motivation for such an involved hierarchy…

It’s essentially what has emerged after multiple rounds of design trying to support:

  • Paths, and
  • Handles, and
  • Virtual filesystems

All with first-class support and clean separations/abstractions.

That’s why I’m asking if you have any specific code examples already. Fine if not, of course! I wasn’t sure whether this is at a very early stage or almost ready to actually propose for Base.

It is and it isn’t :sweat_smile:. As you can tell from the thread, I’ve been thinking about this for over a year now, but originally left virtual filesystems out of spec.

I’ve been mulling over whether and how it could be worth incorporating VFS support into the design over the past few months in the background, and what you’re seeing here is some work from the last few weeks overhauling the design to accommodate virtual filesystems well.

From here, I expect to update the prototype to fully realise this design, experiment with it, implement a few virtual filesystems, and based on that tweak the design/proposal. Once I’m happy with that, I’ll be encouraging people to try the prototype out, finishing the Julep document, and then seriously proposing it be put into Base.

Maybe I’m not fully understanding things, but it seems like Base already has some of the things that you are proposing. The correspondence might not be exact, but the mapping seems like this:

| Your Name      | Base Name |
|----------------|-----------|
| AbstractHandle | IO        |
| handle         | open      |

To elaborate,

  • The IO abstraction seems similar to the AbstractHandle abstraction. And in fact the IOStream docstring says this:

    A buffered IO stream wrapping an OS file descriptor. Mostly used to represent files returned by open.

  • The open function “resolves” a path to an IO object.

From what I’ve read recently, it sounds like having a vanilla OS file descriptor is not enough to avoid TOCTOU issues. It sounds like you need some extra special logic that most languages don’t have. Is that what you have in mind for AbstractFileHandle? The Wikipedia page makes TOCTOU sound like an open problem…

The key nuance here is that IOStream holds a stream that relates to a file descriptor, it doesn’t hold the file descriptor itself.

The sequence isn’t path → IOStream, it’s path → fd ( → inode) → IOStream. What the new file handle type does is elevate the intermediate “resolved path, but not opened for IO” file descriptor form to a first-class member of the path/filesystem model (Julia currently has Base.Filesystem.File, but it’s largely used as an implementation detail ATM).

You open a handle to get an IOStream.

Holding onto a file descriptor isn’t a silver bullet, but it’s the closest thing to one. Since you referenced the Wikipedia article, there’s a key bit of it I’d like to quote:

In the context of file system TOCTOU race conditions, the fundamental challenge is ensuring that the file system cannot be changed between two system calls. In 2004, an impossibility result was published, showing that there was no portable, deterministic technique for avoiding TOCTOU race conditions when using the Unix access and open filesystem calls.[11]

I’m guessing this is one of the bits you were thinking of when you said “The wikipedia page makes TOCTOU sound like an open problem”. I’d like to draw your attention to the next sentence from this section (Preventing TOCTOU):

Since this impossibility result, libraries for tracking file descriptors and ensuring correctness have been proposed by researchers.[12]

That’s what this design does: it uses file descriptors to avoid re-resolving the path string, and so prevents atime-based attacks. See (Reliably timing TOCTOU):

Exploiting a TOCTOU race condition requires precise timing to ensure that the attacker’s operations interleave properly with the victim’s.

A handle (file descriptor) based approach renders the attack described in the Wikipedia page’s example impossible, because the attacker is no longer able to sneakily swap out the resource used by the program, for two reasons:

  1. Since the path isn’t re-resolved, the attacker isn’t able to time the swap correctly
  2. Since we hold onto the resource handle, even if it’s renamed/swapped out, the write will still go to the same resource that was checked in the first step

Each of these is independently sufficient to prevent the example attack.
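
The difference between the two styles can be sketched with what Base offers today. This is only an approximation of the proposal: it uses an already-open IOStream as a stand-in for a handle, and read_if_small_racy and read_if_small are hypothetical names invented for the example.

```julia
# Racy: the path string is resolved twice, once for the check and once for the use.
# The file behind `path` can be swapped between the two calls.
function read_if_small_racy(path::AbstractString, limit::Integer)
    filesize(path) <= limit || return nothing   # resolution #1 (check)
    return read(path, String)                   # resolution #2 (use)
end

# Descriptor-based: the path is resolved once, and both the check and the use
# go through the same open file.
function read_if_small(path::AbstractString, limit::Integer)
    open(path) do io
        st = stat(fd(io))                       # metadata of the already-open file
        st.size <= limit || return nothing
        return read(io, String)
    end
end
```

In the second version a rename or symlink swap after the open no longer changes which resource is checked and read.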

I think one crucial detail that may be missed here is that not just open(::AbstractPath) can give you a handle (as I understand it, that could continue to open the file right away), but also things like isfile! This is ultimately what prevents the TOCTOU problem, because once you have checked that the file exists, you already have something that ensures the file continues to exist (and points to the same thing you intended to use) until you actually open it for reading/writing the data within.

As far as I understand, POSIX file descriptors always refer to open files. However, from your proposal it’s not clear if handle(::AbstractPath) is supposed to open a file (and create an OS file descriptor). In POSIX the way to get a file descriptor is to use the open function on a path. If handle is intended to open a file, then open does seem like a better name. Or perhaps a new name like fdopen would work, since we already have an open function.

It’s not as formal as your proposal, but it looks like we do have some of the elements in Base already:

  • RawFD: a public primitive type that represents an OS file descriptor.
  • fd: get a file descriptor from a stream.
  • fdio: create an IOStream from a file descriptor.
  • stat has a RawFD method (and also methods for Integer and IOStream).

So, we can do the following now:

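open(path, "r+") do io       # "r+" opens for reading and writing; the default mode is read-only
    f = fd(io)               # OS file descriptor behind the stream
    print(stat(f).mtime)     # stat the descriptor instead of re-resolving the path
    write(io, "hello world")
end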

Regarding AbstractResolvable: I don’t think we need it. I think the documentation should make it very clear that developers should implement new file-based methods in terms of AbstractFileHandle where appropriate. It should be up to users to handle path resolution failures. The fact that you need to use the following puzzling definition of AbstractResolvable indicates that it’s probably not the best idea:

abstract type _AbstractResolvable{H} end # Internal implementation detail

abstract type AbstractHandle{H} <: _AbstractResolvable{AbstractHandle{H}} end

const AbstractResolvable{H <: AbstractHandle} = _AbstractResolvable{AbstractHandle{H}}

abstract type AbstractPath{H} <: AbstractResolvable{H} end

Yep, on POSIX systems I do indeed mean an FD from the open syscall, but opened with O_PATH (i.e. not for reading or writing, and applies to directories as well as files).

We do! I’ve used all of these myself. The major work here is around the arrangement of features/capabilities, and formalising how the concepts of paths/files/VFSs should be defined and interact.

I see why you’d say that, but that leaves us without a common supertype for path-like things, which I see as inevitably leading to path/handle-colouring of functions, which is a distinctly undesirable outcome in my mind.

Agreed, though this is separate from AbstractResolvable.

But I don’t think a handle is a path-like thing. :slight_smile: I think it’s better to keep the two concepts separate rather than try to unify them under the idea of “anything that can be resolved”. A handle has already been resolved. In my opinion, trying to resolve a handle should be a method error.

That’s why the two (paths and handles) have separate supertypes, and why I’ve gone with handle for “give me a handle-form of the argument” rather than resolve :wink:
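
A minimal sketch of that arrangement, under heavy simplification: the Sketch* types below are hypothetical, non-parametric stand-ins for the proposal's hierarchy, an IOStream stands in for a real handle, and nbytes is a made-up example function.

```julia
# Hypothetical, simplified stand-ins for the proposed types; not the prototype's definitions.
abstract type SketchResolvable end
abstract type SketchAbstractPath   <: SketchResolvable end
abstract type SketchAbstractHandle <: SketchResolvable end

struct SketchPath <: SketchAbstractPath
    str::String
end

struct SketchHandle <: SketchAbstractHandle
    io::IOStream                      # stand-in for a resolved, open resource
end

# `handle` means "give me a handle-form of the argument":
handle(p::SketchPath) = SketchHandle(open(p.str))   # resolving a path opens it
handle(h::SketchHandle) = h                         # a handle is returned unchanged

# One method covers both paths and handles, so functions are not "coloured".
function nbytes(x::SketchResolvable)
    h = handle(x)
    try
        return stat(fd(h.io)).size
    finally
        x isa SketchAbstractHandle || close(h.io)   # only close what this call opened
    end
end
```

Here nbytes accepts either form, while paths and handles still dispatch separately wherever a method needs to distinguish them.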