Designing a Paths Julep

Ooof, that’s low!

I know the default soft limit for me is 1024, but the hard limit is ~half a million.

I think this depends on how much work is “held onto”. I don’t think it’s sensible to suggest that every time you think about a file you get a file descriptor and hold onto it. But while actively performing some unit of work it probably makes sense to hold onto a few handles. I don’t think this should be a problem even on systems like those you mentioned?

Quick one-off accesses could still be done with read(handle(path), String) etc. While we could of course make read(path, String) work too, I do think there’s a level of desirable friction to push people to write code that reuses handles whenever it’s sensible to do so.

With work like https://github.com/JuliaLang/julia/pull/45272 (and follow-ups) I wonder if handle could be implemented such that the compiler can perform eager finalisation for calls like this? (i.e. not rely on GC for closing the file descriptor).

I’m not sure if this is a great idea, because fundamentally a resource is a different beast to the path to a resource. We could have something like handle(h::AbstractHandle) -> h though, so that you can accept either and know that by calling handle(h) you end up with a handle. I’m half tempted to consider AbstractResource = Union{AbstractPath, AbstractHandle} but I think at this point we’re overcomplicating things. There’s definitely still room for improvement on the design though…
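
To make the handle(h::AbstractHandle) -> h idea concrete, here is a minimal sketch. All of the Toy* names, the ToyResource union, and countlines_in are hypothetical stand-ins invented for illustration (they are not part of the proposal), and the handle is just a wrapped IOStream:

```julia
# Hypothetical toy types, only for illustration.
abstract type ToyAbstractPath end
abstract type ToyAbstractHandle end

struct ToyFilePath <: ToyAbstractPath
    path::String
end

struct ToyFileHandle <: ToyAbstractHandle
    io::IOStream
end

# Resolving a path opens the file; resolving a handle is a no-op.
handle(p::ToyFilePath) = ToyFileHandle(open(p.path, "r"))
handle(h::ToyAbstractHandle) = h

# The "accept either" union floated above (an assumption, not a settled design).
const ToyResource = Union{ToyAbstractPath, ToyAbstractHandle}

# A function that accepts either and always works on a handle internally.
function countlines_in(target::ToyResource)
    h = handle(target)
    seekstart(h.io)
    n = countlines(h.io)
    # Only close handles that this function opened itself.
    target isa ToyAbstractPath && close(h.io)
    return n
end
```

A caller holding a handle can pass it repeatedly without reopening the file, while a caller with just a path still gets a one-off access.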

Yes, I’ve checked and I’m pretty confident this can be done (for a file descriptor) on all three major OSs.

That sounds reasonable, as long as we don’t expect handles to be a user-level thing. My main concern there was that you’d have to do a lot more input validation :sweat_smile: but if it’s more a matter of what happens after open then yeah, makes sense to me!

So I would expect user-level things like read(any_path) to work, exactly because it’s a one-off: you will presumably never use the handle after that. And the more path types we implement, the weirder it becomes to add complexity if you as a user just want to read an S3 file.

I see the argument for there being some amount of friction to that though. Maybe we could offer a preference that the user can set during development if they want, which warns when a path rather than a handle is used in these operations?

Well, I hope that package and user-written functions that “do something” with a path might take a handle instead of (or as well as) a path, but I also want to avoid making it feel difficult/complicated to deal with. This is one of the aspects of API design that I think needs more thought/attention.

I’m currently thinking of handles as something you have before open. While open lets you operate on the contents of a resource, a filesystem handle lets you operate on the resource itself: renaming it, listing the contents (for a directory), deleting it, etc.

I get this, it’s just (in my mind, right now) part of the grey area of the API: what I want to determine is where friction is useful vs. just annoying.

This touches on part of my hope for this work; a well-designed abstraction can decrease complexity. What I want is for nothing to have to be written specifically for an S3 resource. Instead, code that takes a resource and does stuff should be writable completely independently of any package that provides an S3 path type, Zipfile path type, etc., and still compose seamlessly with it.

This might be the best way forwards, on balance. Perhaps using something like depwarn so we can push packages to write code that accepts + reuses handles, while allowing users to “just use paths directly” without impediment?
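
As a rough sketch of what the depwarn-style nudge could look like: the handle method is the primary one, and the path method works but emits a deprecation-style warning. WarnPath, WarnHandle, and readtext are hypothetical names made up for this example. Base.depwarn is a real (unexported) Base function whose messages are hidden unless Julia is started with --depwarn=yes, which is roughly the "opt in during development" behaviour discussed above.

```julia
# Hypothetical toy types, only for illustration.
struct WarnPath
    path::String
end

struct WarnHandle
    io::IOStream
end

# Preferred entry point: operate on an already-open handle.
function readtext(h::WarnHandle)
    seekstart(h.io)
    return read(h.io, String)
end

# Convenience entry point: works, but nudges callers towards handles.
function readtext(p::WarnPath)
    Base.depwarn(
        "readtext(::WarnPath) opens a fresh handle on every call; " *
        "consider resolving a handle once and reusing it",
        :readtext)
    return open(io -> readtext(WarnHandle(io)), p.path, "r")
end
```

Users who "just use paths directly" see nothing by default, while package authors running their tests with --depwarn=yes (or --depwarn=error) get pointed at the handle-based method.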

Thinking more on this, I’ve come to two conclusions on the matter of separate AbstractPath + AbstractHandle types.

  1. Relying on the right type always being used in the right place is fragile. Users will be annoyed if they can’t provide a path. We can use handle(::AbstractPath or ::AbstractHandle) -> AbstractHandle within a filesystem interface to make it so that packages can write path/handle agnostic functions, but without a common supertype packages will inevitably end up specifying arguments as one or the other. Something like AbstractResource is needed.
  2. The problem with something like AbstractResource as a union is that it doesn’t encode what kind of handle a particular path can be resolved to.

We could have AbstractResolvable{H} (name WIP) with the semantics that handle(::AbstractResolvable{H}) -> H, but implementing that type isn’t easy.

I think F-bounded polymorphism would allow for H<:AbstractHandles that are an AbstractResolvable{H} neatly, but that’s not something that fits with Julia’s type lattice.

Currently, this is the best idea I have:

julia> abstract type _AbstractResolvable{H} end # Private/internal

julia> abstract type AbstractHandle{H} <: _AbstractResolvable{AbstractHandle{H}} end

julia> const AbstractResolvable{H} = _AbstractResolvable{AbstractHandle{H}}
AbstractResolvable (alias for _AbstractResolvable{AbstractHandle{H}} where H)

julia> abstract type AbstractPath{H} <: AbstractResolvable{H} end

julia> struct PretendFileDescriptor <: AbstractHandle{PretendFileDescriptor} end

julia> struct PretendFilePath <: AbstractPath{PretendFileDescriptor} end

julia> supertypes(PretendFileDescriptor)
(PretendFileDescriptor, AbstractHandle{PretendFileDescriptor}, AbstractResolvable{PretendFileDescriptor}, Any)

julia> supertypes(PretendFilePath)
(PretendFilePath, AbstractPath{PretendFileDescriptor}, AbstractResolvable{PretendFileDescriptor}, Any)

In this way both PretendFilePath and PretendFileDescriptor are subtypes of AbstractResolvable{PretendFileDescriptor}.

Because of Julia’s lack of abstract self types/F-bounded polymorphism, the one bit of awkwardness is the need to write struct Foo <: AbstractHandle{Foo} instead of struct Foo <: AbstractHandle working.
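
To illustrate what this hierarchy buys, here is a small sketch that reuses the definitions from the REPL session above and adds a few methods on top. The handle, handletype, and describe functions are assumptions added for illustration only. The point is that a single signature, ::AbstractResolvable{H}, accepts both the path and the handle, and H tells you which handle type you end up with:

```julia
# Type definitions copied from the REPL session above.
abstract type _AbstractResolvable{H} end # Private/internal
abstract type AbstractHandle{H} <: _AbstractResolvable{AbstractHandle{H}} end
const AbstractResolvable{H} = _AbstractResolvable{AbstractHandle{H}}
abstract type AbstractPath{H} <: AbstractResolvable{H} end

struct PretendFileDescriptor <: AbstractHandle{PretendFileDescriptor} end
struct PretendFilePath <: AbstractPath{PretendFileDescriptor} end

# Hypothetical resolution step: handles resolve to themselves,
# and the pretend path resolves to a pretend descriptor.
handle(h::AbstractHandle) = h
handle(::PretendFilePath) = PretendFileDescriptor()

# One method covers paths and handles alike, and can name the handle type H.
handletype(::AbstractResolvable{H}) where {H} = H

describe(r::AbstractResolvable{H}) where {H} =
    string(nameof(typeof(r)), " resolves to ", nameof(H))
```

With this, a package function can be written once against AbstractResolvable{H} instead of being "coloured" as path-only or handle-only.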

This would be awesome!! :glowing_star: Thanks for your efforts!

I’ve put my thinking cap on and have a v3 type system/design around paths and filesystems. It’s a little complex (more than I would like), but it’s the first design that fulfills all of the criteria I’ve come up with on this journey.

For fun, I tossed the design doc + implementation at ChatGPT, and then directed it to do a comparative evaluation. It might just be flattering me (take this with a pinch of salt), but I think the table it came up with is at least of some interest, and a bit encouraging :slight_smile:

| Dimension | Your design | Python (pathlib) | Java (NIO.2) | Rust std | Rust cap-std | Go (io/fs) | POSIX | Plan 9 | Racket | .NET | WASI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Path is a structured value | :locked: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :locked: | :cross_mark: | :cross_mark: |
| Platform-specific path models (POSIX / Windows) | :locked: | :locked: | :cross_mark: | :wrench: | :cross_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :wrench: | :cross_mark: | :cross_mark: |
| Invalid paths unrepresentable / validated | :locked: | :wrench: | :wrench: | :wrench: | :cross_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :locked: | :locked: | :cross_mark: |
| Pure (FS-free) path manipulation | :locked: | :wrench: | :cross_mark: | :wrench: | :cross_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :locked: | :wrench: | :cross_mark: |
| Generic filesystem polymorphism (VFS) | :locked: | :cross_mark: | :wrench: | :cross_mark: | :locked: | :locked: | :cross_mark: | :wrench: | :cross_mark: | :cross_mark: | :locked: |
| Paths are filesystem-relative (non-ambient) | :locked: | :cross_mark: | :wrench: | :cross_mark: | :locked: | :locked: | :cross_mark: | :locked: | :cross_mark: | :cross_mark: | :locked: |
| Handles are authoritative resources | :locked: | :cross_mark: | :wrench: | :wrench: | :locked: | :wrench: | :locked: | :locked: | :cross_mark: | :locked: | :locked: |
| Explicit resolution boundary | :white_check_mark: | :cross_mark: | :cross_mark: | :cross_mark: | :locked: | :cross_mark: | :wrench: | :wrench: | :cross_mark: | :cross_mark: | :locked: |
| TOCTTOU resistance | :white_check_mark: | :cross_mark: | :cross_mark: | :wrench: | :locked: | :cross_mark: | :wrench: | :locked: | :cross_mark: | :wrench: | :locked: |
| Capability-oriented APIs (opt-in) | :wrench: | :cross_mark: | :wrench: | :wrench: | :locked: | :cross_mark: | :wrench: | :locked: | :cross_mark: | :wrench: | :locked: |

Legend: :locked: Guaranteed/ensured :white_check_mark: Good support :wrench: Partial/weak support :cross_mark: No support

I think it’s really starting to come together now. This is the current type hierarchy:

I wish this was neater/simpler, but I don’t see any component that can be removed/combined without compromising the clarity of the semantics, or abandoning the virtual filesystem support. This could be a lot simpler without the VFS and capability (see: cap_rights_limit) support, but this diagram should also show why VFS and capability support can’t be well-integrated as an afterthought later/by some package, and I think these foundations are worth it.

That said, we don’t need to shove all of this in the face of a user/general package author. Just Path and the p"" macro should cover 98% of Julia code (this is a guess, but 95% seems too low, 99% is perhaps a bit high).

dothing(target::Path) = ...

dothing(p"my/file.csv")

That there’s this type hierarchy “under the hood” of Path, of which p"" produces just one of the supported types, becomes relevant when somebody wants to apply dothing to a file within a tarball/stored in S3/etc. and it just works.

There’s one further tweak that I’m tentatively considering (this is an invitation to shove opinions at me), which has been discussed earlier in this thread (Designing a Paths Julep - #22 by MilesCranmer, Designing a Paths Julep - #73 by jar1; CC: @MilesCranmer, @jar1, @ExpandingMan): putting the “is this path relative or absolute?” question into the type domain.

As I mentioned in Designing a Paths Julep - #75 by tecosaur, my main reservation was the combinatorial explosion of subtypes. The alternative is to have a type parameter:

struct AbsoluteKind end
const Absolute = AbsoluteKind()

struct RelativeKind end
const Relative = RelativeKind()

const PathKind = Union{AbsoluteKind, RelativeKind}

and then introduce K<:PathKind as an extra type parameter in AbstractPath.

As I see it, the arguments for this are:

  1. It lets us be more principled with path operations

  2. We can catch operations that will/may silently drop a path with JET, e.g. joinpath(::AbsolutePath, ::AbsolutePath)

    Note in this thread there is one example objection that’s consistently appeared, and that’s joinpath(cwd(), <input>) working with both absolute and relative paths. Since this is trivially abspath(<input>), I’d be very interested to hear if any other counter-examples spring to mind.

  3. It lets us encode more invariants into the type system, such as wanting the handle-relative path in RelativePath to be relative, not absolute

To me, the clear arguments against are:

  1. More complexity (it’s one argument, but it’s a strong one)
  2. Potential type (parameter) instability when e.g. reading a path
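
To make the trade-off above a bit more tangible, here is a toy sketch of a path type carrying K<:PathKind as a type parameter. KindedPath, joinkinded, toabsolute, and parsekinded are hypothetical names invented for this example, and the segment-vector representation is an assumption (the real AbstractPath design is more involved):

```julia
struct AbsoluteKind end
struct RelativeKind end
const PathKind = Union{AbsoluteKind, RelativeKind}

# Hypothetical toy path that records its kind in the type domain.
struct KindedPath{K<:PathKind}
    segments::Vector{String}
end

# Joining is only defined when the right-hand side is relative, and the
# result keeps the kind of the left-hand side. There is deliberately no
# method for (absolute, absolute), so that call is a MethodError that
# static analysis can flag instead of silently dropping the first path.
joinkinded(a::KindedPath{K}, b::KindedPath{RelativeKind}) where {K<:PathKind} =
    KindedPath{K}(vcat(a.segments, b.segments))

# The joinpath(cwd(), <input>) use case, written as an abspath analogue
# that accepts both kinds of input.
toabsolute(::KindedPath{AbsoluteKind}, p::KindedPath{AbsoluteKind}) = p
toabsolute(cwd::KindedPath{AbsoluteKind}, p::KindedPath{RelativeKind}) =
    joinkinded(cwd, p)

# The cost: when reading a path from a string, the kind is only known at
# runtime, so the return type is a Union of the two concrete types.
function parsekinded(s::AbstractString)
    segments = String[String(part) for part in split(s, '/'; keepempty=false)]
    return startswith(s, '/') ? KindedPath{AbsoluteKind}(segments) :
                                KindedPath{RelativeKind}(segments)
end
```

The two-method toabsolute shows how the one recurring objection can be handled by dispatch, while parsekinded shows where the type (parameter) instability enters.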

Thoughts?

I really like the diagram you have made, as it makes many concepts clear. The only thing I miss is a list of the method signatures that subtypes of the abstract types are supposed to implement. Such a reference would make things clearer.

In the diagram it is unclear why some of the separations exist. For instance, why couldn’t AbstractHandle and AbstractFileHandle be unified under a single AbstractHandle{F}, and similarly PlainPath and PlatformPath?

That’s coming, it just won’t easily fit in the diagram :sweat_smile: (AbstractFilesystem’s interface in particular).

This is one aspect of the design that you could chalk up to my sensibilities, rather than being an irreducible separation. The current relationship between Abstract{Resolvable,Path,Handle} is entirely described by

This is it. I really like the purity/simplicity of this part of the design, and so I’m inclined to actually add the filesystem interface/complications one layer under them.

Maybe this is just my ego talking, but I also could see this being reused for non-filesystem path-like and handle-like objects, so long as they fit into the diagram.

Posix/Windows/Local paths are paths for an operating system, while other conceivable path types are not (XPath, URL). That said, of all of the separations here I think this is the weakest. We could plausibly absorb PlatformPath into PlainPath if we don’t think it’s worth making the distinction between paths for an operating system vs. others.

I am still a bit confused about the hierarchy. I am really looking forward to the method signatures associated with each abstract type. It does not need to be comprehensive, just enough to give a sense of what operations are available. From the diagram, I already see that AbstractFilesystem is justified by its concrete subtypes and hence does not need much justification.

Perhaps I could also agree on the necessity of AbstractHandle{F} (or AbstractFileHandle{F} as it is now) as it could act as a layer on top of interaction. Perhaps some sandboxed permission access could be implemented here, with some concrete subtypes or logging of access and modifications.

I guess what confuses me most is the AbstractResolvable supertype. Are there actually methods one can write for AbstractResolvable that include the need to cover handles? Why not make AbstractPath the king instead?

That’s actually (almost) already present, just not by the name you expect: RelativePath, thanks to the O_BENEATH/RESOLVE_BENEATH flag. This makes it straightforward to implement an equivalent of Go’s os.Root.

Yep, so Path is actually defined in terms of AbstractResolvable, and implementing the AbstractFilesystem interface lets you call read(::AbstractResolvable{MyFileHandle}).

The previous revision of the design had AbstractHandle as a subtype of AbstractPath (as I think you suggest) but that ended up muddying the design: handles are simply not Liskov substitutable for paths (to help ground this, a local filesystem handle is a file descriptor (fd) on Linux).

It might help to see some of the (in progress) reasoning around the design I’m writing up:

(I’ve moved on from the HackMD, the document has become a bit big for me to work on in its editor)

I also like examples :slight_smile: I’m currently partway through implementing a POSIX and an in-memory filesystem. These should help you get a better feel for how this works. The AbstractFilesystem API is something that I’m hoping to iron out from this process: there are a few lines that need to be drawn, and I don’t yet have a clear sense of where exactly they should be (this is one of the purposes of the example implementations: to let me play with the design/interface).

Not really. I wonder why AbstractHandle{F} lives on its own. The handle type parameter is already present in AbstractPath, and one can interact with the filesystem through a materialized path like LocalFilePath{LocalFileHandle}.

This is a sound argument. However, I think the issue is that we have become accustomed to read(::AbstractPath), which seems like an ad hoc convenience method for read(handle(::AbstractPath)) with automatic closing of the handle.

I guess AbstractResolvable is comprehensively argued for here, but I still find it unsatisfying. Having package authors write handle-first code seems better, where they could manually add AbstractPath methods at the public API for convenience.

Anyway, thank you for your diligence. It really starts to look desirable :face_savoring_food:

Oh absolutely, and one of the goals is to push package authors to write handle-first code.

I’m just rather worried that without paths and handles falling under a unifying type, we’ll end up with path/handle coloured methods. I fully expect that should this be the case, we’ll see functions written that only accept paths, not handles, making those functions incompatible with other packages that want to pass a handle to them.

The design is very interesting, thanks for working on it.

I know this will be namespaced, but I have the feeling that AbstractHandle is too broad a term here. If you think about handles for managing, e.g., figures, it’s probably not impossible, but quite a stretch, to define the path of such a figure.

Or take handles which are used for accessing arrays. If their path were the underlying memory address of the array, then basically everything would be a path, and I think that’s just not what people mean when they use the term path.

You could argue that you don’t need to use the path to work with these kinds of handles and can just subtype AbstractHandle. While this might be true, it brings even more complexity into the proposal and I don’t think this is currently worth it.

I think what you mean here is an AbstractFileHandle instead of an AbstractHandle. I haven’t yet understood what the difference to the AbstractFileHandle you defined really is, but if they really can’t be combined, you might be looking for an AbstractPathHandle instead of an AbstractHandle.

It’s turned out to be a much deeper and more interesting/tricky problem than I thought it would be. Thanks for engaging with it!

Yea, I’m essentially of two minds here. I feel that Abstract{Resolvable,Path,Handle} is a nice simple abstraction that applies well to this design problem. I also see the appeal of forgetting about the self-contained niceness and just shoving the filesystem-related details into those types, accepting the increased complexity.

Just for fun, I’ll raise that you could see a memory address as a one-step path :wink: but I just mention this as an aside, I don’t think it’s worth digging into these weeds.

I wonder if there are any clear code examples of what this kind of design enables, @tecosaur, compared to a minimal baseline like:

  • Define AbstractPath and Path{AbstractString} <: AbstractPath in Base
  • Add ::AbstractPath methods to all functions that currently take paths as strings
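
For reference, a rough sketch of what that minimal baseline might look like. This is only my reading of the two bullets: BaselineAbstractPath and BaselinePath are hypothetical names, the struct simply wraps a string, and only three functions are forwarded as examples:

```julia
# Hypothetical stand-ins for the baseline AbstractPath / Path types.
abstract type BaselineAbstractPath end

struct BaselinePath{S<:AbstractString} <: BaselineAbstractPath
    str::S
end

# "Add ::AbstractPath methods to all functions that currently take paths as
# strings": each method just forwards to the existing string-based one.
Base.read(p::BaselinePath, ::Type{String}) = read(p.str, String)
Base.isfile(p::BaselinePath) = isfile(p.str)
Base.joinpath(p::BaselinePath, parts::AbstractString...) =
    BaselinePath(joinpath(p.str, parts...))
```

Note that every operation here still goes back through the path string on each call, so there is no handle to hold onto between operations.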

“Paths for data structures” already exist – Accessors.jl :slight_smile:

This is a good question to ask. I plan on adding some examples to the proposal doc, but I need to finish both writing up more of the thought process behind some of the design, and also do more work with the prototype/experimental implementation.

To pick just one example: using an in-memory filesystem to run tests isolated from the host filesystem.

memfs = MemoryFilesystem()
mroot = MemoryPath(memfs, "/")

@test dothings(mroot) == 42
@test read(p"$mroot/output/file", String) == "expected output"
# ...
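
For anyone who wants something they can actually run today, here is a self-contained toy version of the same idea. It is not the proposal’s MemoryFilesystem/MemoryPath: ToyMemoryFilesystem, ToyMemoryPath, child, and this particular dothings are all hypothetical, with files stored in a plain Dict keyed by path string:

```julia
# Hypothetical toy in-memory filesystem: path string => file contents.
struct ToyMemoryFilesystem
    files::Dict{String, Vector{UInt8}}
end
ToyMemoryFilesystem() = ToyMemoryFilesystem(Dict{String, Vector{UInt8}}())

# A path is only meaningful relative to the filesystem it belongs to.
struct ToyMemoryPath
    fs::ToyMemoryFilesystem
    path::String
end

# Append one path component (a stand-in for the p"$mroot/..." interpolation).
child(p::ToyMemoryPath, name::AbstractString) =
    ToyMemoryPath(p.fs, rstrip(p.path, '/') * "/" * name)

function Base.write(p::ToyMemoryPath, data::AbstractString)
    p.fs.files[p.path] = collect(codeunits(data))
    return sizeof(data)
end

Base.read(p::ToyMemoryPath, ::Type{String}) = String(copy(p.fs.files[p.path]))
Base.isfile(p::ToyMemoryPath) = haskey(p.fs.files, p.path)

# Code under test: writes its output under whatever root it is given,
# never touching the host filesystem.
function dothings(root::ToyMemoryPath)
    write(child(child(root, "output"), "file"), "expected output")
    return 42
end
```

A test can then build a fresh ToyMemoryFilesystem, call dothings on its root, and read the result back from the same in-memory store.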

For me at least, it’s less about what is enabled and more about what is prevented. Just Path still leaves the door wide open for TOCTOU problems. That’s solved by making a clear distinction between a path and the resource at that path (put differently, a key and a value in a Dict: just think about Dictionaries.jl and its handle-based API, or if not that then one of the packages that provides a handle API for dicts. There was one at some point :thinking:). The addition of AbstractFilesystem then makes this generic over the kind of storage backing the Path. Abstracting further then leads to AbstractResource, which should be a good fit anywhere a TOCTOU problem can crop up.
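
A small sketch of the check-then-use pattern being described, using only existing Base functions. The names read_checked and read_via_handle are made up for this example, and an open IOStream stands in for a "handle" here, which is narrower than the filesystem handles discussed in this thread:

```julia
# Check-then-use: the path is looked up twice. Between isfile and read the
# file can be removed or swapped, so the check proves nothing about the read.
function read_checked(path::AbstractString)
    isfile(path) || return nothing
    return read(path, String)
end

# Handle-first: the path is resolved exactly once, by open. Everything after
# that operates on the opened resource rather than on the name.
function read_via_handle(path::AbstractString)
    io = try
        open(path, "r")
    catch err
        err isa SystemError || rethrow()
        return nothing
    end
    try
        return read(io, String)
    finally
        close(io)
    end
end
```

Both functions behave the same when nothing changes underneath them; the difference only shows up when another process modifies the filesystem between the two lookups in read_checked.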

Maybe I’m missing something fundamental, but the exact same code is possible with just the minimal AbstractPath + Path{AbstractString} in Base.

  • Base defines these types, and all FS functions accept AbstractPath
  • A package MemoryFiles.jl defines MemoryPath <: AbstractPath
  • Profit

Moreover – this is kinda-already-possible with current Julia, with the caveat that many filesystem functions in Base and packages are restricted to Strings, so a lot of them need to be reimplemented for the custom path type. IME this is the only major complication when defining a custom file/path type.

Perhaps I also didn’t entirely understand your question: I read it as why not have a path type that’s just an AbstractString, but it sounds like you might be more asking why it’s worth having handles as first-class objects?

Sukera’s pointed at one of the main reasons, and there are a few others, like having more of a structural split between filesystem-interacting and pure path operations.