The case for Path as a fundamental type that belongs in Base Julia, and a proposal for implementing it well.
Along with the “Design goals” section makes the scope fairly clear, no?
This is still in the “being written” phase, so specific suggestions to make it clearer would be good.
Discussion of the abstract type is currently just the “High level path interface” subsection.
To make that work nicely, an abstract filesystem interface would probably be nice. This would be highly complementary but distinct from the abstract path interface. It’s not something I plan to investigate at this time, but perhaps you’d like to work on it?
I still don’t get what exactly is the difference between the two of
in the Julia context.
But if it’s clear to everyone else then of course feel free to keep it as is.
Representing anything beyond plain filesystem paths is a nontrivial concept, you may want to consider adding at least an explicit mention.
I’m mostly fine with the current Julia interface, although would prefer non-string-based default path implementation if it existed.
Consider me a user of path interfaces (current or future), not a developer of those
There are a small number of operations one can do on a path not involving the filesystem (e.g. what is the parent directory, base name, extension, etc.), and then a larger pile of operations that involve the filesystem (e.g. touch, mv, cp, etc.). This proposal only involves an interface for the former.
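To make the split concrete, here is a small illustration using only today's string-based functions from Base (no proposed API is assumed). The first group only inspects the path text; the second group needs a real filesystem, so it runs inside a temporary directory.

```julia
# Pure path operations: these only inspect the string and never touch the disk.
p = joinpath("project", "src", "main.jl")
parent_dir = dirname(p)         # equal to joinpath("project", "src")
name       = basename(p)        # "main.jl"
stem, ext  = splitext(name)     # ("main", ".jl")

# Filesystem operations: these create, copy, and move real files.
leftover = mktempdir() do dir
    original = joinpath(dir, "a.txt")
    touch(original)                                      # create an empty file
    cp(original, joinpath(dir, "b.txt"))                 # copy it
    mv(joinpath(dir, "b.txt"), joinpath(dir, "c.txt"))   # rename the copy
    sort(readdir(dir))                                   # ["a.txt", "c.txt"]
end
```

Only the first group is in scope for the interface discussed here.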
I see the problem then with the parent function that can return nothing. One can return an empty object instead. Then construction with joinpath is no longer ambiguous.
I don’t see the benefit from deprecating joinpath, dirname, basename and splitext functions when working with path objects. By default I expect that we can sprinkle some p string macros and incrementally transition by returning Path from pwd, @__DIR__, @__FILE__ and other objects and expect the written code to work without changes.
There’s no reason why joinpath, basename etc. wouldn’t work with Paths (in fact, basename is one of the proposed AbstractPath interface functions). I’m just saying that “the parent of "a.txt" is ""” seems a bit dodgy to me, when really there is no parent. It also means that you have paths where dirname(p) == p, which IMO doesn’t make sense.
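For reference, the behaviour in question can be reproduced with the current string-based functions in Base. This is just a quick check, not part of the proposal:

```julia
# With strings, "there is no parent" is reported as an empty string.
no_parent = dirname("a.txt")            # ""

# Some paths are their own dirname, so repeatedly taking the parent never
# signals that the top has been reached.
empty_is_fixed = dirname("") == ""      # true
root_is_fixed  = dirname("/") == "/"    # true
```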
Regarding an incremental transition, pwd, @__DIR__, and @__FILE__ will need to continue to return String paths until Julia 2.0, as this is a breaking change. This doesn’t stop us from supporting Paths in key functions like open along with other Filesystem functions (touch, mv, cp, etc.), and we can push for the use of Paths over strings with these methods.
Yup, and similarly we could add cwd() -> Path as a pwd() analogue (which IMO is a misnomer, as it doesn’t “print” at all, so it’s just a vestigial shell-ism that reminds me of car and cdr in Lisp-land), and then once the path-based system is well established start @deprecate-ing the String-path methods and encouraging people to use the relevant Path-based ones.
It’s a bunch of code churn, but:
Old string-based code won’t stop working, and this shift can happen very gradually at whatever pace package authors are comfortable with
There’s no easy way to avoid code churn, particularly when it comes to updating methods to accept ::Path arguments
We can make it easier by having good deprecations (NB: I think @deprecate could do with some work; IMO Emacs Lisp does it better ATM and is a notable example, as it’s been highly backwards compatible for decades).
Making the most of multiple dispatch, new Path methods for common functions will let a String → Path transition propagate through a call chain with minimal changes.
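As a rough sketch of that last point: if the leaf functions gain Path methods, an untyped call chain written for strings keeps working. `DispatchPath` below is a hypothetical stand-in for the proposed Path type, and the two `Base` methods are illustrative assumptions, not the prototype's actual definitions.

```julia
# Hypothetical stand-in for the proposed Path type (illustration only).
struct DispatchPath
    str::String
end

# New methods on the "leaf" functions (assumed, not the prototype's real code).
Base.joinpath(p::DispatchPath, parts::AbstractString...) =
    DispatchPath(joinpath(p.str, parts...))
Base.basename(p::DispatchPath) = basename(p.str)

# An existing call chain, written with strings in mind. It is left unchanged.
config_file(dir) = joinpath(dir, "config.toml")
config_name(dir) = basename(config_file(dir))

from_string = config_name("project")                  # "config.toml"
from_path   = config_name(DispatchPath("project"))    # "config.toml"
still_path  = config_file(DispatchPath("project"))    # a DispatchPath, not a String
```

Functions with explicit `::String` annotations would still need a new method or a loosened signature, which is where the code churn comes from.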
There is a school of thought that argues that returning an empty collection is better than returning a null value. If we look at a path as a list of strings, I see that the path fits in this category well.
One of the benefits is that error messages can be made more informative. For instance, consider the error message that is thrown to the user for parent(parent(p"a.txt")). If we returned nothing, the best we could do is report that the method does not exist. However, if an empty path were passed to parent, the error could be made clear without introducing a parent(::Nothing) method.
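A minimal sketch of the two designs, assuming a path is modelled as a list of segments. `ListPath`, `parent_or_nothing`, and `parent_or_empty` are hypothetical names made up for this comparison; neither is the proposal's implementation.

```julia
# Hypothetical model of a path as a list of segments (illustration only).
struct ListPath
    segments::Vector{String}
end

# Design A: "no parent" is signalled by returning nothing.
parent_or_nothing(p::ListPath) =
    length(p.segments) <= 1 ? nothing : ListPath(p.segments[1:end-1])

# Design B: the parent of a single segment is the empty path, and asking for
# the parent of the empty path raises a descriptive error.
function parent_or_empty(p::ListPath)
    isempty(p.segments) && throw(ArgumentError("the empty path has no parent"))
    return ListPath(p.segments[1:end-1])
end

a = ListPath(["a.txt"])

# Design A: the second call fails with a generic MethodError about Nothing.
err_a = try
    parent_or_nothing(parent_or_nothing(a))
catch e
    e
end

# Design B: the second call fails with the message chosen above.
err_b = try
    parent_or_empty(parent_or_empty(a))
catch e
    e
end
```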
Regarding the dirname(p) == p, we should throw an error at dirname for this reason. As I see it, one of the opportunities for introducing the Path type is to make existing code safer and make errors and warnings thrown early when the path is constructed instead of when system operations are executed.
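To illustrate what "errors at construction" could look like, here is a sketch with a hypothetical `CheckedPath` type whose inner constructor validates each segment. The specific rules (no empty segments, no separators, no NUL characters) are assumptions chosen for the example.

```julia
# With plain strings, a questionable path is accepted silently; any problem
# only surfaces later, when the operating system is asked to use it.
late = joinpath("dir", "bad\0name")

# Hypothetical path type that validates segments when it is constructed.
struct CheckedPath
    segments::Vector{String}
    function CheckedPath(segments::Vector{String})
        for seg in segments
            isempty(seg) && throw(ArgumentError("empty path segment"))
            '/' in seg && throw(ArgumentError("segment $(repr(seg)) contains a separator"))
            '\0' in seg && throw(ArgumentError("segment $(repr(seg)) contains a NUL character"))
        end
        return new(segments)
    end
end

ok = CheckedPath(["dir", "name"])

# The bad segment is rejected immediately, before any filesystem call.
early = try
    CheckedPath(["dir", "bad\0name"])
catch e
    e
end
```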
Regarding an incremental transition, pwd, @__DIR__, and @__FILE__ will need to continue to return String paths until Julia 2.0, as this is a breaking change.
If the whole ecosystem continues to work as before when the Path type is returned instead, then I don’t see why we need to consider this a breaking change that requires Julia 2.0. Instead, I see this as a design constraint that can be addressed thanks to the flexibility of multiple dispatch. This also puts one on a realistic footing regarding whether the API is practical and does not require dirty conversions between path types or conversions to string and back.
I would want to see an analysis of each breaking change on how often a given pattern is used and what is a better alternative instead of doing a Platonic design here. For me, the new path system should fill the following design requirements:
We shall have an AbstractPath type that has a well-defined API
Errors are thrown at path construction time rather than later, when methods that use the path are executed
Existing code shall work in as many instances as possible if pwd, @__DIR__, @__FILE__, and joinpath return an ::AbstractPath instead of a String
The p"..." string macro, which I find well thought out in your proposal, for better usability
Another design constraint is performance. It should be possible for users to write high-performance code using the Path type and, to some extent, nudge users to adopt better practices.
In my mind paths are something where we are providing an abstraction in Julia for something that is defined by the OS platforms we run on, not a green-field “what would an ideal path system look like” exercise. In those cases, I think it is crucial that we stick to whatever (maybe horrible) platform design choices the folks that built the OS made; otherwise we run into the classical leaky abstraction problem. If we add a path type to Julia base that only can represent a subset of what is considered a valid path by the OS (or one that normalizes things in a way that is useful sometimes, but not always), well, then folks won’t be able to use that path type in all situations where a path is needed, and we have actually made the situation worse for them relative to not having anything: one now constantly needs to convert between path and string types, potentially store both, etc.
I also think that base should (to the extent possible) not make calls a la “I think use-case A is valid for paths, but use-case B I don’t consider useful, so I’m not going to support it”. I think the right approach here is to try to support anything that one can do with whatever the OS considers a path. Someone will always want to do that!
With this case of “we enforce normalization” I think it is not even close, the case of handling and dealing with non-normalized paths is not a niche situation, there are so many situations where one gets a path (from an API, from a file, whatever) and just wants to preserve it, I think a base path type really needs to support that.
I also actually don’t really understand what features wouldn’t be possible if non-normalized paths can be stored as such? Is there a concrete example of “if we support this, then we can’t provide feature X”?
Finally, I am not saying that a path type that enforces normalization is not super, super useful in certain situations! It obviously is. But it strikes me as a more specialized scenario. Maybe something that could be handled in a NormalizedPaths.jl package, I don’t really think that needs to be in base.
I don’t really think having an error path that checks whether a path is a valid path on the current system is so bad? Something like is_current_platform could be part of the path interface, and then calling that would just be part of the code at the beginning of a function that checks arguments for valid values. That seems a very common pattern?
The upsides to me seem really significant: if we use heterogeneous types, then any code that processes paths from multiple platforms will run into lots of dynamic dispatch problems, and those are really hard to get around. And that again doesn’t strike me as a niche scenario at all…
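A sketch of what this might look like, assuming a single concrete path type that records its platform in a field. `FlavoredPath`, `PathFlavor`, and `require_current_platform` are hypothetical names; `is_current_platform` is the function name floated above, with an assumed definition.

```julia
# Hypothetical: one concrete path type, with the platform stored as data.
@enum PathFlavor PosixFlavor WindowsFlavor

struct FlavoredPath
    flavor::PathFlavor
    str::String
end

current_flavor() = Sys.iswindows() ? WindowsFlavor : PosixFlavor
is_current_platform(p::FlavoredPath) = p.flavor == current_flavor()

# Argument check at the top of a function that needs a usable path.
function require_current_platform(p::FlavoredPath)
    is_current_platform(p) ||
        throw(ArgumentError("$(repr(p.str)) is a $(p.flavor) path and cannot be used on this system"))
    return p
end

# Paths from several platforms share one concrete element type, so code
# iterating over them does not need dynamic dispatch on the path type.
paths = [FlavoredPath(PosixFlavor, "/a/b"), FlavoredPath(WindowsFlavor, "C:\\a\\b")]
native = filter(is_current_platform, paths)
```

The trade-off is the one raised in the replies: the platform check happens at run time, wherever the programmer remembers to call it.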
It sounds like such a path type wouldn’t be much of an abstraction anymore, but mostly a String called Path to have a dispatch type. While I get the argument that it’s annoying to have parts of OS path specs that you can’t represent faithfully with a path type, I personally value robust convenience functions that simplify the most common usage scenarios more. For example, I’m not sure if it’s bad if you can’t represent invalid paths even though they’re a subset of all paths that you can give to an OS to get a response from it. If that response is just always “this path is invalid” then what’s the point. You gave the example of dictionary keys above, but that again treats paths more like holistic strings and less like directions to something in a filesystem, where the directions you take and the target you end up at are more important than how the complete description of these directions looks (if there is more than one way to describe equivalent directions).
I get the unfortunate sense we’re simply coming at this from rather different directions, but to me this use case really just sounds like a string named Path… but I won’t repeat Julius’ points, which I think are put rather well.
The one extra comment I’ll make is that regarding:
This strikes me as an instance of “shotgun coding”, where a fragment of code needs to be injected into all sorts of parts of a codebase and relies on the programmer “just remembering” to always get it right. I don’t like this on two accounts:
It relies on us fallible human programmers never forgetting to manually do the right checks in the right places.
It raises a run-time error and is thus much harder to statically detect, particularly in conditional code.
I don’t think that is correct. I’m not suggesting that we allow any string value to be stored in a Path, all I suggest is that we allow storing of unnormalized paths. In such a design I believe every structural point that was made above, for example that paths are collections of segments etc., still completely holds.
I think this is a false dichotomy, I don’t believe allowing for storage of unnormalized paths would mean that we can’t have robust convenience functions. All the API proposals in the design spec doc and that were discussed here in the thread would still be there.
I think my question in my previous post is still useful: is there an example of something that could not be done in terms of API or something else in the proposal, if the path type would allow storage of unnormalized paths? I don’t see it right now, but maybe I’m missing something.
Just to be clear, I am completely on board that a path type should not be able to represent invalid paths.
And I think if something wants to go into base it needs to accommodate lots of use-cases. Otherwise I think a design is a great candidate for a package.
Even though this is an exaggeration as @davidanthoff points out, it doesn’t really feel fundamentally bad for me. It would already resolve the confusion over whether one passes a path or a string with actual content. Anything else on top – any convenience or validation – is a (useful) extra.
Absolutely, past the brief AbstractPath section of the proposal, most of it is really just working out what’s worth putting on top. The WIP list (seen as subheaders) is:
Avoid representational ambiguities
Make invalid paths unconstructable
Cross-platform path construction syntax
Convenient prefixes
Platform-specific path types
Low overhead when invoking Libuv path methods
SubStrings as the type of path components
Iterable segments
Safe path interpolation
Just to be clear here, leaving normalisation aside, there’s no potential issue around not being able to represent a path on the system with the current design.
I do wonder whether it may be worth making an exception for the empty path though… which reminds me:
The empty path can’t be the neutral element, because the empty path can exist (with the right flags).
There are three key aspects that drive my enthusiasm for storing normalised forms:
Assuming a normalised form simplifies aspects of the implementation
Normalised forms lead to more sensible (IMO) behaviour in a number of respects
I liked Frame’s conceptual framework for paths, which is aided by normalised forms
To elaborate with some examples:
Knowing that there’s only one segment separator character used makes finding segments in a path as easy as looking for the previous/next instance of that character.
If we take to heart the “path as a series of directions” framing, it naturally follows that C:\a/b and C:\a\b should parse to the same path, the same way that the numbers 0.1 and 0.10 do — it’s purely a notational difference.
Even if using paths as say dictionary keys, would we really want /a/b and /a/./b to map to different values? If the answer is “yes”, it really does seem to me as though you’re wanting to interpret the externally-supplied “path” as an arbitrary sequence of bytes (String), rather than a direction to a location on the filesystem (Path).
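The dictionary-key point can be checked with today's strings and Base's `normpath`, which is used here only as a stand-in for whatever normalisation a Path constructor would perform:

```julia
# As raw strings, two spellings of the same location are different keys.
raw = Dict("/a/b" => 1)
raw_hit = haskey(raw, "/a/./b")                         # false

# After normalisation, both spellings map to the same key.
normalised = Dict(normpath("/a/b") => 1)
normalised_hit = haskey(normalised, normpath("/a/./b")) # true
```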
Maybe I am just missing the obvious, in which case an example of a use case where preserving notational differences in Paths is important and you also want to use a Path not a String would help me get on the same page as you.
I also agree that we need to provide a convenient abstraction over paths, not simply emulate its implementation in the OS.
From a broader perspective, one of the worst aspects of the Unix philosophy is that data should not be structured, but instead be simply text streams - the ‘universal interface’. The implementation and abstraction of paths that Unix provides is an embodiment of this terrible philosophy. In Unix, paths are trees that pretend they are byte arrays. It’s also the same kind of thinking that has some C programmers forego string types in favor of byte arrays.
I think modern programmers have mostly converged on the idea that types are a good thing, not because they enable more functionality, but because they guide the programmer towards writing robust code. We should lean heavily into this ideal and try to make a path type that, through its behaviour, signals how best to use it.
That includes ideas like automatic normalization.
However, three counterpoints have been raised that limit the extent to which we can use the type system:
We need to support all existing path behaviours. That includes annoyances like invalid unicode paths, directories named ~, and symlinks
Julia is a high-performance language. That limits the amount of automatic operations we can do on strings (if these operations are slow), and also limits how we can store strings - e.g. we probably shouldn’t store it as an array of individual segments.
The point raised by @kevbonham is an important issue. Lots of the things we want to do with paths are actually string operations. Do we add lots of string operations to paths, or do we ensure that one can easily and efficiently construct one type from the other? We probably need to try both things out and see how convenient they are, in practice.
For the latter point, I’m leaning towards not implementing all the string operations, because
It’s still relatively easy to do Path(replace(String(path), "foo"=>"bar"))
Since Julia currently uses strings as paths, we need to continue supporting string anyway, so if users don’t want to opt-in to more type safety, they can keep using strings.
For users who DO want to opt into type safety, those users probably don’t mind a little back and forth conversion in order to gain better static guarantees (i.e. no sneaky string operations they didn’t expect)
It’s hard to consistently implement all string APIs for paths, so if we try, there will be a long period where it usually works like a string but then just randomly fails because someone forgot to implement that one string operation for paths.
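The round trip from the first point might look like the following, assuming a hypothetical `WrappedPath` type with a `String` conversion. The real Path type's constructor and conversion may differ.

```julia
# Hypothetical minimal path wrapper (illustration only).
struct WrappedPath
    str::String
end

# Assumed conversion back to a plain string.
Base.String(p::WrappedPath) = p.str

path = WrappedPath("/data/foo/file.txt")

# String operations happen on an explicit String, then the result is wrapped again.
renamed = WrappedPath(replace(String(path), "foo" => "bar"))
renamed.str    # "/data/bar/file.txt"
```

The conversion is visible at the call site, which is the point: no string operation is applied to a path without the author opting in.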
Perhaps this is something that doesn’t adequately come across in the current document, but this is very much the way I’ve approached this effort.
I’m thinking that, should we implement functions like withname, withsuffix, etc., it should be similarly convenient to the status quo, but with added safety. E.g.
Since there’s a Path(::Vector{String}) constructor, and AbstractPaths are iterable, for whole-path replacements, you could also do:
Path([replace(seg, "foo" => "bar") for seg in mypath])
and it should just work, with the bonus that "bar" can be an externally provided value: unless you side-step the default constructor, values that have special meaning would raise an error. This helps avoid the security hole created with code like:
function makehidden(path::Path, oldname::String, newname::String) # less-than-useful MWE demo function
    Path([replace(seg, oldname => ".$newname") for seg in path])
end
where some upstream caller of makehidden ends up providing a user-, application-, or otherwise externally provided value as the newname argument, and the value "." is accidentally or maliciously passed. Equivalent code in Julia-of-today would cause a path segment to be replaced with .. and hence change the path beyond what was likely intended/anticipated/safe, but with the current Paths prototype this instead raises a runtime error, and you have to do things a little more awkwardly if you want to allow special runtime-provided values to be injected into a path.
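The difference can be sketched end to end. `SegPath` below is a hypothetical stand-in for the prototype's Path type, with an assumed rule that the default constructor rejects the special segments "." and ".."; the prototype's actual checks may differ.

```julia
# Julia-of-today, string-based: the injected "." silently produces "..".
function makehidden_string(path::String, oldname::String, newname::String)
    join([replace(seg, oldname => ".$newname") for seg in split(path, '/')], '/')
end

escaped = makehidden_string("home/user/notes", "notes", ".")   # "home/user/.."

# Hypothetical stand-in for Path: special segments are rejected by default.
struct SegPath
    segments::Vector{String}
    function SegPath(segments::Vector{String})
        for seg in segments
            seg in (".", "..") &&
                throw(ArgumentError("special segment $(repr(seg)) is not allowed here"))
        end
        return new(segments)
    end
end

# Minimal iteration support so that `for seg in path` works.
Base.iterate(p::SegPath, state...) = iterate(p.segments, state...)
Base.length(p::SegPath) = length(p.segments)
Base.eltype(::Type{SegPath}) = String

makehidden(path::SegPath, oldname::String, newname::String) =
    SegPath([replace(seg, oldname => ".$newname") for seg in path])

hidden = makehidden(SegPath(["home", "user", "notes"]), "notes", "x")

# The same malicious input now fails loudly at construction time.
rejected = try
    makehidden(SegPath(["home", "user", "notes"]), "notes", ".")
catch e
    e
end
```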
There would not seem to be much loss if filenames were represented as strings, as in practice they are not far from the inputs to the joinpath function. So we could use basename(::AbstractPath)::String and keep using the splitext(::String)::Tuple{String,String} function as before, except that one would need to add an explicit basename call in the code. One could add a compatibility method, splitext(::AbstractPath)::Tuple{String,String}, which could give a warning but would still allow existing code to keep working.
It is hard, true, but that does not imply that we should not try. The new Path API can be shadowed on top of the Base exported methods, like using PathLib; @pathify where the @pathify macro can define shadowed methods that will be used instead. This enables running tests on registered packages with the modified API with a single regex replacement in the codebase module definition.
That’s not the Unix philosophy. The Unix philosophy is:
everything is a file
the utility of an operating system/programming environment should come from being able to compose solutions well, instead of from a single monolithic solution
Text being prominent is reasonable though, it’s human readable. The drawback is having to parse repeatedly, but often that’s not an issue. In any case Unix is perfectly compatible with binary data.