Designing a Paths Julep

As I think I said earlier, I’m happy with / myself, but since this seems like such a big bikeshed I’m somewhat tempted to push this particular question off into the future and adjust the initial plan to not feature any path joining operator, so it doesn’t block the rest of the discussion/functionality. Adding an operator for this later is fine IMO, so long as joining isn’t a pain in the meantime, and path joining via interpolation (p"$a/$b") is something we want anyway.

10 Likes

S3 paths are their own thing, so while I think not making trailing separator handling part of the general API is a consequence of this (it isn’t currently), I don’t think this actually has any bearing on the implementation for file paths.

I’m familiar with some of these, rsync immediately pops to mind. At this time, I don’t actually think the behaviour of these tools supports allowing an optional trailing slash. If anything, knowing that there never will be a trailing slash in a Path lets you consistently add one when the tool cares, without needing to check for an existing slash, doubling up, or exposing the weirdness (I feel comfortable claiming the way rsync handles trailing slashes is weird and frequently catches people out).
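To make the "consistently add one" point concrete, here is a minimal sketch using plain strings in place of the proposed Path type. The helper name contents_of is made up for illustration, and the sketch assumes the directory is already stored without a trailing separator. For rsync, a trailing slash on the source means "copy the contents of this directory" rather than the directory itself.

```julia
# Hypothetical helper, not part of any proposal: `dir` is assumed to be in
# canonical form, i.e. it never ends with a separator.
contents_of(dir::AbstractString) = dir * "/"

src = "/data/project"

# Building a Cmd does not run anything; it only assembles the arguments.
cmd = `rsync -a $(contents_of(src)) /backup/project`
```

Because the stored form never carries a slash, there is no need to inspect the string before appending one.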

I can’t help but think an S3Path <: AbstractPath is the correct answer here. I’ve deliberately structured the GenericPlainPath code in the prototype to make implementing non-filepath but similar types easier.

Similarly to S3, I think a separate URI subtype would make sense. Like S3, they’ve got other differences too, as Julius notes.

2 Likes

Maybe a discourse admin could split this thread in two - one thread for comments on infix operators for paths, and one for “everything else”? This thread is like 90% about the operator so I’m guessing people who don’t read this specific comment will just keep replying to the other stuff.

(And people can mute one thread if they so desire)

2 Likes

Apologies for replying without reading the whole thread. But I’m worried about the following:

Especially, since the new standard will be p"...".
The simplest would be a common interface to as-close-as-possible replication of the various previous standards i.e. WindowsPath, LinuxPath etc. with the ability to cross-operate across standards.

Again… didn’t read the discussion, but I did add a nice cartoon :person_shrugging:

2 Likes

Here are some random comments (I think I made some of them somewhere before):

  • I think a base path type should be able to store and round-trip any path unnormalized that is considered valid by the platform. So, on Windows, something like C:\foo/bar\something.txt is valid, and I think a path type needs to be able to store that and then be able to reconstruct as that string if I want to. Why is this important? There are lots of scenarios where one is given a path from some other system and where the path essentially functions as a key, and a general path type should IMO be able to support that scenario. Of course, one can then always normalize a path to get it into a nicer form, but I think that should be an explicit operation, not part of the type that stores a path.
  • I think I would try to keep this very simple. For example, just create a type for a filesystem path, not a system that can accommodate any type of path. If down the road someone wants to support an S3 path, they can just create a new type, add some methods to existing base functions and be done. Is there really a need to create a type-hierarchy now that would accommodate potential future types? Maybe there is, but I don’t really see the use case at the moment.
  • Another question where I’m not sure is whether it is actually good to have different concrete types for Windows and Posix paths, or whether it would be better to just have one type with some boolean flag or something like that. There are scenarios where one will want to process paths from both platforms, and it might be nicer to have just one type, so that e.g. arrays can use a concrete element type.
  • I think URIs need to be a completely different thing, they have their own rules and cramming that into one type-hierarchy I think is not going to make things easier.

I’ve been playing with a URI and path type for the language server in this spirit for a while, and right now the path type is really just this: PathsAndURIs.jl/src/types.jl at aa0acecfe1366b429563ba8ebcf1f1b4877576ce · davidanthoff/PathsAndURIs.jl · GitHub. I haven’t started integrating that into the LS, but at least right now I think this might be a good way to handle things. The main benefits are that 1) it allows dispatch on paths, 2) it can faithfully store any system path unaltered, and 3) it has the additional platform flag that we are missing at the moment when we store paths as pure strings.
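A rough sketch of what such a type could look like follows. This is not the code from PathsAndURIs.jl; the names RawPath, Platform and normalized are invented for illustration, and the normalization shown (unifying separators on Windows) is only one possible choice.

```julia
# Platform flag stored as a value, so both kinds of path share one concrete type.
@enum Platform posix windows

struct RawPath
    raw::String          # kept exactly as it was given, never rewritten
    platform::Platform
end

# Normalization is an explicit, separate step that returns a new value.
function normalized(p::RawPath)
    p.platform == windows || return p
    return RawPath(replace(p.raw, '/' => '\\'), windows)
end

p = RawPath("C:\\foo/bar\\something.txt", windows)
q = RawPath("/usr/lib", posix)
mixed = [p, q]           # Vector{RawPath}, a concrete element type
```

The stored string round-trips unchanged, dispatch on RawPath is available, and an array mixing Windows and POSIX paths keeps a concrete element type.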

9 Likes

tldr; There’s nothing functionally wrong with joinpath and a new implementation of paths can add something down the road that forwards to joinpath, so it would be nice to focus on things that need to be decided now.

Using an infix operator comes up every time reworking paths is discussed and it’s not just a persistent vocal minority. joinpath(paths...) only adds 8 letters and 2 parentheses to a large path join, so it’s not like it’s adding a huge burden to writing or reading code. I find it difficult to believe the near universal desire to have an infix operator is due to anything other than people wanting to construct paths the same way they are visually represented. Therefore, if we add an infix operator it should just be Base./. If people are really near conniption at the thought of that, then just don’t add a new infix operator.

2 Likes

Another reason I agree with this is that I can imagine wanting to “string concat” and “path concat” in one go.

For example, if I wanted results/scenario$n/output.csv to contain results of all n scenarios of a simulation, with every scenario in a different folder, I might write code like this:

root = "results"
directory = "scenario"
file = "output.csv"
n_scenarios = 10

for n in 1:n_scenarios
    csv = p"$root/$directory$n/$file"
    write_results(simulation, csv)
end

It’s not immediately obvious to me what would happen if I were to do this instead:

    csv = p"$root$directory$n$file"

I’d want results/scenario1/output.csv and not results/scenario/1/output.csv.

Explicitly requiring / in the string macro makes it less surprising imho.
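For reference, this is what ordinary string interpolation does with the same two patterns. It uses plain strings only, not the proposed p"" macro, so it says nothing about how the prototype behaves; it just shows why the separator-free form is ambiguous to a reader.

```julia
root = "results"
directory = "scenario"
file = "output.csv"
n = 1

# Separators written out: segment boundaries are visible in the literal.
with_seps = "$root/$directory$n/$file"

# No separators: plain strings simply concatenate everything.
without_seps = "$root$directory$n$file"
```

Here with_seps is "results/scenario1/output.csv" and without_seps is "resultsscenario1output.csv".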

3 Likes

For clarity, this is the current behaviour. Doing p"$root$dir$file" where all components are Paths will result in an error.

2 Likes

Hi David, you made most of those comments to me on Slack. I’ve still got similar thoughts to those I replied to you with on Slack, so apologies for the repetition.

I’m still not convinced this should be a supported use case. If you want an arbitrary sequence of characters to be used as a key, a Path type doesn’t sound like the most appropriate choice to me. Beyond dispatch, many of the benefits from having a path type come in the form of extra semantics, and I don’t see nearly as much benefit in a path type without them.

The approach taken in the prototype is just implementing a filepath type + system. It merely occurred to me along the way that in wanting to create Windows/Posix path types, it only took minimal effort to make a more generic/reusable implementation. This PR isn’t in danger of adding an S3 type any time soon, but it does seem nice to me that there’s a sensible supertype that may be used.

I’m not sure what’s best here, but it does seem nice to me that you can write filesystem methods for the current system with ::Path. Otherwise all filesystem-interacting methods need to have an “is this path of the same kind as the system” error-path, and at a glance this is a stronger argument for making the system part of the type information than mixed system/non-system path arrays (unless I’m missing a major use case for mixed arrays?).

2 Likes

I must disclaim that I have only skimmed through the conversation above, but I thought a few comments would be in order since I have worked on this before and I was pinged.

When I was looking into this the main concern was having a type that is extensible enough to support remote file systems. To make matters worse, because of the unfortunate history of the internet and “cloud computing”, “remote filesystems” nowadays often means “S3” which is not actually a file system (i.e. the underlying data structure is merely a dictionary rather than a tree). This presents the dual challenges of dealing with both synchronization of the file system state and defining a minimal file system like data structure for key-value stores like S3. Suffice it to say this was all pretty awful, and I think realistically if something is going to be put in Base, the idea that it will have clearly defined extensions for remote file systems is not realistic. I think the best you can do would be to try to impose as little as possible so that someone who is determined to extend it this way maybe has a chance (more likely it will require a completely new type tree).

The other comment I have about my experience looking into this is that, since the file system is a tree, I think for extensibility it is really important to be able to run generalized tree algorithms on the path types. This may mean maybe not constantly querying the FS state from the OS, although it may also be hard to come up with design constraints without specific use cases. I did some work on AbstractTrees.jl, from which I conclude that generalized tree stuff is pretty hard in Julia not least because of type stability. Clearly if this is in Base there will not be a tree library, but I would think that a good design would be trivial to plug into one.

As for / as a joinpath operator, without having thought about it too carefully I think I am against this. The problem is that the division operation which is also represented by / has radically different semantics from joinpath, which if anything is most like *.

6 Likes

Regarding tree operations I would highly recommend that the ::AbstractPath defines an interface for a “tree mapreduce” for path types that aggregates over existing files and directories. (I settled on this type of design for the tree/graph types in DynamicExpressions.jl after 3 years of continually refining it to get more and more code reuse and less maintenance burden in SymbolicRegression.jl, and it has been extremely useful: Utils · DynamicExpressions.jl. The code is here.)

I find this sort of design really useful because you can generate almost every aggregation/filtering operation you would want from it! And it can be made completely type stable.

This means that given some new data structure (such as S3, say), you would only need to define how a path_mapreduce would work, and then you instantly have access to 95% of the filesystem aggregation operations on ::AbstractPath.

The syntax would be roughly:

path_mapreduce(
    f_file::Function,
    f_dir::Function,
    op::Function,
    root::AbstractPath,
)
  • f_file is applied to each file / leaf
  • f_dir is applied to each directory path on its own (it does not see the children)
  • op is the reduction - it takes two arguments: (1) the result of f_dir and (2) a vector of results over all children (each either the result of f_file or itself the result of op)
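To show that the three callbacks are enough to drive a traversal, here is a minimal sketch that uses plain String paths and Base’s isdir/readdir as stand-ins for the proposed AbstractPath interface. It is an illustration under those assumptions, not the DynamicExpressions.jl implementation, and tree_depth and file_count are names made up here. It does not guard against symlink cycles, since isdir follows symlinks.

```julia
# Sketch only: String paths stand in for the proposed AbstractPath.
function path_mapreduce(f_file, f_dir, op, root::AbstractString)
    # Anything that is not a directory is treated as a leaf.
    isdir(root) || return f_file(root)
    # Reduce the children first so that op receives finished results.
    children = [path_mapreduce(f_file, f_dir, op, joinpath(root, name))
                for name in readdir(root)]
    return op(f_dir(root), children)
end

# Depth of the tree; init=0 keeps empty directories from throwing.
tree_depth(root) = path_mapreduce(_ -> 1, _ -> nothing,
                                  (_, kids) -> 1 + maximum(kids; init=0), root)

# Number of non-directory entries below root.
file_count(root) = path_mapreduce(_ -> 1, _ -> nothing,
                                  (_, kids) -> sum(kids; init=0), root)
```

The init keyword is the one deviation from the shorter forms in this post: without it, maximum and sum throw on a directory with no children.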

For example, to count the max depth of your file system, all you would need is:

function get_depth(root::AbstractPath)
    path_mapreduce(
        f -> 1,
        d -> nothing,
        (_, children) -> 1 + maximum(children),
        root
    )
end

Again, if someone just defines a path_mapreduce for a new AbstractPath type, they would instantly have access to all of these utility functions (which users might use as well, which would let the ecosystem be more generic). If children is a generator, you could even make this a zero-allocation function.

Another example, to count files over a given size:

function num_over_size_n(root::AbstractPath, n::Integer)
    path_mapreduce(
        f -> get_file_size(f) > n ? 1 : 0,
        d -> nothing,
        (_, children) -> sum(children),
        root
    )
end

Or, finding the largest file:

function find_largest_file(root::AbstractPath)
    path_mapreduce(
        f -> (f.name, get_file_size(f)),
        d -> nothing,
        (_, children) -> reduce(
            (x, y) -> x[2] > y[2] ? x : y,  # Find the largest file by comparing sizes
            children,
            init = ("", 0) # Initialize with an empty name and size 0
        ),
        root
    )
end

Or, literally just collecting all path names:

function collect_paths(root::AbstractPath)
    path_mapreduce(
        f -> [f.name],
        d -> [d.name],
        (parent, children) -> vcat(reduce(vcat, children), parent),
        root;
        result_type=String,
    )
    # (It's easy to make a preallocating one too!)
end

Here we have provided the result_type explicitly for type stability (in case of empty directories).

In DynamicExpressions.jl for tree mapreduce I then derive a bunch of utility functions on top of this for simple collection-like aggregations. For example, Base.count for evaluating a user-provided condition over all nodes, Base.sum, Base.hash, Base.filter, and so on. It’s really useful to have a “collection-like” view that imposes a given traversal order. If you don’t want to impose a default, you could have each of these wrap the ::AbstractPath type in the traversal order, like DepthFirstTraversal(::AbstractPath) before passing it to the collection functions in Base (default choice could perhaps be a trait for a path type). I think these would also be super useful for path types, now that I think about it. Imagine operations like

filter(f -> occursin(".txt", f.name), root)
count(f -> get_file_size(f) > 2^20, root)

And these reduced operations would all just work on a given ::AbstractPath, so long as the user has defined the path_mapreduce for their path type, because things like count would already be defined as:

function count(
    f::F, tree::AbstractPath; init=0,
) where {F<:Function}
    return path_mapreduce(
        t -> f(t) ? 1 : 0,
        t -> f(t) ? 1 : 0,
        (parent, children) -> parent + sum(children),
        tree;
        result_type=Int64
    ) + init
end

(Multiple dispatch FTW)
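For comparison, the closest tool in Base today is walkdir, which yields (directory, subdirectories, files) tuples while walking a tree. The sketch below counts files over a size threshold with it, using plain String paths; count_files_over is a name made up for this example. Each new aggregation needs its own loop, which is the repetition a shared path_mapreduce would remove.

```julia
# Count files larger than n bytes using Base's existing directory walker.
function count_files_over(root::AbstractString, n::Integer)
    total = 0
    for (dir, _, files) in walkdir(root)
        for name in files
            # A Bool adds as 0 or 1.
            total += filesize(joinpath(dir, name)) > n
        end
    end
    return total
end
```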

This can also be made compatible with symlinks via another option that defines any special behavior if necessary for encountering a parent twice (with caching if desired). See the tree_mapreduce docs above for how I handle it there for graph-like structures.

10 Likes

(Reposting my Slack comment re relative paths.)

I had an idea, inspired by @ExpandingMan’s decision in FilePaths2.jl to exclude relative paths.

The idea I like is having two types: Path, which is always an absolute path, and PathSegment, which is an offset that can be concatenated to a Path or to a PathSegment, just like Year can be added to a Year or to a Date.

I have two basic reasons.

State. Current working directory is global mutable state; I want to avoid a situation where runtime behavior depends on it. So if p"foo" expands to Path("/home/user/src/proj/foo") at compile time, then cd won’t affect the target as it’s already resolved to an absolute path.

Semantics. Path segments are common in real programs, and are semantically different from absolute paths. Path("/etc/hosts") has a lot of properties that PathSegment("foo") doesn’t have. Calling any filesystem methods on a path segment never makes sense and can only give wrong answers, which is a good indicator there are two different types living here.

Likewise there is often uncertainty over whether "foo/bar" is supposed to be relative to the current working directory or will be concatenated with a base path. For example joinstring("/foo", "/bar") == "/foo/bar" but joinpath("/foo", "/bar") == "/bar" which can be confusing but could be made to fail at compile time, as two absolute paths cannot be concatenated and relative path segments won’t override absolute base paths.

Separating the types simplifies a library author’s job since they know exactly what kind of object they receive from a caller, and likewise communicates to users what kind of object is expected from them.
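A minimal sketch of the two-type idea, assuming POSIX-style separators and storing components as a vector. The render helper and the choice to reject non-absolute strings in the Path constructor are simplifications made for this example only. In Julia the "cannot join two absolute paths" rule shows up as a MethodError when dispatch finds no matching method, rather than as a compile-time failure.

```julia
struct PathSegment
    parts::Vector{String}
end

struct Path                      # always absolute in this sketch
    parts::Vector{String}        # components below the root
end

# Leading and repeated separators are dropped when splitting.
PathSegment(s::AbstractString) =
    PathSegment(String.(split(s, '/'; keepempty=false)))

function Path(s::AbstractString)
    startswith(s, '/') || throw(ArgumentError("not an absolute path: $s"))
    return Path(String.(split(s, '/'; keepempty=false)))
end

# Only these two combinations are defined; Path + Path has no method.
Base.joinpath(a::Path, b::PathSegment) = Path(vcat(a.parts, b.parts))
Base.joinpath(a::PathSegment, b::PathSegment) = PathSegment(vcat(a.parts, b.parts))

render(p::Path) = "/" * join(p.parts, "/")
render(s::PathSegment) = join(s.parts, "/")
```

With this, joinpath(Path("/foo"), PathSegment("bar")) renders as "/foo/bar", while joinpath(Path("/foo"), Path("/bar")) throws instead of silently returning "/bar".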

3 Likes

Thanks for this really interesting comment. I originally thought it wouldn’t be worth adding a children function to the API, but I think that’s all that’s needed for the path mapreduce you describe, and your examples are rather compelling. I’ll play around with this and consider adding children to the proposed API.

2 Likes

Exactly this idea occurred to me (except with AbsolutePath and RelativePath types). The issue I had is that I can’t see a way to have this and platform-specific types without multiple inheritance. To elaborate, the problem is that we then have:

               | <: AbsoluteSystemPath | <: RelativeSystemPath
<: PosixPath   | AbsolutePosixPath     | RelativePosixPath
<: WindowsPath | AbsoluteWindowsPath   | RelativeWindowsPath

Perhaps one could put absolute/relative-ness in a boolean type parameter? That might work but seems a bit awkward to me.

Example (untested) implementation
abstract type SystemPath{isabs} <: PlainPath end

const AbsolutePath = SystemPath{true}
const RelativePath = SystemPath{false}

struct PosixPath{isabs} <: SystemPath{isabs}
    path::GenericPlainPath{PosixPath{isabs}}
end

const AbsolutePosixPath = PosixPath{true}
const RelativePosixPath = PosixPath{false}

struct WindowsPath{isabs} <: SystemPath{isabs}
    path::GenericPlainPath{WindowsPath{isabs}}
end

const AbsoluteWindowsPath = WindowsPath{true}
const RelativeWindowsPath = WindowsPath{false}

My main concern with this design is the amount of potentially unstable path code that would result from it (e.g. parsing a path).
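To illustrate where the instability comes from, here is a cut-down, self-contained variant of the idea in which a String field replaces the prototype’s GenericPlainPath. The functions parse_posix and describe are hypothetical names for this example. The parser’s return type depends on the contents of the string, so inference can only narrow it to a Union of the two parameterizations.

```julia
abstract type SystemPath{isabs} end

const AbsolutePath = SystemPath{true}
const RelativePath = SystemPath{false}

struct PosixPath{isabs} <: SystemPath{isabs}
    path::String   # simplification: the prototype wraps a GenericPlainPath here
end

# The concrete return type is only known once the string has been inspected.
parse_posix(s::AbstractString) =
    startswith(s, '/') ? PosixPath{true}(s) : PosixPath{false}(s)

# Dispatch on absolute/relative-ness works through the aliases.
describe(::AbsolutePath) = "absolute"
describe(::RelativePath) = "relative"
```

A two-member Union like this is often handled reasonably by the compiler, so how much it costs in practice would need measuring.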

2 Likes

I see a significant difference between the PathSegment and RelativePath ideas, namely that PathSegment isn’t a path: it doesn’t support any filesystem API like open, unlink; only Path (absolute path) supports those. There is no relative path type.

Any function that does platform stuff needs to take an absolute path, not a PathSegment. This is good because path resolution should happen as close to the “outside” of the program as possible.

The Path constructor uses the working directory as needed: Path("a") === joinpath(pwd(Path), PathSegment("a")). The p_str macro expands to an absolute path: p"a"::Path === Path("a") === joinpath(pwd(Path), PathSegment("a")).
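A small sketch of the "resolve once, then ignore later cd calls" behaviour, using Base’s abspath and a made-up type name ResolvedPath. Note one difference from the proposal: this resolves when the value is constructed at run time, whereas the p_str macro described above would resolve at macro expansion.

```julia
struct ResolvedPath
    abs::String
    # Inner constructor, so every instance is resolved against pwd()
    # at the moment it is created and no unresolved instance can exist.
    ResolvedPath(s::AbstractString) = new(abspath(s))
end

p = ResolvedPath("foo")

# Changing directory afterwards does not alter what p refers to.
cd(tempdir()) do
    @assert isabspath(p.abs)
end
```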

For parsing, always specify parse(Path, s) or parse(PathSegment, s).

PathSegment("/home") is either illegal or equivalent to PathSegment("home"), so that joinpath(Path("/tmp"), PathSegment("/home")) === Path("/tmp/home"), not Path("/home"). In conventional libraries like os.path and pathlib, joinpath("/tmp", "/home") == "/home". I think this conventional result cannot be attained under my proposal (which might be a good thing, as I’ve seen a number of experienced people surprised by that).


I think of two common categories of relative-path use cases:

(1.a) a relative path is explicitly joined onto an abspath, intended to be a subdirectory of that abspath but (1.b) may override it absolutely or through .., in extremis. This latter case (1.b) is the source of innumerable security vulnerabilities.

(2) a relative path is implicitly joined onto the current working directory. This functionality depends fully on implicit global mutable state, so is probably not “best practice”.

It’s (1.a) – explicitly joining onto an abspath, i.e. the “good” use case of relative paths – that PathSegment is intended to support.
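As a sketch of what enforcing (1.a) and rejecting (1.b) can look like with today’s string-based functions, the hypothetical safe_join below refuses absolute segments and any segment whose .. components would climb above the base. It only inspects the path text, so it does not protect against symlinks that point outside the base.

```julia
function safe_join(base::AbstractString, rel::AbstractString)
    isabspath(rel) && throw(ArgumentError("segment must be relative: $rel"))
    depth = 0
    for part in splitpath(rel)
        part == "." && continue
        # Track how far below the base we are; going negative means escaping it.
        depth += part == ".." ? -1 : 1
        depth < 0 && throw(ArgumentError("segment escapes the base: $rel"))
    end
    return normpath(joinpath(base, rel))
end
```

A PathSegment type would move these checks into the constructor, so that joining itself could never override or escape the base.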

Personally, I don’t see much of an issue with opening a relative path, and I don’t think p"" should bake in the CWD — that’s too much behind the scenes magic for me. It’s likely we have different experiences, but I haven’t seen much of an issue with over-reliance on relative paths that needs to be addressed.

Other than that, I could be brought round to a “partial path type/variant” (whether you call it PathSegment or RelativePath), I just feel the implications need further exploration/testing.

1 Like

Under your model of relative path,

  • which constructors can p_str expand to: absolute only or also relative? What is p"a"?
  • what does joinpath(AbsolutePath("/tmp"), RelativePath("/home")) evaluate to?

Both, p"a" would produce a RelativePath("a").

AbsolutePath("/tmp/home")

So then it would have this?

a = AbsolutePath("/tmp")
b = RelativePath("/home")
@assert resolve(a) === a
@assert resolve(b) === joinpath(pwd(AbsolutePath), b)
@assert joinpath(a, b) === AbsolutePath("/tmp/home")
@assert joinpath(resolve(a), resolve(b)) === 
    AbsolutePath("/home") !== 
    resolve(joinpath(a,b))

I think I’d lean away from RelativePath if it produces that inequality at the end — the added complexity wouldn’t be buying enough, relative to the conventional single path type.

An advantage of PathSegment is that this doesn’t happen: there is no PathSegment("/home"). The single unified type isn’t quite as cool as PathSegment imo but it is conventional and familiar so I’d be fine with that alternative.

Similarly, there is no RelativePath("/home"), only RelativePath("home"), but other than that I think all your asserts would hold, including at the end that AbsolutePath("/home") != AbsolutePath("/tmp/home").