[ANN] FileTrees.jl -- easy everyday parallelism on trees of files

That’s really weird, I have no idea what’s going on off the top of my head.

Is there a FileTrees entry in your Manifest.toml file? What does it look like?

Just as a fair warning(?), I don’t think `halve` is tested outside of my packages yet. But if you don’t mind giving this a shot, it would be fantastic!

I think first we should make Dagger a consumer of this API. It’d be amazing to try distributed LazyGroupBy on a bunch of files. I think that’s a solid example to try and implement.

cc @piever

Yes, out-of-core group-by sounds like a nice application. I think it shouldn’t be too hard to implement `reduce` with the Dagger API. I might try this myself at some point.

Hehe, “forceful” is really the right word to describe it :slight_smile:

Skimming through the files first might be an option as well, although it might end up being too slow. I was also a bit worried that reading the same file from multiple workers would make them block each other, but maybe it doesn’t work that way?

I had some issues getting distributed working as well. Is there a chance that CSV works because it is in your shared/default environment?

I could get it to work using the method posted here: Packages and workers

I think that the flags you used should be equivalent but I don’t know enough about it to say for sure.

Thanks @DrChainsaw! You’re right.

Thanks for the answer @shashi. The problem was that I needed to explicitly activate the environment for all workers. Now my script runs, but I still don’t see any parallelization.

Consider the following script, `test.jl`:

@everywhere using Pkg
@everywhere Pkg.activate(".")
@everywhere using Distributed, FileTrees, .Threads

@show nthreads()
@show nprocs()

@everywhere function create_tree()
    t = maketree("test_file_tree" => [])
    for c in 'A':'Z'
        node_file = joinpath(string(c), "nodefile")
        t = touch(t, node_file, value=1)
    end
    t
end

t = create_tree()
FileTrees.save(t) do file
    println("pid : $(myid()), threadid : $(threadid()), $(path(file))")
end |> exec

Running Julia 1.5, FileTrees 0.1.2, with `JULIA_NUM_THREADS=4`, I get

> julia -p 2  test.jl
 Activating environment at `~/testtree/Project.toml`
      From worker 3:	 Activating environment at `~/testtree/Project.toml`
      From worker 2:	 Activating environment at `~/testtree/Project.toml`
nthreads() = 4
nprocs() = 3
pid : 1, threadid : 1, test_file_tree/A/nodefile
pid : 1, threadid : 1, test_file_tree/B/nodefile
pid : 1, threadid : 1, test_file_tree/C/nodefile
pid : 1, threadid : 1, test_file_tree/D/nodefile
pid : 1, threadid : 1, test_file_tree/E/nodefile
pid : 1, threadid : 1, test_file_tree/F/nodefile
pid : 1, threadid : 1, test_file_tree/G/nodefile
pid : 1, threadid : 1, test_file_tree/H/nodefile
pid : 1, threadid : 1, test_file_tree/I/nodefile
pid : 1, threadid : 1, test_file_tree/J/nodefile
pid : 1, threadid : 1, test_file_tree/K/nodefile
pid : 1, threadid : 1, test_file_tree/L/nodefile
pid : 1, threadid : 1, test_file_tree/M/nodefile
pid : 1, threadid : 1, test_file_tree/N/nodefile
pid : 1, threadid : 1, test_file_tree/O/nodefile
pid : 1, threadid : 1, test_file_tree/P/nodefile
pid : 1, threadid : 1, test_file_tree/Q/nodefile
pid : 1, threadid : 1, test_file_tree/R/nodefile
pid : 1, threadid : 1, test_file_tree/S/nodefile
pid : 1, threadid : 1, test_file_tree/T/nodefile
pid : 1, threadid : 1, test_file_tree/U/nodefile
pid : 1, threadid : 1, test_file_tree/V/nodefile
pid : 1, threadid : 1, test_file_tree/W/nodefile
pid : 1, threadid : 1, test_file_tree/X/nodefile
pid : 1, threadid : 1, test_file_tree/Y/nodefile
pid : 1, threadid : 1, test_file_tree/Z/nodefile

Moreover, if I comment out the `value=1` line in the tree constructor, nothing is printed out. The save loop seems to skip nodes without values. Is that intentional?

If these are issues with FileTrees I can create issues there.

Ah. I used `save` and not `load` with `lazy=true`. It works now. Sorry for the noise.
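For the record, a minimal sketch of the pattern that does parallelize (untested here; assumes the tree created by the script above already exists on disk):

```julia
using Distributed, FileTrees

t = FileTree("test_file_tree")   # read back the tree created above

# lazy=true turns each load into a delayed task that the scheduler
# can spread over workers/threads; nothing actually runs yet here
lazy = FileTrees.load(t; lazy=true) do file
    println("pid : $(myid()), $(path(file))")
end

exec(lazy)   # forces the delayed graph; this is where the parallelism happens
```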

Forgive me for saying this. FileTrees looks great. It assumes a POSIX-compatible filesystem (I think). System admins like me always howl when people read and write many small files; I suppose we should really suck it up and start to engineer for it.

Is there any thought in Julia about using S3 object stores?
I guess that would be a completely separate module from FileTrees.jl.

Thank you, @shashi. This is a great and useful tool.

I wrote a not-so-small example, because I have a doubt.
I have been using it for some ML preprocessing in which the files have a certain structure, “test/<category>/<file>”:

    # All files
    all = FileTree(dataset_dir)

    # Apply the function apply to each file
    data = FileTrees.load(all; lazy=true) do file
        # Get name
        fname_str = convert(String, path(file))
        # Apply the model to the image
        apply(fname_str)
    end

    # Recover the categories
    categories = FileTrees.load(all; lazy=true) do file
        # Get name
        category = path(file).segments[end-1]
        category
    end

    for type in ["test", "train"]
        # I love that part, that you can so easily filter train and test files
        sel = GlobMatch("$type/*/*.$ext")
        values = reducevalues(hcat, data[sel]) |> exec
        cats = reducevalues(hcat, categories[sel]) |> exec
        writedlm(..., values)
        writedlm(..., cats)
    end

Is it right? For me it is working well, but I do not know whether the ordering
of categories and data stays consistent when run in parallel. Can it run in parallel without problems?

@tkf nice! I’d be down to pair program on that; it might be much quicker to do it together than on our own.

@johnh
I’m looking for users on all platforms! If you give it a go, let me know.

Right now we actually don’t tie into POSIX, by means of FilePathsBase.jl … I think the idea is that, depending on which platform you’re on, the path will be of a different type. But since I don’t have a good use case to test it on, see https://github.com/shashi/FileTrees.jl/issues/8

I think at least we would need to change how `path` works: https://github.com/shashi/FileTrees.jl/blob/master/src/datastructure.jl#L288

possibly `path(parent(d)) / Path(d.name)` should become `path(parent(d)) / d.name`, and we store the right path type in the root node, as stated in issue #8.

I’m actually not sure if there’s an AbstractPath implementation for S3 that is open source and based on FilePaths.jl cc @RoyiAvital (sorry for pinging :slight_smile: )

@jonalm Thanks! That makes sense! We should definitely document that.

> skip nodes without values. Is that intentional?

That’s right! But I’m open to doing the “right thing”. Please do open an issue where we can discuss.

@dmolina that looks good!

You can also try `name(parent(file))` to get the category instead of reaching into the path segments…

I think `lazy` in the second case, getting the categories, is overkill; I’d just skip it (you don’t need to run that in parallel).
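A sketch of that suggestion (untested; `tree` is a hypothetical stand-in for the `FileTree` in the example above): since the category is just the parent directory’s name, an eager load suffices:

```julia
# Eager (non-lazy) load: extracting the category is cheap, so there is
# no need to build a delayed task graph for it.
categories = FileTrees.load(tree) do file
    name(parent(file))   # the parent directory's name is the category
end
```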

I don’t think reading the same file will block workers. That would require a lockfile :wink:

@shashi Here is a super quick hack implementing the Transducers.jl API (including FLoops.jl) on top of Dagger.jl: https://github.com/JuliaFolds/DaggerFolds.jl. I have no idea if it is practically usable, though.

Also, I realized that implementing this while avoiding boilerplate requires importing a bunch of internals from Transducers.jl. I need to think about a better interface for this.

DaggerFolds looks great!! I see some comments there like `# Not sure how to cancel delayed computation.`

I will take a closer look over the weekend.

Thanks! BTW, if you don’t mind going down the big rabbit hole of cancellation, I recommend checking out https://github.com/JuliaLang/julia/issues/33248 :slight_smile:
