[ANN] FileTrees.jl -- easy everyday parallelism on trees of files

Very cool! Reading the (excellent) documentation and https://github.com/shashi/FileTrees.jl/issues/9 (@c42f), I’m thinking a bit about the roles the various packages could play in forming a nice distributed array ecosystem, and how that relates to, for instance, https://dask.org/.

I’m interested in a related use case, namely large chunked array datasets, commonly used in fields like climate science. In dask there is dask.array. There are different packages that offer IO for these chunked datasets, e.g. Zarr.jl, HDF5.jl, NCDatasets.jl, ArchGDAL.jl. In the case of Zarr.jl these chunks can each be separate files, but often the chunks are internal to the files. To help reason about these chunks within AbstractArrays, @fabiangans created DiskArrays.jl (see [ANN] DiskArrays.jl), and we’ve been busy integrating it into the above-mentioned packages.

It would be cool to be able to easily put these data sources into a form where we can use Dagger.jl to apply scheduled operations over chunks, and also keep track of the positions of chunks in the dataset, for operations that cross chunk boundaries. I’m aware there is Dagger.DArray, but I’m not sure of its place or future; it is a bit underdocumented. Is Dagger.jl going towards primarily being a scheduler?
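
To make the chunk bookkeeping concrete, here is a minimal, library-free sketch of what "keeping track of the positions of chunks" could look like. It is written in plain Python only for brevity and is not the API of Dagger.jl, DiskArrays.jl, or dask; the names `chunk_grid` and `chunks_touching` are hypothetical, and it assumes a regular chunk grid with half-open `(start, stop)` ranges.

```python
from itertools import product
from math import ceil


def chunk_grid(shape, chunks):
    """Map each chunk index to the (start, stop) range it covers on every axis.

    Assumes a regular grid; chunks on the trailing edge may be smaller.
    """
    counts = [ceil(n / c) for n, c in zip(shape, chunks)]
    grid = {}
    for idx in product(*(range(k) for k in counts)):
        grid[idx] = tuple(
            (i * c, min((i + 1) * c, n))
            for i, c, n in zip(idx, chunks, shape)
        )
    return grid


def chunks_touching(window, chunks):
    """Return the indices of all chunks overlapped by a window.

    `window` holds one half-open (start, stop) range per axis.
    """
    per_axis = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(window, chunks)
    ]
    return list(product(*per_axis))


# A 10x6 array stored in 4x3 chunks gives a 3x2 grid of chunks.
grid = chunk_grid(shape=(10, 6), chunks=(4, 3))

# A 2x2 window straddling a chunk corner needs four chunks.
touched = chunks_touching(window=((3, 5), (2, 4)), chunks=(4, 3))
```

A scheduler could then run one task per entry of `grid`, and use something like `chunks_touching` to work out which neighbouring chunks an operation needs when its window crosses a chunk boundary.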
