[ANN] FileTrees.jl -- easy everyday parallelism on trees of files

visr · August 15, 2020, 12:06pm

Very cool! Having read the (excellent) documentation and https://github.com/shashi/FileTrees.jl/issues/9 (@c42f), I’ve been thinking a bit about the roles the various packages play in forming a nice distributed array ecosystem, and how that relates to, for instance, https://dask.org/.

I’m interested in a related use case, namely large chunked array datasets, commonly used in fields like climate science. In dask there is dask.array. There are different packages that offer IO for these chunked datasets, e.g. Zarr.jl, HDF5.jl, NCDatasets.jl, ArchGDAL.jl. In the case of Zarr.jl these chunks can each be separate files, but often the chunks are internal to the files. To help reason about these chunks within AbstractArrays, @fabiangans created DiskArrays.jl, see [ANN] DiskArrays.jl, and we’ve been busy integrating this into the above-mentioned packages.

It would be cool to be able to easily put these data sources into a form where we can use Dagger.jl to apply scheduled operations over chunks, and also keep track of the positions of chunks in the dataset, for operations that cross chunk boundaries. I’m aware there is Dagger.DArray, but I’m not sure of its place or future; it is a bit underdocumented. Is Dagger.jl going towards primarily being a scheduler?
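
To make "keeping track of chunk positions" a bit more concrete, here is a rough sketch in plain Python (standard library only). It is not the Dagger.jl, DiskArrays.jl or dask API; every name in it (`chunk_grid`, `with_halo`, `map_chunks`, `block_sum`) is made up for illustration, and a nested list stands in for an on-disk array. The idea is just that each chunk carries its grid index and its index ranges in the full dataset, so a scheduler can run per-chunk work in parallel and an operation that crosses chunk boundaries can ask for a slightly enlarged region.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunk_grid(shape, chunk_shape):
    """Yield (chunk_index, slices) for every chunk of an array of `shape`."""
    # Number of chunks along each dimension (ceiling division).
    counts = [-(-n // c) for n, c in zip(shape, chunk_shape)]
    for idx in product(*(range(k) for k in counts)):
        slices = tuple(
            slice(i * c, min((i + 1) * c, n))  # edge chunks may be smaller
            for i, c, n in zip(idx, chunk_shape, shape)
        )
        yield idx, slices


def with_halo(slices, shape, halo=1):
    """Grow each slice by `halo` cells, clipped to the array bounds."""
    return tuple(
        slice(max(s.start - halo, 0), min(s.stop + halo, n))
        for s, n in zip(slices, shape)
    )


def read_block(data, slices):
    """Stand-in for chunked IO: read a 2D block from a nested list."""
    rows, cols = slices
    return [row[cols] for row in data[rows]]


def block_sum(data, slices):
    """Example per-chunk operation: sum of all values in the block."""
    return sum(sum(row) for row in read_block(data, slices))


def map_chunks(func, data, shape, chunk_shape):
    """Run `func` on every chunk in a thread pool, keyed by chunk index."""
    grid = list(chunk_grid(shape, chunk_shape))
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda item: func(data, item[1]), grid)
        return {idx: r for (idx, _), r in zip(grid, results)}


# A 4x6 toy dataset holding the values 0..23, split into 2x4 chunks.
shape = (4, 6)
data = [[r * shape[1] + c for c in range(shape[1])] for r in range(shape[0])]
sums = map_chunks(block_sum, data, shape, chunk_shape=(2, 4))
```

Here `sums` maps each chunk's grid index to its partial result, so the partial results can be combined or placed back at the right position. For a stencil-like operation, `with_halo(slices, shape)` gives the region a chunk would need to read from its neighbours.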
