[ANN] FileTrees.jl -- easy everyday parallelism on trees of files

Thanks for the reply,

Yeah, I agree this is somewhat out of scope since it has more to do with the insides of the files, and is perhaps more in line with the dataset stuff @c42f mentioned. I decided to bring it up just because I reckon log files like the ones I have are spit out by many kinds of systems, and FileTrees seems like a very good fit for helping to analyze them.

To add some more context to the problem: the files don’t have a header describing the contents, nor do they have any exploitable internal structure. They are just text files, and one has to parse them line by line to find out which dataframes to create (rough sketch below). The main use case for me is data exploration, so the parsing time far outweighs the time spent on any processing afterwards.
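To make that concrete, here is a minimal sketch of the kind of parsing I mean. The format is an assumption for illustration only (each line starts with a key naming the dataframe it belongs to, followed by comma-separated fields), and `parsefile` is just a placeholder name:

```julia
using DataFrames

# Rough sketch only: assumes each line is "key,field1,field2,..." and that
# all lines sharing a key have the same number of fields.
function parsefile(filename)
    rows = Dict{String, Vector{Vector{String}}}()
    for line in eachline(filename)
        fields = String.(split(line, ','))
        push!(get!(rows, fields[1], Vector{Vector{String}}()), fields[2:end])
    end
    # One dataframe per key; column names are auto-generated placeholders.
    return Dict(key => DataFrame(permutedims(reduce(hcat, vs)), :auto) for (key, vs) in rows)
end
```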

I have experimented with just having a set of possible file lines and initializing each logfile as something like `maketree((name=filename, value=delayed(parsefile)(filename)) => dummyinits)`, where `dummyinits` is just an array of `(name = "somefileline", value=delayed(nothing))`. I have not yet gotten this to work as I get an array out-of-bounds error in `compute` (I will post an issue once I have made sure the error is not on my end). Even if I do get it to work, it seems quite painful to dig through the Thunks to see if there is a `nothing` (or `NoValue`) in there to replace, or whether one should just wrap it in another Thunk when mapping, especially since the value to replace with is produced by the parent.
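For reference, this is roughly the structure I am aiming for, written out as a sketch. `possible_keys` and the file name are placeholders, `parsefile` is the placeholder from above, and I have left the parent value out of the `maketree` call to keep it simple:

```julia
using FileTrees, Dagger  # `delayed` comes from Dagger

filename = "somelog.txt"                      # placeholder
possible_keys = ["eventlog", "measurements"]  # placeholder set of known file lines

# Children start out as dummies; the real values only exist once the
# parent's delayed parsefile call has run.
dummyinits = [(name = key, value = nothing) for key in possible_keys]

t = maketree(filename => dummyinits)
parsed = delayed(parsefile)(filename)  # the thunk whose result should replace the dummies
```

Pushing the result of `parsed` down into the dummy children is exactly the part that gets painful.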

Another somewhat painful option I could think of is to set all the values to the same instance of some mutable lazy struct which parses the whole file the first time one tries to access a key from it, and blocks if one tries to access it while parsing is in progress. If I come up with anything that could be useful I will open an issue about it as well.
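Something along these lines is what I have in mind, as a rough sketch (`LazyLogFile` is a made-up name, and `parsefile` is the same placeholder as above, assumed to return a Dict of key => dataframe):

```julia
mutable struct LazyLogFile
    filename::String
    parsed::Union{Nothing, Dict{String, Any}}
    lock::ReentrantLock
end

LazyLogFile(filename) = LazyLogFile(filename, nothing, ReentrantLock())

function Base.getindex(f::LazyLogFile, key)
    lock(f.lock) do
        if f.parsed === nothing
            # First access parses the whole file; concurrent accesses block
            # on the lock until parsing is done.
            f.parsed = Dict{String, Any}(parsefile(f.filename))
        end
        f.parsed[key]
    end
end

# All dummy children of one logfile would then share the same instance, e.g.
# lazy_value = LazyLogFile("somelog.txt"); lazy_value["eventlog"]
```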

The `inserttree` solution is indeed a bit over-engineered, as it allows the parsing function to decide, based on the contents of the file, whether the result needs to be a new subtree or just a single value (i.e. only a single dataframe was produced).
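To illustrate what I mean, a sketch of that decision (again using the placeholder `parsefile` from above; how the returned subtree would then be grafted into the tree is the `inserttree` part):

```julia
using FileTrees

# Sketch: the parsing function decides whether one file becomes a single
# value or a whole subtree, based on how many dataframes it produced.
function parse_to_node(filename)
    dfs = parsefile(filename)  # placeholder: Dict of key => dataframe
    if length(dfs) == 1
        return first(values(dfs))  # just one value for the file node
    else
        return maketree(basename(filename) => [(name = k, value = df) for (k, df) in dfs])
    end
end
```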