Working with a cluster: Data storage & JuliaDB workflow?

Dear folks,

My problem: Assume I work on a cluster, spawning thousands of jobs in parallel, and each job saves its outcome in some .hdf5 or .jld file. After all jobs have finished, I collect the files into one folder, where I read out individual variables and extract them to new files for further processing (the data is too large to process in memory). However, this extraction step takes a lot of time, and I have the feeling it could be done much more simply. I have seen that JuliaDB can be used to extract data online, but apparently that would also require some preprocessing of the variables.
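To make it concrete, the extraction step currently looks roughly like this (a simplified sketch assuming JLD2 files; the variable name "energy" is just a placeholder):

```julia
# Sketch of the extraction step: pull one variable out of every per-job result
# file and write it to a single new file. File and variable names are placeholders.
using JLD2

result_files = filter(endswith(".jld2"), readdir("results"; join=true))

energies = map(result_files) do file
    jldopen(file, "r") do f
        f["energy"]          # read just this one variable, not the whole file
    end
end

jldsave("energies.jld2"; energies)   # one consolidated file for further processing
```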

What I want: I wonder if it is overkill, but I have the feeling the best solution to this problem would be to set up a running database service that listens on some port and saves the data in some beautiful (possibly perfectly compressed) database. This would also have the advantage that one could check whether something has already been calculated and retrieve the data from there.

What do you think? How do you solve your “data storage” problems, and what do you consider best practice for big data?

Thanks,

v.

Maybe your cluster is based on the NFS file system? That might be the bottleneck here. You could try to minimize reads and writes of lots of small files on NFS, i.e. try to combine as much as possible into a few large files. Or do the post-processing on another file system. Not sure how you could do that in your application, though.
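Something along these lines could do the combining, for example (an untested sketch, assuming the per-job results are plain HDF5 files; paths and the group layout are placeholders):

```julia
# Sketch: merge many small per-job HDF5 files into one large file, one group per
# job, so later processing only touches a single file on NFS. Paths are placeholders.
using HDF5

job_files = filter(endswith(".h5"), readdir("results"; join=true))

h5open("combined.h5", "w") do out
    for file in job_files
        h5open(file, "r") do f
            g = create_group(out, splitext(basename(file))[1])  # one group per job
            for name in keys(f)
                g[name] = read(f, name)                          # copy each dataset
            end
        end
    end
end
```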


You are right, it is NFS. Limiting the reads/writes is difficult, since the idea is to utilize as many cores in parallel as possible using many jobs, and every job needs to save its results somewhere.

With some effort you could parallelize your code (e.g. with Distributed or MPI) instead of spawning thousands of jobs. You could even just use Threads.@threads to run nthreads() jobs on one node (see the sketch below). You could then combine several results into one larger file instead of writing one file per job. Otherwise, you could at least limit the reads from NFS by extracting multiple files into one larger one!?
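For example (a rough sketch; run_job stands in for whatever each of your jobs actually computes):

```julia
# Sketch: run several independent "jobs" as threads inside one cluster job and
# write a single combined JLD2 file, instead of one small file per job.
using JLD2

run_job(i) = sum(rand(1000)) * i           # stand-in for the real per-job computation

njobs   = 64
results = Vector{Any}(undef, njobs)

Threads.@threads for i in 1:njobs
    results[i] = run_job(i)                # each thread fills its own slot, no locking needed
end

jldsave("node_results.jld2"; results)      # one file per node instead of 64 small ones
```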

That would be great, but if I do not spawn them individually, they will never start in the queue (they are already parallelized individually and take a bunch of cores).

So you are starting thousands of jobs with each using lots of cores??? Who is funding your computational resources? :sweat_smile:

Hi volkerkarle,

The SparkSQL.jl Julia package might be a fit for your use case. SparkSQL.jl helps Julia programmers work with very large datasets using Apache Spark. It supports saving data to the Parquet format, which has many compression options such as Snappy, LZ4, etc.
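Very roughly, a session could look like this (a sketch from memory, not a tested example; the master URL, paths and table names are placeholders, and the exact SparkSQL.jl calls should be checked against the docs):

```julia
# Unverified sketch: query result files through Spark and save them as compressed Parquet.
# Master URL, paths and table/column names are placeholders.
using SparkSQL

initJVM()                                              # start the JVM backing Spark
sprk = SparkSession("local[*]", "ResultCollector")     # or the URL of a Spark cluster

# Register the raw results (e.g. previously exported as CSV) as a temporary view.
raw = sql(sprk, "SELECT * FROM CSV.`/data/results/*.csv`")
createOrReplaceTempView(raw, "results")

# Write a compressed Parquet copy that later queries can read efficiently.
sql(sprk, """
    CREATE TABLE results_parquet
    USING PARQUET
    OPTIONS (compression='snappy')
    AS SELECT * FROM results
""")

# Pull a (small) query result back into Julia as a DataFrame.
df = toJuliaDF(sql(sprk, "SELECT count(*) AS n FROM results_parquet"))
```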

Information about SparkSQL.jl can be found here:

Tutorials:

Project page:


Thanks, I will have a look!