My Problem: Assume I work on a cluster spawning thousands of jobs (in parallel), and each job saves its outcome in some .hdf5 or .jld file. After all jobs have finished, I collect the data into one folder, read the individual variables, and extract them to new files for further processing (all the data is too big to process in memory). However, this data extraction takes a lot of time, and I have the feeling it could be done much more simply. I have seen how you can use JuliaDB to extract data online, but apparently that would also require some preprocessing of the variables.
What I want: I wonder if it’s overkill, but I have the feeling the best solution to this problem would be to set up a running database service, which listens on some port and saves the data in some beautiful (possibly perfectly compressed) database. This would also bring the advantage that one could check whether something has already been calculated and retrieve the data from there.
What do you think? How do you solve your “data storage” problems, and what do you consider to be best practice for big data?
Maybe your cluster is based on the NFS file system? That might be the bottleneck here. You could try to do as few reads/writes of lots of small files on NFS as possible, i.e. combine the data into as few, as large files as possible (rough sketch below). Or do the post-processing on another file system, though I’m not sure how you could do that in your application.
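Just to make the “fewer, larger files” idea concrete, something along these lines could merge all the per-job files into one file in a single pass over NFS (this is only a sketch; the directory layout, file extension, and dataset name `result` are made-up placeholders for whatever your jobs actually write):

```julia
using HDF5

# Merge one variable out of many small per-job files into a single big file,
# so later processing only touches one file on NFS instead of thousands.
function merge_results(indir::AbstractString, outfile::AbstractString; varname = "result")
    files = filter(endswith(".hdf5"), readdir(indir; join = true))  # hypothetical naming scheme
    h5open(outfile, "w") do out
        for (i, f) in enumerate(files)
            data = h5read(f, varname)                 # read the variable of interest
            write(out, "job_$(i)/$(varname)", data)   # one group per job in the merged file
        end
    end
end

merge_results("results/", "merged.h5")
```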
You are right, it is NFS. Limiting the reads/writes is difficult, since the idea is to utilize as many cores in parallel as possible using many jobs, and every job needs to save its results somewhere.
With some effort you could parallelize your code (e.g. with Distributed or MPI) instead of spawning thousands of separate jobs. You could even just use Threads.@threads to run nthreads() jobs on a single node. You could then combine the results into one larger file instead of one file per job (sketch below). Otherwise you could at least limit the reads from NFS by extracting multiple files into one larger one!?
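Roughly what that could look like with Distributed (`run_one_job` and the parameter range are placeholders for your actual computation, and the worker count is just an example):

```julia
using Distributed
addprocs(32)                          # e.g. the cores of one cluster node

@everywhere function run_one_job(p)
    # placeholder for the real per-job computation
    return sum(rand(1000)) * p
end

params  = 1:1000
results = pmap(run_one_job, params)   # many jobs, one Julia session

# write everything into a single file instead of one file per job
using JLD2
jldsave("all_results.jld2"; params = collect(params), results)
```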
That would be great, but if I do not submit them as individual jobs, they will never start in the queue (each job is already parallelized internally and takes a bunch of cores).
The SparkSQL.jl Julia package might be a fit for your use case. SparkSQL.jl helps Julia programmers work with very large datasets using Apache Spark. It supports saving data in the Parquet format, which has many compression options such as Snappy, LZ4, etc.
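I’m not sure this is exactly how you would do it through SparkSQL.jl, but just to illustrate the Parquet side of the idea, here is a sketch using Parquet.jl and DataFrames.jl instead (the table contents are made up, and the compression keyword is my assumption, so check the docs of the version you have):

```julia
using DataFrames, Parquet

# Collect extracted variables into a table and store it once, compressed.
df = DataFrame(job_id = 1:3, energy = rand(3), accepted = [true, false, true])

# compression_codec is an assumption about the keyword name; codecs like
# "SNAPPY" or "GZIP" are typically available
write_parquet("results.parquet", df; compression_codec = "SNAPPY")
```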