My Problem: Assume I work on a cluster spawning thousands of jobs (in parallel), and each job saves its outcome in some .hdf5 or .jld file. After all jobs have finished, I collect the files into one folder, read the individual variables, and extract them to new files for further processing (the data is too big to fit in memory all at once). However, this data extraction takes a lot of time, and I have the feeling that it could be done much more simply. I have seen how one can use JuliaDB to extract data out of core, but apparently this would also require some preprocessing of the variables.
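For context, my current extraction step looks roughly like the sketch below: loop over the job files and copy one variable into a single output file, one file at a time, so nothing large sits in memory. The directory name `results` and the dataset name `"outcome"` are placeholders, and I'm assuming plain HDF5 files here (for .jld files one would use JLD/JLD2 instead):

```julia
using HDF5  # assumes each job wrote a plain HDF5 file

# Placeholder paths: "results" is the collection folder,
# "outcome" the per-job dataset name.
jobfiles = filter(endswith(".hdf5"), readdir("results"; join=true))

h5open("extracted.h5", "w") do out
    for (i, path) in enumerate(jobfiles)
        # Read one variable from one job file at a time (out of core)
        x = h5open(f -> read(f, "outcome"), path, "r")
        out["job_$i"] = x
    end
end
```

This works, but it is slow across thousands of files, which is what prompts the question below.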
What I want: I wonder if it's overkill, but I have the feeling the best solution to this problem would be to set up a running database service, which listens on some port and saves the data in some beautiful (possibly well-compressed) database. This would also bring the advantage that one could check whether something has already been calculated and retrieve the data from there.
What do you think? How do you solve your "data storage" problems, and what do you consider to be best practice for big data?