Overview
I intend to process simulation data generated externally with Julia and would like some opinions on storage solutions for 1) the raw data, 2) the processed data, and 3) the processing inputs.
Data Description
There will be many similar projects each with several sets of simulation data. This simulation data will just be simple nodal values that can be stored as a 2D array [node_numbers,nodal_data]. The processed data will also likely be simple and 2D. The processing inputs will consist of constants and small vectors.
Considerations
There are other users who will need access to the data. They are used to working in Excel. I am thinking that storing all three data types as .csv files and organizing the projects with the file system would have the smallest learning curve. However, I am not sure that it is the best long-term solution. I have no experience with databases, but I understand that they can be viewed similarly to Excel and can perform some of the project organization all in one place. Moreover, I keep changing my mind on whether I should store processing inputs as a text file, spreadsheet, or just keep them at the top of the Julia code.
How do you store your data and how do you think I should store mine?
The HDF5 format (via HDF5.jl) is also an option if you want hierarchical organization and compression. It is well suited to scientific data, and it can be written and read from many other languages.
XLSX.jl lets you read, create, or modify .xlsx files. I’m not suggesting that as your primary means of storing things, but if you choose something other than .csv, it gives you an option for interacting with coworkers who can’t/won’t use Julia.
I’d been doing ok with JLD2, but the latest README tells me that could be a bad idea. That README says the older JLD is a possibility. I’m not storing anything terribly large and most of the data are arrays of floating point numbers.
I want “load” and “store” and jld2 was fine for that. HDF5 seems harder to use. Is that right?
Piggybacking off @ctkelley, I’d also like a format that is simple (unless it provides superior organization to justify complexity). Ideally something that non-coders could open and read. HDF5 seems like it requires special software to look at the data, and JSON data seems to be buried in formatting. I’m not sure if either would work well for my (co-workers’) needs.
For unstructured data, YAML (YAML.jl) is an alternative to JSON. The concepts are very similar, but I find YAML a bit easier to read and write as a human.
When sharing tabular data with other users, I usually use csv. It is simple and can be read by practically all programming languages, Excel and humans directly.
See also discussion here. I just recently came across that thread after having some problems with JLD2 and finding out that the package is “officially abandoned” now. Of the solutions discussed there, I opted for BSON.
However, if the data are to be directly readable by Excel, .csv or the native Excel format appear to be the only options. Otherwise you could consider writing a simple converter from BSON (or similar) to .csv.
My data are indeed solutions on unstructured meshes. I was hoping for one format to rule them all to avoid duplicate data. However, this discussion leads me to believe I may need to select a coding friendly format for me and my calculations and then write out the most important data back to .csv at the end for sharing. There are way more options than I realized.
If you are storing tabular data (using DataFrames.jl) just for yourself / other users of your code, https://github.com/xiaodaigh/JDF.jl is a great choice - very fast and small file-sizes.
However, it can only be read with Julia.
I usually store raw data as plain text: comma/whitespace-separated values or Esri ASCII grid files (.asc). They are easy to read in Python, Matlab or Excel for my colleagues. When writing files to disk has a significant impact on execution time, I prefer storing raw data as binary and saving an auxiliary text file describing its structure (I have only done this in Fortran).
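The binary-plus-description approach mentioned above could look roughly like this in Julia. This is only a sketch using Base I/O: the helper names `write_raw` and `read_raw`, the `.bin`/`.txt` file naming, and the keys in the description file are all made up for illustration, and it assumes the data is a `Float64` matrix written and read on machines with the same byte order.

```julia
# Sketch: raw Float64 matrix in a binary file plus a small text file
# describing its layout. All names and keys here are hypothetical.

function write_raw(stem::AbstractString, A::Matrix{Float64})
    open(stem * ".bin", "w") do io
        write(io, A)                      # raw bytes, column-major, native byte order
    end
    open(stem * ".txt", "w") do io
        println(io, "eltype Float64")
        println(io, "rows ", size(A, 1))
        println(io, "cols ", size(A, 2))
        println(io, "order column-major")
    end
    return nothing
end

function read_raw(stem::AbstractString)
    # Read the "key value" lines of the description file.
    meta = Dict{String,String}()
    for line in eachline(stem * ".txt")
        key, value = split(line; limit = 2)
        meta[key] = value
    end
    meta["eltype"] == "Float64" || error("unsupported element type")
    A = Matrix{Float64}(undef, parse(Int, meta["rows"]), parse(Int, meta["cols"]))
    open(stem * ".bin", "r") do io
        read!(io, A)                      # fill the preallocated matrix
    end
    return A
end

dir = mktempdir()
A = rand(5, 3)
write_raw(joinpath(dir, "run01"), A)
B = read_raw(joinpath(dir, "run01"))
```

The text file keeps the binary file interpretable by colleagues using other tools, since it records the element type, the dimensions and the storage order.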
For processed data I use the same formats as the raw data.
I use plain text files for inputs. On each line I write the name of the parameter, separated from one or more values by whitespace:
velocity 1.2
depth 10.6
tau 0.8,1.1,1.3,1.5
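For what it's worth, a file laid out like the example above can be read in Julia with a few lines of Base code. The sketch below is an assumption about how one might do it, not the code I actually use: `read_params` is a hypothetical name, single values are stored as `Float64`, comma-separated lists as `Vector{Float64}`, and a line with a name but no value would raise an error.

```julia
# Sketch: parse "name value[,value...]" lines into a Dict.
# Scalars become Float64; comma-separated lists become Vector{Float64}.

function read_params(io::IO)
    params = Dict{String,Union{Float64,Vector{Float64}}}()
    for line in eachline(io)
        line = strip(line)
        isempty(line) && continue            # skip blank lines
        name, raw = split(line; limit = 2)   # name, then everything after the whitespace
        vals = parse.(Float64, split(raw, ','))
        params[name] = length(vals) == 1 ? vals[1] : vals
    end
    return params
end

# Convenience method for reading from a file path.
read_params(path::AbstractString) = open(read_params, path)

text = """
velocity 1.2
depth 10.6
tau 0.8,1.1,1.3,1.5
"""
params = read_params(IOBuffer(text))
```

After this, `params["tau"]` is a vector and `params["velocity"]` is a scalar, so the processing code can stay free of hard-coded constants.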
I suggest storing your data in text files unless writing to disk has a significant impact on the total execution time of your simulations. Text files are always easier for other users to read in a variety of applications and programming languages.
That depends on your use case — whether you are familiar with either, the kind and the size of data you are storing, how you want to access it, whether you need SQL-specific features like ACID.
This, again, depends on your use case. E.g., a text-based format like CSV or JSON may be accessible to other tools/languages. OTOH, Julia expressions in code for tiny but complex data would work fine.
These days I usually work with HDF5 and JSON and find both pretty nice. Generally I avoid formats that serialize/deserialize Julia objects as is, and invest a bit in keeping things language-agnostic (one would do this anyway for CSV, it is less work with the formats above). HDF5 is better for replacing parts of the data.
Is it fair to say that moving from left to right increases features but decreases portability?
I think that’s right.
Again, as @Tamas_Papp wrote, preference for one format over another depends on your use case.
For example, my colleagues are civil engineers and they are more comfortable working with CSV and spreadsheets than more complex and robust data-interchange formats. That’s why I use plain text files for most of my work.
I am not sure about this. These days having support for JSON is as universal as CSV in commonly used languages.
I think that you are still trying to solve this problem in the abstract, but the optimal solution will depend on your use case (data size and type, languages you and your collaborators use, various trade-offs with performance, storage space, etc.). We know very little about your use case, so it is hard to give more specific advice.
I would recommend that you stop dwelling on this, and write a few simple functions for reading/writing your data using whatever format you want to try first, then you can experiment and change this easily later on.
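To make that concrete, here is a minimal sketch of such a pair of functions. It assumes the data is a numeric matrix and starts with comma-separated text via the DelimitedFiles standard library. The names `save_nodal` and `load_nodal` and the example file name are hypothetical.

```julia
using DelimitedFiles  # Julia standard library

# Hypothetical helpers: only these two functions know the on-disk format,
# so switching formats later means changing just their bodies.
save_nodal(path::AbstractString, data::AbstractMatrix) = writedlm(path, data, ',')
load_nodal(path::AbstractString) = readdlm(path, ',', Float64)

dir = mktempdir()
path = joinpath(dir, "project_a_run1.csv")

# One row per node: node number followed by two nodal values (made-up numbers).
nodal = [1.0 0.25 0.50;
         2.0 0.30 0.55;
         3.0 0.35 0.60]

save_nodal(path, nodal)
loaded = load_nodal(path)
```

The rest of the processing code only calls `save_nodal` and `load_nodal`, so trying HDF5, BSON or anything else later does not require touching it.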