How do you store your data before and after processing with Julia?

Overview
I intend to process simulation data generated externally with Julia and would like some opinions on storage solutions for 1) the raw data, 2) the processed data, and 3) the processing inputs.

Data Description
There will be many similar projects each with several sets of simulation data. This simulation data will just be simple nodal values that can be stored as a 2D array [node_numbers,nodal_data]. The processed data will also likely be simple and 2D. The processing inputs will consist of constants and small vectors.

Considerations
There are other users who will need access to the data. They are used to working in Excel. I am thinking that storing all three data types as .csv files and organizing the projects with the file system would have the smallest learning curve. However, I am not sure that it is the best long-term solution. I have no experience with databases, but I understand that they can be viewed similarly to Excel and can perform some of the project organization all in one place. Moreover, I keep changing my mind on whether I should store processing inputs as a text file, spreadsheet, or just keep them at the top of the Julia code.

How do you store your data and how do you think I should store mine?

I have found JSON files handy for unstructured data and metadata.
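For example, storing the processing inputs as JSON is only a few lines with JSON.jl (a sketch; the file name and parameter names are made up):

```julia
import JSON

# Hypothetical processing inputs: constants and small vectors
inputs = Dict("velocity" => 1.2, "depth" => 10.6, "tau" => [0.8, 1.1, 1.3, 1.5])

# Write human-readable JSON (4-space indent) and read it back
open("inputs.json", "w") do io
    JSON.print(io, inputs, 4)
end
inputs_back = JSON.parsefile("inputs.json")
```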

4 Likes

The HDF5 format (via HDF5.jl) is also an option if you want hierarchical structure and compression. It is well suited to scientific data, and you can read and write it from many other languages.
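A minimal HDF5.jl sketch, assuming a 2D array of nodal values (file and dataset names are invented):

```julia
using HDF5

raw = rand(100, 3)  # [node_numbers, nodal_data]

# Store under a hierarchical path; intermediate groups are created as needed
h5write("project.h5", "sim1/raw", raw)

# Read it back (the file is also readable from Python, MATLAB, C, ...)
raw_back = h5read("project.h5", "sim1/raw")
```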

2 Likes

https://felipenoris.github.io/XLSX.jl/dev/ lets you read, create, or modify .xlsx files. I’m not suggesting that as your primary means of storing things, but in particular if you choose something other than .csv it gives you an option for interacting with coworkers who can’t/won’t use Julia.
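For instance, to hand processed results to Excel users, something like this should work with XLSX.jl (a sketch; the file name and columns are made up):

```julia
using XLSX

# Hypothetical processed results as a vector of columns plus header labels
columns = [collect(1:3), [0.12, 0.34, 0.56]]
labels  = ["node", "value"]

# Writes a new .xlsx file that coworkers can open directly in Excel
XLSX.writetable("results.xlsx", columns, labels)
```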

3 Likes

I’d been doing ok with JLD2, but the latest README tells me that could be a bad idea. That README says the old JLD is a possibility. I’m not storing anything terribly large, and most of the data are arrays of floating-point numbers.

I want “load” and “store”, and JLD2 was fine for that. HDF5 seems harder to use. Is that right?
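For reference, the “load/store” workflow in JLD2 is just a pair of macros (a sketch with invented variable names):

```julia
using JLD2

x = rand(5, 2)
meta = "run 42"

@save "results.jld2" x meta   # store the named variables
@load "results.jld2" x meta   # load them back into the current scope
```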

Piggybacking off @ctkelley, I’d also like a format that is simple (unless it provides superior organization to justify complexity). Ideally something that non-coders could open and read. HDF5 seems like it requires special software to look at the data, and JSON data seems to be buried in formatting. I’m not sure if either would work well for my (co-workers’) needs.

Welcome @Nathan_Boyer. From your description of the data, it seems like you are interested in scientific meshes? Pardon if that is not correct.

If so, I think you can leverage standard formats such as VTK: https://github.com/jipolanco/WriteVTK.jl

1 Like

For unstructured data, YAML (YAML.jl) is an alternative to JSON. The concepts are very similar, but I find YAML a bit easier to read and write as a human.
When sharing tabular data with other users, I usually use csv. It is simple and can be read by practically all programming languages, Excel and humans directly.
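For the inputs, a YAML sketch might look like this (the keys are invented, and `YAML.write_file` requires a reasonably recent YAML.jl version):

```julia
using YAML

# inputs.yml might contain:
#   velocity: 1.2
#   depth: 10.6
#   tau: [0.8, 1.1, 1.3, 1.5]
params = YAML.load_file("inputs.yml")

# Writing is also possible in recent YAML.jl versions
YAML.write_file("inputs_copy.yml", params)
```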

1 Like

See also discussion here. I just recently came across that thread after having some problems with JLD2 and finding out that the package is “officially abandoned” now. Of the solutions discussed there, I opted for BSON.

However, if the data are to be directly readable by Excel, .csv or the native Excel format appear to be the only options. Otherwise, you could consider writing a simple converter from BSON (or similar) to .csv.
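Such a converter can be quite short, e.g. (a sketch assuming a plain matrix is stored in the BSON file):

```julia
using BSON, CSV, Tables

A = rand(10, 3)
BSON.@save "results.bson" A        # binary, Julia-friendly storage

# Hypothetical converter: load from BSON, write a .csv for Excel users
BSON.@load "results.bson" A
CSV.write("results.csv", Tables.table(A))
```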

2 Likes

I use the NetCDF format (NCDatasets.jl or NetCDF.jl) for storing 2-/3-/N-D gridded data.
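A minimal NCDatasets.jl sketch for a 2-D variable (the dimension and variable names are made up):

```julia
using NCDatasets

ds = NCDataset("output.nc", "c")    # "c" = create a new file
defDim(ds, "node", 100)
defDim(ds, "field", 3)
v = defVar(ds, "nodal_data", Float64, ("node", "field"))
v[:, :] = rand(100, 3)
close(ds)
```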

My data are indeed solutions on unstructured meshes. I was hoping for one format to rule them all to avoid duplicate data. However, this discussion leads me to believe I may need to select a coding-friendly format for me and my calculations and then write the most important data back out to .csv at the end for sharing. There are way more options than I realized.

If you are storing tabular data (using DataFrames.jl) just for yourself / other users of your code, https://github.com/xiaodaigh/JDF.jl is a great choice - very fast and small file-sizes.
However, it can only be read with Julia.

2 Likes

In my case:

Before: in the format the data came to me. Key point: don’t change those files.

After: csv or HDF5 depending on the structure and size.

1 Like

I usually store raw data as plain text: comma/whitespace-separated values or Esri ASCII grid files (.asc). They are easy for my colleagues to read in Python, Matlab, or Excel. When writing files to disk has a significant impact on execution time, I prefer storing raw data as binary and saving an auxiliary text file describing its structure (I have only done this in Fortran).

For processed data I use the same formats as the raw data.

I use plain text files for inputs. On each line I write the name of the parameter, separated from one or more values by whitespace:

velocity 1.2
depth    10.6
tau      0.8,1.1,1.3,1.5
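Reading such a file back in Julia takes only a few lines (a sketch; the format is exactly the one above):

```julia
# Parse "name value[,value,...]" lines into a Dict of Float64 vectors
function read_inputs(path)
    params = Dict{String,Vector{Float64}}()
    for line in eachline(path)
        isempty(strip(line)) && continue
        name, rest = split(line; limit = 2)     # split on whitespace once
        params[name] = parse.(Float64, split(rest, ','))
    end
    return params
end
```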

I suggest storing your data in text files unless writing to disk has a significant impact on the total execution time of your simulations. Text files are easier for other users to read in a variety of applications and programming languages.

2 Likes

Is HDF5 preferred over SQL?

Is a text file for inputs preferred over global Julia constants?

That depends on your use case: whether you are familiar with either, the kind and size of data you are storing, how you want to access it, and whether you need database features such as transactions with ACID guarantees.

This, again, depends on your use case. E.g., a text-based format like CSV or JSON may be accessible to other tools/languages. On the other hand, Julia expressions in code would work fine for tiny but complex data.

These days I usually work with HDF5 and JSON and find both pretty nice. Generally I avoid formats that serialize/deserialize Julia objects as is, and invest a bit in keeping things language-agnostic (one would do this anyway for CSV, it is less work with the formats above). HDF5 is better for replacing parts of the data.
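“Replacing parts of the data” with HDF5 looks roughly like this (a sketch; the file and dataset paths are invented):

```julia
using HDF5

# Open read-write (creating the file if missing) and add one dataset
# without rewriting the rest of the file
h5open("project.h5", "cw") do file
    file["processed/v2"] = rand(100, 2)
end
```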

2 Likes

Is it fair to say that moving from left to right increases features but decreases portability?

Text-Based Formats:
CSV → JSON → YAML → ASDF

Binary-Based Formats:
Stream Binary → BSON → HDF5 → JLD

Is it fair to say that moving from left to right increases features but decreases portability?

I think that’s right.

Again, as @Tamas_Papp wrote, preference for one format over another depends on your use case.

For example, my colleagues are civil engineers and they are more comfortable working with CSV and spreadsheets than more complex and robust data-interchange formats. That’s why I use plain text files for most of my work.

1 Like

I am not sure about this. These days, support for JSON is about as universal as CSV in commonly used languages.

I think that you are still trying to solve this problem in the abstract, but the optimal solution will depend on your use case (data size and type, languages you and your collaborators use, various trade-offs with performance, storage space, etc.). We know very little about your use case, so it is hard to give more specific advice.

I would recommend that you stop dwelling on this, and write a few simple functions for reading/writing your data using whatever format you want to try first, then you can experiment and change this easily later on.

4 Likes

I have a strong SQL background, so I am biased, but I would consider:

https://www.sqlite.org/about.html (+ SQLite.jl)
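A minimal SQLite.jl sketch (the table and column names are invented; DBInterface.jl and DataFrames.jl are assumed to be installed):

```julia
using SQLite, DBInterface, DataFrames

db = SQLite.DB("projects.db")   # a single portable file, no server needed

DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS results (node INTEGER, value REAL)")
DBInterface.execute(db, "INSERT INTO results VALUES (?, ?)", (1, 0.5))

# Query results flow straight into a DataFrame via the Tables.jl interface
df = DBInterface.execute(db, "SELECT * FROM results") |> DataFrame
```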

PRO:

CON:


Edit_1:

7 Likes