How do you store your data before and after processing with Julia?

I agree with this view of SQLite, and will add that it can handle tens to hundreds of gigabytes of data without problems. I’ve done things like plop decades of census data at the individual level into SQLite. I also used it for my dissertation, where I stored all the data from molecular dynamics simulations for later analysis.

It’s useful to read up a bit on SQL; there are many good books for this purpose. Choose one focused on data analysis rather than transaction processing.

2 Likes

How does SQLite compare to HDF5 and raw binary files in terms of performance for chunked, compressed storage of large and growing arrays that don’t fit in memory and must be read/written chunk by chunk?
I have heard (but haven’t tested myself) that it is inefficient to store large objects (images, arrays and so on) in SQL databases. Is that true or not?

1 Like

I have heard SQL is slower. For small datasets, especially structured ones, you could use a self-contained solution like SQLite; for larger datasets, a columnar NoSQL store.

But keep in mind that formats such as HDF and NetCDF were designed for large amounts of scientific data (arrays, rasters), so they are faster to read and write.

TLDR: You have to use what you know and what you like :slight_smile:

  • And benchmark first!
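For the chunked-array question above, here is a minimal HDF5.jl sketch (assuming a recent HDF5.jl; the file name, dataset name and sizes are made up) of writing and reading a compressed, chunked dataset piece by piece rather than all at once:

using HDF5

h5open("big.h5", "w") do f
    # carve out a 100_000 × 1_000 Float64 dataset stored as gzip-compressed 1_000 × 1_000 chunks
    dset = create_dataset(f, "A", datatype(Float64), dataspace((100_000, 1_000));
                          chunk = (1_000, 1_000), deflate = 3)
    dset[1:1_000, :] = rand(1_000, 1_000)    # write one chunk-sized block; nothing else is touched
end

h5open("big.h5", "r") do f
    block = f["A"][1:1_000, :]               # read back just that slice, not the whole array
end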

IMHO: SQLite can be important if interoperability is a high requirement (using the same data from a mobile phone all the way up to HPC).

You can use SQLite for reading/writing extra-large data … like any SQL database … it will work!

SQLite is also used by mobile phone apps: you can store the full-Earth OpenStreetMap vector data (> 60 GB), preprocessed and split into SQLite3 format (inside Google Protobuf blobs), and use it on any low-profile Android phone (see the MBTiles format).

IMHO: it is possible to combine the best of both worlds (a sketch follows after this list) …

  • (special compression + SQL + data splitting + indexes + views …)
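A minimal sketch of that “container” idea from Julia (the table layout, file name, and the CodecZlib compression choice are mine, just for illustration):

using SQLite, CodecZlib

db = SQLite.DB("container.db")

# one table holding gzip-compressed binary payloads, keyed like map tiles
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS tiles (zoom INTEGER, x INTEGER, y INTEGER, data BLOB);")

payload = rand(UInt8, 10_000)                       # stand-in for a real tile / array chunk
DBInterface.execute(db, "INSERT INTO tiles VALUES (?,?,?,?)",
                    [12, 2200, 1343, transcode(GzipCompressor, payload)])

# an index for fast lookup and a view for easier querying
DBInterface.execute(db, "CREATE INDEX IF NOT EXISTS idx_tiles ON tiles (zoom, x, y);")
DBInterface.execute(db, "CREATE VIEW IF NOT EXISTS tile_counts AS SELECT zoom, count(*) AS n FROM tiles GROUP BY zoom;")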

I don’t have any experience with HDF5, but for a comparison of basic SQLite columns (without special compression) vs. HDF on simulation data, you can read this paper (2015):
https://www.researchgate.net/publication/282986584_Efficient_Management_of_Big_Datasets_Using_HDF_and_SQLite_A_Comparative_Study_Based_on_Building_Simulation_Data

You can add your own extra compression via the BLOB type, so you can use SQLite as a container format.

GEO examples:

1 Like

It can be faster than storing a large number of individual files, especially if the files aren’t too big. If you have a few thousand 1 MB blobs, one per file, you’re probably better off putting them as a few thousand rows with blobs in one SQLite file, because the overhead of opening each individual file and reading its data is higher than the overhead for SQLite to grab different rows from its already-open file.

https://www.sqlite.org/fasterthanfs.html
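A hedged sketch of that idea in Julia (the directory, file names, and table layout are made up):

using SQLite

db = SQLite.DB("blobs.db")
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB);")

# store each small file as one row with a blob column
for path in readdir("results"; join = true)
    DBInterface.execute(db, "INSERT OR REPLACE INTO files VALUES (?,?)",
                        [basename(path), read(path)])    # read(path) returns a Vector{UInt8}
end

# fetch one blob back without opening any of the individual files
row = first(DBInterface.execute(db, "SELECT data FROM files WHERE name = ?", ["run_042.bin"]))
bytes = row.data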

You’d have to test your specific use case, but I promise you SQLite is much more usable by your friends who only deal with Excel spreadsheets than HDF5 is… (access via ODBC from Excel).

Here’s the thing about HDF5 vs SQLite: both of them use B-trees to create a “filesystem-like” container object. But SQLite comes with a fully functional, standardized query language and works well for both metadata and the data itself. It has a BLOB type, so you can stuff enormous binary objects into it if you want. SQLite is also far more widely used: there are over one trillion SQLite files in existence (see sqlite.org’s “Most Widely Deployed SQL Database Engine”).

SQLite might seem like a toy to people used to dealing with supercomputers, but in real-world use it’s far more battle-tested than HDF5, and it is very actively maintained.

Also note that there are multiple “modes” of operation for SQLite. For example, if you’re planning to read and write lots of data, use WAL (Write-Ahead Logging) mode rather than the default rollback journal. Basically you run the SQL command

pragma journal_mode=WAL;

and you have a WAL database.
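From Julia, the same pragma can be issued through SQLite.jl; a minimal sketch (the file name is a placeholder):

using SQLite

db = SQLite.DB("mydata.db")
DBInterface.execute(db, "PRAGMA journal_mode=WAL;")    # switch this database to write-ahead logging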

EDIT: The “lite” part of SQLite comes from the fact that it doesn’t require you to install a database server process and create database users with access controls, and it’s not designed for the typical SQL workload where hundreds or thousands of clients each read and write a few rows at a time in a big table (like people who process credit card transactions). In many ways it should be called SQLHeavy, because it’s ideal for HEAVY analysis of data. It’s better for data analysis because it doesn’t have the overhead of worrying about hundreds of people inserting rows into your table while you’re joining it to another table, so it gets analysis tasks done quite quickly. For example, if you have the entire Census ACS at the microdata level and want to join the households table (millions of rows) to the person table (tens of millions of rows) and extract a randomized subset that meets some requirements, it will do this faster than most SQL servers would.

If the only thing you work with is a specialized kind of binary data blob that goes in and out of a special FEM solver or something, then obviously SQLite isn’t necessarily of interest. But if you have hundreds of FEM solves and need to ship them all to a colleague along with metadata about the conditions being simulated, plus the experimental data taken off the aerodynamic wind-tunnel physical model for comparison, I’d be looking at trying out SQLite. Any time you want to bundle a lot of related data together, it’s a good candidate.

6 Likes

For those of us without a background in databases: could one of the experts post a simple SQLite/Julia usage example? E.g. saving/reading a 2-D array of Float64 with an accompanying UTF-8 text.

2 Likes

As much as I like SQLite, I don’t think the overhead is major either way, especially with modern filesystems on reasonable hardware.

There are a lot of other differences between the various options (e.g. package support, fixed setup cost, portability, concurrent access, data integrity, long-run support for the data format) that may be more significant in practical applications.

I’m not one of the experts, but this is what I used a while ago:

using SQLite
using DataFrames
# https://www.tutorialspoint.com/sqlite/sqlite_select_query.htm
my_dir = "C:\\Projects\\SQLite"
cd(my_dir)
dbfile = joinpath(my_dir,"my_sqlite.sqlite")

# Define database
db = SQLite.DB(dbfile) # create the file
table_name = "MyTable" # now create a table inside the file, with Value1, Value2, Value3 as header
schema = SQLite.Tables.Schema((:Value1, :Value2, :Value3), (Int, String, Array{Float64,1}))
SQLite.createtable!(db, table_name, schema; temp=false, ifnotexists=true)

# Save to database
Value1 = 1
Value2 = "some string"
Value3 = [1.1,2.2,3.4,4.1]
DBInterface.execute(db, "INSERT INTO MyTable VALUES (?,?,?)", [Value1, Value2, Value3])

# Read everything from database into a DataFrame
mytable = DBInterface.execute(db, "SELECT * FROM MyTable")|>DataFrame
# get Value3, the vector of Float64
mytable.Value3

SQLite.jl’s documentation is good, but you will need to look into the SQLite commands themselves (link in the code) to do more.
Tested with SQLite.jl 1.0.3 and DataFrames.jl 0.21.2.
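For the original question about a 2-D Float64 array plus UTF-8 text, one variant (a sketch under my own assumptions, not the only way) is to serialize the matrix to bytes yourself and store it in a BLOB column next to the text:

using SQLite, Serialization

db = SQLite.DB("arrays.sqlite")
DBInterface.execute(db, "CREATE TABLE IF NOT EXISTS arrays (label TEXT, data BLOB);")

A = rand(Float64, 3, 4)
buf = IOBuffer()
serialize(buf, A)                       # note: Julia's Serialization format is not stable across Julia versions
DBInterface.execute(db, "INSERT INTO arrays VALUES (?,?)", ["Ψ-matrix №1", take!(buf)])

row = first(DBInterface.execute(db, "SELECT label, data FROM arrays"))
A2 = deserialize(IOBuffer(row.data))    # back to a 3×4 Matrix{Float64}
row.label                               # the accompanying UTF-8 string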

5 Likes

I did a quick comparison of some storage methods:

method  | read-time (s) | write-time (s) | size (MB) | size-gzipped (MB)
CSV     | 0.3           | 6              | 43        | 7.6
SQLite  | 36            | 26             | 196       | 14.9
JDF     | 0.08          | 0.13           | 15.12     | 10
Feather | 0.002         | 0.79           | 64        | 7.7

Source code:

I have not done any optimizations for the storage methods, only used the simplest interface for each.

Edit: the SQLite timings look suspiciously bad to me. Maybe I am missing something obvious?

6 Likes

Run the pragma statement I mentioned above to switch to WAL mode before doing your timings.

I tried some timings on my machine, which is not at all super fast, but I was in my /tmp directory, which is a tmpfs (in-memory) filesystem, so at least disk access is not the main issue. In WAL mode, writing the database took about 2.5 seconds. But doing select * from taxi took 10 to 11 seconds on average, which I think is way too long. So I’m guessing there’s some inefficiency in the way the Julia SQLite interface interacts with the database.

In particular, the output of

@btime dfx = DBInterface.execute(db, "select * from taxi") |> DataFrame

  10.192 s (22607532 allocations: 521.60 MiB)

450,627 rows × 20 columns (omitted printing of 15 columns)

Compare to reading the same table in R:

> library(RSQLite)
> 
> db  = SQLite()
> 
> dbc = dbConnect(db,"test_out.db")
> 
> t = proc.time()
> 
> df = dbReadTable(dbc,"taxi")
> 
> proc.time()-t
   user  system elapsed 
  1.853   0.017   2.396 
> 

> t=proc.time()
> df = dbGetQuery(dbc,"select * from taxi")
> proc.time()-t
   user  system elapsed 
  1.611   0.020   2.165 
> 

So there is definitely some Julia-specific implementation slowness, as Julia is about 5x slower than R here.

@quinnj seems to be the most recent committer to the Julia SQLite implementation; perhaps there is some obvious explanation he can offer?

If absolute speed is your main concern, because you have to read and write large datasets all day long, Feather is going to be the best choice. Its “read” implementation is basically just mmapping the disk file into memory, done… Write is basically flushing raw memory to disk.

However, it’s one table per file, and not exactly something people are going to read into Excel easily.
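A minimal Feather.jl sketch of that read/write pattern (the file name and DataFrame are placeholders):

using Feather, DataFrames

df = DataFrame(a = rand(10_000), b = rand(1:100, 10_000))
Feather.write("table.feather", df)     # essentially flushes the columns to disk
df2 = Feather.read("table.feather")    # columns come back memory-mapped, so the read is nearly instant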

5 Likes

Thank you for your benchmark …
According to your Julia test code, the SQLite data is ~196 MB;
on the other hand, my reference sqlite3 implementation with extras is ~49 MB.

That is strange … we need to inspect it more deeply …

I have created a reference sqlite3 script:

  • https://gist.github.com/ImreSamu/af34e1b4807fc4c3afad29d1b238252d
  • importing the same CSV (green_tripdata_2019-12.csv)
    • with FOREIGN KEY constraints
    • 450,627 rows × 20 columns (keeping ALL columns, not dropping any)
  • plus 4 small code tables for labels (for mere mortals)
    • payment_type
    • ratecode
    • trip_type
    • vendor
  • plus 3 extra views for CEOs / easy Excel processing / business logic
    • I can’t expect anybody to understand what VendorID=1 means, so I created a view
  • my sqlite3 version: SQLite version 3.31.1 2020-01-27 19:55:54

There are small differences in the types (I am using a text type for the datetimes), but 49 MB vs. 196 MB is a huge difference …

1 Like

Usually compressed JSONL files (JSON3.jl), which are inflated and uploaded to a PostgreSQL database (LibPQ.jl). Most of the transformations are done in SQL, and once the data is ready for analysis / viz (StatsMakie.jl) I pull it into Julia again.
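A hedged sketch of that pipeline (the file name, table, fields, and connection string are all made up for illustration):

using JSON3, LibPQ, CodecZlib, DataFrames

conn = LibPQ.Connection("dbname=mydb")

io = GzipDecompressorStream(open("events.jsonl.gz"))
for line in eachline(io)
    obj = JSON3.read(line)                              # one JSON object per line
    execute(conn, "INSERT INTO events (id, payload) VALUES (\$1, \$2)",
            [obj.id, JSON3.write(obj)])
end
close(io)

# ... do the heavy transformations in SQL, then pull the result back into Julia
df = DataFrame(execute(conn, "SELECT * FROM events_summary"))
close(conn)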

Two DateTime columns were saved as BLOBs in SQLite, causing the large file size. When I convert them to strings before writing to SQLite, I get the same file size as you.
Actually, there is already an issue for this:
https://github.com/JuliaDatabases/SQLite.jl/issues/160

I worked mostly with Oracle and Postgres before and was not aware that there is no dedicated datetime type in SQLite :frowning:

Thanks for the find!
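A sketch of that workaround (df and the table name are placeholders): convert the DateTime columns to ISO strings before writing, so they land in SQLite as TEXT instead of serialized BLOBs.

using SQLite, DataFrames, Dates

for name in names(df)
    if eltype(df[!, name]) <: DateTime
        df[!, name] = string.(df[!, name])    # e.g. "2019-12-18T15:52:30"
    end
end

db = SQLite.DB("taxi_text_dates.db")
SQLite.load!(df, db, "taxi")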

2 Likes

I really like SQLite and I like the SQLite.jl module, and I have been using it for a lot of engineering applications.

However, you’ve inadvertently pointed out the biggest problem with it: it can be hard to figure out how to do easy things. Your example of creating a schema in Julia is something I didn’t even know you could do.

Now that I know, I went back through the docs and still don’t understand how the docs tell you that it’s possible. I would never have figured it out without your example, and yet such an example is not in the docs.

And it’s a REALLY useful thing :slight_smile: otherwise you do what I do, which is to manage the schema through sqlite3 directly.

I think the docs are quite thorough, and yet I have a lot of difficulty figuring out how to do easy things.

My point here is not to complain, but to say: be patient, it’s worth the effort. I find it a great way to handle datasets of measurements. I’ve created DBs about 20M in size and accessing them is very fast.

edit: and ask for help! The list is great; there are many power users of the SQLite module :slight_smile:

2 Likes

The schema thing is documented only indirectly: Home · SQLite.jl

After the fact you can find it by searching for “schema”, but before the fact, not knowing what to search for… not so easy.

You can also always create schemas by writing the SQL code directly:

DBInterface.execute(mydb, "create table foo(bar int, baz text, quux real);")

2 Likes

What good books for SQL data analysis do you recommend? Bonus points if they are in Jupyter notebook format.

1 Like