Benchmarks of Various Formats for Tabular Data

lungben · November 20, 2020, 11:13am

Hi,

here is a Pluto.jl notebook that benchmarks read and write performance, as well as the resulting file sizes, for various tabular data formats.

https://gist.github.com/lungben/7d967eb5058bbe5708bb08fb1aeb2815

table_benchmarks.jl

### A Pluto.jl notebook ###
# v0.12.11

using Markdown
using InteractiveUtils

# This Pluto notebook uses @bind for interactivity. When running this notebook outside of Pluto, the following 'mock version' of @bind gives bound variables a default value (instead of an error).
macro bind(def, element)
    quote
        local el = $(esc(element))

This file has been truncated.

The following formats / packages are compared:

CSV via CSV.jl (JuliaData/CSV.jl)
JSON via JSONTables.jl (JuliaData/JSONTables.jl)
Zipped CSV via ZipFile.jl (fhs/ZipFile.jl)
JDF via JDF.jl (xiaodaigh/JDF.jl)
Parquet via Parquet.jl (JuliaIO/Parquet.jl)
Apache Arrow via Arrow.jl (apache/arrow-julia)
Excel (xlsx) via XLSX.jl (felipenoris/XLSX.jl)
SQLite via SQLite.jl (JuliaDatabases/SQLite.jl)

I used the default configuration of each package throughout, i.e. multithreading and compression are only used if the package switches them on by default.
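
To make the measurement idea concrete, here is a minimal sketch of timing one write and one read and recording the file size. It is not the notebook's actual code: `benchmark_format` is a hypothetical helper, the NamedTuple of vectors stands in for a DataFrame, and the `Serialization` standard library stands in for a real format package so that the example runs without installing anything.

```julia
using Serialization  # stdlib stand-in (assumption); the notebook uses CSV.jl, Arrow.jl, etc.

# Hypothetical helper: time one write and one read of `table`, and report the file size.
# `write_fn(path, table)` and `read_fn(path)` are whatever the format package provides.
function benchmark_format(write_fn, read_fn, table)
    path = tempname()
    try
        # Warm-up calls, so that compilation time is not included in the timings.
        write_fn(path, table)
        read_fn(path)

        t_write = @elapsed write_fn(path, table)
        t_read = @elapsed read_fn(path)
        return (write_s = t_write, read_s = t_read, bytes = filesize(path))
    finally
        rm(path; force = true)  # clean up the temporary file
    end
end

# A small column table (NamedTuple of vectors) as a stand-in for a DataFrame.
n = 100_000
table = (id = collect(1:n), value = rand(n), label = string.("row", 1:n))

result = benchmark_format(serialize, deserialize, table)
println(result)
```

A single `@elapsed` call is noisy; for numbers worth comparing, repeated runs (for example with BenchmarkTools.jl) would be more reliable.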

On my machine, Arrow is fastest, followed by JDF (note that Arrow is not compressed by default, but JDF is).

quinnj · November 21, 2020, 4:14am

Very interesting! Thanks for sharing!
