Quickest tabular file format when I do not know the number of rows

bsnyh · February 10, 2019, 6:18pm

There are many ways to handle file I/O. I noticed that reading speed can differ a lot depending on whether the number of rows is known in advance, at least when I read the file row by row and process each row as I go. So, what is the quickest way to read a tabular file row by row, both when the number of rows is known and when it is not?

bkamins · February 10, 2019, 7:22pm

This is a very general question. Assuming the file is a text file with a CSV-like structure, CSV.jl should be fast.

sairus7 · February 10, 2019, 8:02pm

I’ve used plain binary files with the header info in a separate txt file (same name, same folder) describing the column names and types (and possibly the number of rows). Data is written (appended) row by row. To read it, get the file size in bytes (or read the row count from the header) and compute the row length as the sum of the type sizes. Then you can read any range of rows as bytes and reinterpret them as the known types. This is much faster than parsing large text files.
If you need to read individual columns, you can store each column in a separate binary file. If you want additional features like compression with chunked reading, you can probably use a library such as HDF5 or Zarr.
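
To make the fixed-width layout above concrete, here is a minimal sketch. It is written in Python with only the standard library, purely for illustration, and is not the code from the post. The three-column schema, the file name `table.bin`, and the helper names `append_row`, `row_count` and `read_rows` are all hypothetical. The schema is hard-coded here, whereas the post keeps it in a separate txt header file.

```python
import os
import struct
import tempfile

# Hypothetical schema: int64 id, float64 value, int32 flag.
# "<" selects little-endian with no padding, so the row length is
# exactly the sum of the field sizes: 8 + 8 + 4 = 20 bytes.
ROW = struct.Struct("<qdi")


def append_row(path, row):
    """Append one row to the binary file; no row count has to be known up front."""
    with open(path, "ab") as f:
        f.write(ROW.pack(*row))


def row_count(path):
    """Derive the number of rows from the file size and the fixed row length."""
    size = os.path.getsize(path)
    if size % ROW.size != 0:
        raise ValueError("file size is not a multiple of the row length")
    return size // ROW.size


def read_rows(path, start, stop):
    """Read rows start..stop-1 by seeking to a byte offset, with no text parsing."""
    with open(path, "rb") as f:
        f.seek(start * ROW.size)
        buf = f.read((stop - start) * ROW.size)
    return list(ROW.iter_unpack(buf))


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.bin")
    for i in range(5):
        append_row(path, (i, i * 0.5, i % 2))
    n_rows = row_count(path)         # recovered from the file size alone
    middle = read_rows(path, 1, 3)   # rows 1 and 2 only
```

Because every row has the same byte length, the row count follows from the file size and any row range can be reached with a single seek. In Julia the same idea would presumably use `read!` and `reinterpret` instead of `struct`.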
