Quickest tabular file format when I do not know the number of rows


#1

There are many ways to handle file I/O. I have noticed that read speed can differ a lot depending on whether the number of rows is known in advance, when I read a file row by row and process each row as I go. So what is the quickest way to read a tabular file row by row, both when the number of rows is known and when it is not?


#2

This is a very general question. Assuming the file is a text file with a CSV-like structure, https://juliadata.github.io/CSV.jl/latest/#CSV.File should be fast.
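For the row-by-row case with an unknown row count, CSV.jl also provides `CSV.Rows`, which streams the file instead of materializing all columns the way `CSV.File` does. The following is only a minimal sketch: the file contents, the column names `id` and `value`, and the helper `sum_values` are made up for illustration, and it assumes CSV.jl is installed.

```julia
using CSV  # assumes the CSV.jl package is installed

# Write a small example file (hypothetical data, for illustration only).
path = tempname() * ".csv"
write(path, "id,value\n1,0.5\n2,1.5\n3,2.0\n")

# CSV.Rows iterates the file one row at a time, so the number of rows
# does not need to be known in advance. Without `types`, fields are
# returned as strings; here the two column types are given explicitly.
function sum_values(path)
    total = 0.0
    nrows = 0
    for row in CSV.Rows(path; types=[Int, Float64])
        total += row.value   # access a field by its column name
        nrows += 1
    end
    return total, nrows
end

total, nrows = sum_values(path)
```

Whether this beats `CSV.File` depends on the file and on how much work is done per row, so it is worth benchmarking both on real data.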


#3

I’ve used plain binary files with the header information in a separate text file (same name, same folder) that describes the column names and types (and possibly the number of rows). Data is written (appended) row by row. To read it, get the file size in bytes (or read the row count from the header) and compute the row length as the sum of the column type sizes. Then you can read any row range as bytes and reinterpret them as the known types. This is much faster than parsing large text files.
If you need to read individual columns, you can store each column in a separate binary file. If you want additional features such as compression with chunked reading, you can probably use a library such as HDF5 or Zarr.
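The row-oriented binary layout described above could be sketched roughly as follows, using only Base Julia. The column layout (`Int32`, `Float64`, `Float64`) and the names `COLTYPES`, `ROWBYTES`, `append_row`, `nrows`, and `read_rows` are assumptions for illustration, not the actual code used; in practice the column types would come from the separate header file. Values are written in the machine's native byte order, so the file is not portable across platforms with different endianness.

```julia
# Hypothetical row layout: (id::Int32, x::Float64, y::Float64).
const COLTYPES = (Int32, Float64, Float64)
const ROWBYTES = sum(sizeof, COLTYPES)   # 4 + 8 + 8 = 20 bytes per row

# Append one row as raw bytes; no row count has to be known or stored.
function append_row(path, id, x, y)
    open(path, "a") do io
        write(io, Int32(id), Float64(x), Float64(y))
    end
end

# The number of rows follows from the file size and the fixed row length.
nrows(path) = filesize(path) ÷ ROWBYTES

# Read an arbitrary range of rows by seeking to the byte offset of the
# first requested row and decoding each field with its known type.
function read_rows(path, range)
    rows = Tuple{Int32,Float64,Float64}[]
    open(path, "r") do io
        seek(io, (first(range) - 1) * ROWBYTES)
        for _ in range
            id = read(io, Int32)
            x = read(io, Float64)
            y = read(io, Float64)
            push!(rows, (id, x, y))
        end
    end
    return rows
end

# Usage: write three rows, then read back only rows 2 and 3.
path = tempname() * ".bin"
for i in 1:3
    append_row(path, i, Float64(i), 10.0 * i)
end
n = nrows(path)
subset = read_rows(path, 2:3)
```

Here `read(io, T)` decodes each field directly; reading a whole block of bytes and calling `reinterpret` on it is an alternative way to get the same values. Opening the file once and keeping the handle open would avoid the per-row `open` cost in `append_row`.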