How to read a large TSV file using Julia?

I have a very large TSV file, about 10 GB. It starts with roughly 10,000 lines of metadata (different versions of the file can have different numbers of metadata lines), followed by a header row for about 20 parameters, then millions and millions of rows of data.

What is the best way to read such a file in Julia? I tried CSV.read, but it did not work. Any recommendations are greatly appreciated. Many thanks!

In what way did it not work? Did you use the keyword argument delim = '\t'?
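
For example, something like this (the file name here is a stand-in for yours):

```julia
using CSV, DataFrames

# Pass the tab delimiter explicitly rather than relying on detection.
df = CSV.read("data.tsv", DataFrame; delim='\t')
```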


Many thanks for the tip! It works much better with that keyword argument. However, I’m getting a ton of warnings like the one below:

┌ Warning: thread = 1 warning: only found 1 / 15 columns around data row: 6618. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/b4GfC/src/file.jl:622

Is there a way to skip the 10,000 or so lines of metadata and read only the header and values into a DataFrame?

Maybe the keyword argument skipto will help? See the CSV.jl documentation.
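
For instance, if the metadata block turned out to be exactly 10,000 lines (that count is only an assumption here), something like this should work:

```julia
using CSV, DataFrames

# header gives the line number of the header row; skipto gives the
# first data line. With 10,000 metadata lines, the header would be
# line 10_001 and the data would start on line 10_002.
df = CSV.read("data.tsv", DataFrame; delim='\t', header=10_001, skipto=10_002)
```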


Many thanks. This is very helpful. I will definitely need this argument in my code.

I wonder if anyone has experience with heuristically detecting the number of lines to skip. That way I would not have to risk crashing my computer by opening such a big file in Excel just to count them.

Consider opening the large text file with an adequate viewer for Windows. One that can be obtained for free from the Microsoft Store is Large text viewer.

Example of viewing a 15 GB file (200M rows × 4 columns) [screenshot omitted].


Try the CSV chunk reader in DataConvenience.jl if the one in CSV.jl doesn’t suit your needs:

https://github.com/xiaodaigh/DataConvenience.jl#csv-chunk-reader
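
Going by the README there, usage is roughly this (an untested sketch):

```julia
using DataConvenience, DataFrames

# Iterate over the file one chunk at a time; each chunk arrives as a
# DataFrame, so the whole file never has to fit in memory at once.
for chunk in CsvChunkIterator("data.tsv")
    println(size(chunk))  # replace with real per-chunk processing
end
```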


I did something similar (for smaller files, however). The files I wanted to import were CSV, but each file contained multiple distinct tables as well as sporadically placed notes. I read the entire file as a string and then used some heuristics to decide which rows were notes, which were part of a table, and where a table started or ended. Then I passed each table chunk as a String to CSV.jl for parsing. If you know you have just one variable-length section of notes followed by a single table, you could easily do something similar, as sketched below.
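
A minimal sketch of that last step, assuming a table chunk has already been isolated as a String (the variable and its contents here are made up):

```julia
using CSV, DataFrames

# Wrap the extracted chunk in an IOBuffer so CSV.jl can parse it
# directly from memory instead of from a file.
table_chunk = "a,b\n1,2\n3,4\n"
df = CSV.read(IOBuffer(table_chunk), DataFrame)
```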


Thanks for sharing.

I’m doing the same thing right now. I read all rows using readlines and then use occursin to check where my header row is located. It’s been working well.
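
In code, roughly this, where "PARAM_A" stands in for a string that appears only in the header row:

```julia
using CSV, DataFrames

# readlines loads the whole file into memory (eachline would be a
# lazier alternative); findfirst returns the 1-based line number of
# the first line matching the predicate.
lines = readlines("data.tsv")
header_row = findfirst(l -> occursin("PARAM_A", l), lines)

# Point CSV.jl at the detected header row; data parsing starts on
# the next line by default.
df = CSV.read("data.tsv", DataFrame; delim='\t', header=header_row)
```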

I’m a Mac user and wonder if you know of a good Mac alternative. From some googling, people seem to recommend BBEdit. I installed Hex Fiend; it is truly fast, except that it shows the file as binary, so I can’t really read the information and tell which row contains what.