How to read a large TSV file using Julia?

I have a very large TSV file, about 10 GB. It starts with roughly 10,000 lines of metadata (different versions of the file can have different numbers of metadata lines), followed by a header row for about 20 parameters, then millions and millions of rows of data.

What is the best way to read such a file in Julia? I tried CSV.read, but it did not work. Any recommendations are greatly appreciated. Many thanks!

In what way did it not work? Did you use the keyword argument delim = '\t'?
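
For example, something like this (the file name here is a stand-in for yours):

```julia
using CSV, DataFrames

# Pass the tab delimiter explicitly rather than relying on detection.
df = CSV.read("data.tsv", DataFrame; delim='\t')
```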


Many thanks for the tip! It works much better with that keyword argument. However, I’m getting a ton of warnings like the one below:

┌ Warning: thread = 1 warning: only found 1 / 15 columns around data row: 6618. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/b4GfC/src/file.jl:622

Is there a way to skip the 10,000 or so lines of metadata and read only the header and values into a DataFrame?

Maybe the keyword argument skipto will help? See the CSV.jl documentation.
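
For instance, if the metadata block turned out to be exactly 10,000 lines (that count is only an assumption here), something like this should work:

```julia
using CSV, DataFrames

# header gives the line number of the header row; skipto gives the
# first data line. With 10,000 metadata lines, the header would be
# line 10_001 and the data would start on line 10_002.
df = CSV.read("data.tsv", DataFrame; delim='\t', header=10_001, skipto=10_002)
```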


Many thanks. This is very helpful. I will definitely need this argument in my code.

I wonder if anyone has experience with heuristically detecting the number of lines to skip. That way I would not have to risk crashing my computer by opening such a big file in Excel just to count them.

Consider opening the large text file with an adequate viewer for Windows. One that can be obtained for free from the Microsoft Store is Large text viewer.

Example of viewing a 15 GB file (200M rows × 4 columns) [screenshot omitted].


Try the CSV chunk reader in DataConvenience.jl if the one in CSV.jl doesn’t suit your needs:

https://github.com/xiaodaigh/DataConvenience.jl#csv-chunk-reader
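
Going by the README there, usage is roughly this (an untested sketch):

```julia
using DataConvenience, DataFrames

# Iterate over the file one chunk at a time; each chunk arrives as a
# DataFrame, so the whole file never has to fit in memory at once.
for chunk in CsvChunkIterator("data.tsv")
    println(size(chunk))  # replace with real per-chunk processing
end
```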


I did something similar (for smaller files, however). The files I wanted to import were CSV, but each file contained multiple distinct tables as well as sporadically placed notes. I read the entire file as a string and then used some heuristics to decide which rows were notes, which were part of a table, and where a table started or ended. Then I passed each table chunk as a String to CSV.jl for parsing. If you know you have just one variable-length section of notes followed by a single table, you could easily do something similar, as sketched below.
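
A minimal sketch of that last step, assuming a table chunk has already been isolated as a String (the variable and its contents here are made up):

```julia
using CSV, DataFrames

# Wrap the extracted chunk in an IOBuffer so CSV.jl can parse it
# directly from memory instead of from a file.
table_chunk = "a,b\n1,2\n3,4\n"
df = CSV.read(IOBuffer(table_chunk), DataFrame)
```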


Thanks for sharing.

I’m doing the same thing right now. I read all rows using readlines and then use occursin to check where my header row is located. It’s been working well.
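
In code, roughly this, where "PARAM_A" stands in for a string that appears only in the header row:

```julia
using CSV, DataFrames

# readlines loads the whole file into memory (eachline would be a
# lazier alternative); findfirst returns the 1-based line number of
# the first line matching the predicate.
lines = readlines("data.tsv")
header_row = findfirst(l -> occursin("PARAM_A", l), lines)

# Point CSV.jl at the detected header row; data parsing starts on
# the next line by default.
df = CSV.read("data.tsv", DataFrame; delim='\t', header=header_row)
```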

I’m a Mac user and wonder if you know of a good Mac alternative. From some googling, people seem to recommend BBEdit. I installed Hex Fiend; it is truly fast, except that it shows the file as binary, so I can’t really read the information and tell which row contains what.