Importing only specific lines from a large CSV file

Greetings everyone,

I am dealing with a very large CSV file, and I would like to only read specific parts of it and import them as a matrix.
Effectively, I want something like:

using DelimitedFiles
data = readdlm("myfile.csv")[n:m,:]

where “n” is the 1st row and “m” is the last row I’m interested in.
If I understand correctly, the code above loads the whole file into memory, which is not good if the data file is a few million rows long.
This is also going to be part of a loop, where in each iteration I load a different chunk from the csv, so performance matters.

What would be the most efficient and simple way to do this?

Check out these examples in CSV.jl’s documentation:

https://csv.juliadata.org/stable/examples.html#skipto_example

https://csv.juliadata.org/stable/examples.html#footerskip_example

CSV.jl has functionality for reading large CSV files in chunks already: Reading · CSV.jl

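To do exactly what you ask, you could do something like this with skipto and limit, but using the chunk iterator might be a better solution. Note that in the output below the delimiter was apparently not detected, so each row comes back as a single string column; passing delim=',' explicitly avoids this.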

julia> using CSV

julia> data = """
       a,b,c
       1,2,3
       4,5,6
       7,8,9
       totals: 12, 15, 18
       grand total: 45
       """
"a,b,c\n1,2,3\n4,5,6\n7,8,9\ntotals: 12, 15, 18\ngrand total: 45\n"

julia> file = CSV.File(IOBuffer(data);skipto=3, limit=2)
2-element CSV.File:
 CSV.Row: (var"a,b,c" = String7("4,5,6"),)
 CSV.Row: (var"a,b,c" = String7("7,8,9"),)

Thanks for the reply!

Thank you for your reply, hellemo.
It works well, but it imports the data as a CSV.File object.
Is there a way to get a matrix instead?

You can use stack for example:

julia> stack(CSV.File(IOBuffer(data);delim=',',skipto=3, limit=2);dims=1)
2×3 Matrix{Any}:
 4  5  6
 7  8  9

But here the element type is not inferred (we get a Matrix{Any}). One solution (maybe not the most efficient) is to broadcast identity over the result, which narrows the element type when all entries have the same type, as they do here:

julia> identity.(stack(CSV.File(IOBuffer(data);delim=',',skipto=3, limit=2);dims=1))
2×3 Matrix{Int64}:
 4  5  6
 7  8  9
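
For the original use case (rows n to m of a file, loaded repeatedly in a loop), the skipto/limit approach could be wrapped in a small helper. This is only a sketch under some assumptions: read_rows is a hypothetical name, the file is assumed to be comma-delimited with no header row and columns of a common numeric type, and Tables.matrix from Tables.jl is used here instead of stack to build the matrix.

```julia
using CSV, Tables

# Hypothetical helper: read file lines n through m (inclusive) as a matrix.
# header=false treats every line as data, mirroring readdlm(path)[n:m, :].
function read_rows(path, n, m)
    file = CSV.File(path; header=false, delim=',', skipto=n, limit=m - n + 1)
    return Tables.matrix(file)
end

# Small stand-in for the large file, written to a temporary path.
path, io = mktemp()
write(io, "1,2,3\n4,5,6\n7,8,9\n10,11,12\n")
close(io)

chunk = read_rows(path, 2, 3)   # the lines "4,5,6" and "7,8,9"
```

Only the requested rows are parsed and materialized, but the skipped lines still have to be scanned to find where row n starts, so chunks far into the file will likely take longer to reach.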

It is worth noting that reading into a DataFrame infers the column types, and the result can then be converted to an array.
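
A minimal sketch of that route, assuming the DataFrames.jl package is installed and that all selected columns share one type (the inline data is a shortened stand-in for the real file):

```julia
using CSV, DataFrames

# Shortened stand-in for the real file contents.
data = """
a,b,c
1,2,3
4,5,6
7,8,9
"""

# Read only two rows, starting at line 3 of the input, into a DataFrame.
df = CSV.read(IOBuffer(data), DataFrame; delim=',', skipto=3, limit=2)

# Convert the typed columns into a plain matrix.
M = Matrix(df)
```

Because each column already has a concrete type, Matrix(df) should produce a concretely typed matrix without the identity broadcast.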
