using CSV, DataFrames

number_of_lines = 10^4
CSV.write("data.csv", DataFrame(rand(number_of_lines, 10), :auto))

steps = 20
chunk = round(Int, number_of_lines / steps)
for i in 1:steps
    @time a = CSV.read("data.csv", DataFrame; skipto=i*chunk, limit=chunk)
end
Effectively, we just split the csv file into “chunks” of rows, and read them one by one.
It seems that the more lines are skipped, the more memory is allocated, so when reading very big files in chunks (e.g. the example above with 10^6 lines), getting towards the end of the file becomes painfully slow.
Is there a way to deal with this issue?
Would it be more appropriate to use another function to import the data perhaps?
I think CSV.read needs to open the file again each time and parse it until it has reached the desired spot, so you get quadratic runtime in the number of chunks.
Can you just read the whole file (or up to the point you need) into one big DataFrame and then split that large DataFrame into chunks?
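For example, something along these lines (just a sketch, reusing steps and chunk from your MRE; the views avoid copying the data):

df = CSV.read("data.csv", DataFrame)   # single pass over the file
chunks = [view(df, (i-1)*chunk+1:min(i*chunk, nrow(df)), :) for i in 1:steps]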
Alternatively, I found CSV.Chunks in the docs, which allows you to write something like this I think:
for chunk in CSV.Chunks("data.csv"; ntasks=steps)
    @time a = chunk |> DataFrame
end
But I think that still loads the entire file at once.
The chunks don’t have the same size for some reason, which is bad news if you have a strictly formatted file with headers at specific intervals.
Initially I thought the header was the problem in the MRE above, but even after deleting it manually I get exactly the same behaviour. Is that normal?
As an alternative, for anyone else reading this post, using numpy’s loadtxt() through PythonCall also does the job, but I’d rather stick to Julia if I have the option to do so.
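For reference, a rough sketch of that route (assuming comma-separated data with a single header line, as in the MRE, and the chunk size defined above):

using PythonCall
np = pyimport("numpy")

i = 5   # chunk index (0-based)
# numpy skips `skiprows` lines, parses only `max_rows` lines,
# and pyconvert copies the result into a Julia matrix
a = pyconvert(Matrix{Float64},
    np.loadtxt("data.csv", delimiter=",", skiprows=1 + i*chunk, max_rows=chunk))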
Have you tried CSV.Rows? I believe it only reads one row at a time. The downside is that it returns elements as strings, so you have to parse the strings yourself.
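A minimal sketch of what that looks like (the x1 column name comes from the auto-generated header in the MRE above):

using CSV

# CSV.Rows streams the file row by row; by default every field comes back as a String
for row in CSV.Rows("data.csv")
    x1 = parse(Float64, row.x1)   # parse the fields you need yourself
    # ... use x1 ...
end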
Thank you for the suggestion, @CameronBieganek .
I’m not quite sure how to use rows to extract the information that I need (skipping to a particular line, then reading a number of lines from there).
When I use CSV.Rows("data.csv") on the MRE above, the results are as follows:
If you are just looking for some solution, then the following works:
using CSV, DataFrames

rows = eachline("data.csv")   # lazy, stateful iterator over the lines of the file
m = 2                         # number of rows per chunk
dfs = DataFrame[]
# Iterators.partition pulls m lines at a time off the stream,
# so the file is only read once, front to back
for lines in Iterators.partition(rows, m)
    chunk = IOBuffer(join(lines, "\n"))
    df = CSV.read(chunk, DataFrame; buffer_in_memory=true)
    push!(dfs, df)
end
Although it might be terribly slow due to the lack of parallelism in CSV.jl.
PS: Do you have control over the input CSV files? It feels much easier to split the file once into several files instead of dealing with such non-standard CSV files.
As I mentioned above, using numpy’s loadtxt() through PythonCall also does the job, and it’s pretty fast!
It just feels a bit wrong to rely on a Python library for speed when using Julia, so I thought perhaps I might be missing something when it comes to using the CSV.File or CSV.read functions.
As for the PS: I’m working on a file that’s almost half a billion lines long, split into 250000 different chunks, so I’d rather keep the file as is.
DuckDB will generally use less memory and lets you transform and load into memory only the rows you need. I’m not sure if the version of the package on the registry is currently working, so you might have to build the dev version. If you start Julia with many threads, it will use multithreaded queries too. You can see the improvement of Parquet vs CSV in the 1 billion row writeup.
using DuckDB, DataFrames

db = DuckDB.DB()   # in-memory DuckDB database

# expose the CSV file as a lazily evaluated view
DuckDB.execute(db, "CREATE VIEW data AS SELECT * FROM read_csv_auto('data.csv')");
@time a = DuckDB.execute(db, "SELECT * FROM data") |> DataFrame;

# convert to Parquet on disk, then query the Parquet file directly
DuckDB.execute(db, "COPY (SELECT * FROM read_csv_auto('data.csv')) TO 'data.parquet'");
@time a = DuckDB.execute(db, "SELECT * FROM 'data.parquet'") |> DataFrame;
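For the chunked reading discussed in this thread, here is a rough sketch of pulling one chunk at a time with LIMIT/OFFSET, reusing the db handle above and the steps and chunk variables from the MRE, so only those rows are materialised in Julia. (Strictly speaking, SQL does not guarantee row order without an ORDER BY, so if exact chunk boundaries matter you may want an explicit row-number column.)

# read only the i-th block of rows into memory (0-based chunk index)
for i in 0:steps-1
    @time a = DuckDB.execute(db,
        "SELECT * FROM 'data.parquet' LIMIT $chunk OFFSET $(i * chunk)") |> DataFrame
end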
This sounds like an excellent candidate for reading into SQLite. Do it once, and you can use SQL to repeatedly query for every subset / property of the data which you need. That’s what I’d do even if CSV.jl was optimally fast.
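A minimal sketch of that workflow with SQLite.jl, assuming the data.csv from the MRE (the database file and table names are just placeholders, and DBInterface is the generic database API that SQLite.jl implements):

using SQLite, DBInterface, CSV, DataFrames

db = SQLite.DB("data.sqlite")                    # on-disk database, created once
SQLite.load!(CSV.File("data.csv"), db, "data")   # ingest the CSV into a table

# afterwards, pull out just the rows you need, as often as you need
a = DBInterface.execute(db, "SELECT * FROM data LIMIT 500 OFFSET 5000") |> DataFrame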
@rdavis120 and @mnemnion, thank you for recommending these alternative packages, I was not aware of them.
Looks like I got some homework to do, to decide which solution would be optimal.
However, I think it would still be worth looking at making CSV.jl capable of dealing with such cases, since it is probably the most popular package for reading files like these. I will proceed with submitting an issue on their page, just in case.
I also found it surprising that there is no easy solution (other than rolling your own IOBuffer-based loop).
However, it’s not within the scope of CSV.jl to deal with ill-formatted CSV files. The requirement that chunks have an exact size is also a challenge for parallelization, since one cannot predict where a new thread would have to start: the byte offsets of the right positions in the file are unknown in advance.
Does your file have multiple header rows throughout the file? Because that’s a major headache for any CSV reader. Are the headers all the same, or do the data schemas change for each chunk?
Pretty ugly format in my opinion, but it is what it is.
It comes from LAMMPS, a software package for molecular dynamics simulations.
Anyhow, I don’t think we need to get into the specifics of my file format here, I just wanted to clarify whether I’m missing something in the CSV package.
Oh, I’d never have imagined that, to be honest.
Would that be due to the inconsistent number of columns?
Nevertheless, the issue persists even for the MRE above, which is a well-behaved csv file as far as I’m aware, so I reckon there’s still something to investigate here.