Reading huge CSV files


#1

I would like to read a huge CSV file (around 9 GB) and apply filters to its rows in order to build another dataframe containing only the values I want. I have installed the JuliaDB package and tried the simple command:

using JuliaDB 
flights = loadtable("Document.csv")

and I got the following error:

OutOfMemoryError()

How can I deal with a request of this type in Julia? Thanks in advance.


#2

Unfortunately, handling large data in Julia is difficult at the moment. I would recommend opening a CSV.File, iterating through it, and only keeping what you need. Look at the docstrings in CSV.jl.
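For illustration only, the same idea can be sketched with nothing but Base Julia: stream the file line by line and keep only the rows that pass the filter, so at most one line is in memory at a time. The function name, file contents, and naive comma `split` below are assumptions for the sketch; a real CSV parser like CSV.jl properly handles quoting, escaping, and type detection, which this does not.

```julia
# Minimal sketch of streaming a CSV and keeping only matching rows,
# using only Base Julia (no packages). Assumes a simple comma-separated
# file with a header row and no quoted fields.
function filterlines(io::IO, keep::Function)
    header = split(readline(io), ',')           # read the header row
    kept = Vector{Vector{SubString{String}}}()
    for line in eachline(io)                    # one line in memory at a time
        fields = split(line, ',')
        keep(fields) && push!(kept, fields)
    end
    return header, kept
end

# Usage on an in-memory sample; a real call would pass open("Document.csv").
io = IOBuffer("a,b\n1,x\n2,y\n3,x\n")
header, rows = filterlines(io, f -> f[2] == "x")
```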


#3

Thanks for your answer. I was wondering about your suggestion: I cannot use CSV.read because that function cannot deal with this huge CSV file either. Maybe I didn’t understand your suggestion well. Could you clarify it for me? Thanks.


#4

This is the docstring I suggested you read:


#5

As @Tamas_Papp mentioned, using CSV.File(file; kw...) will return a CSV.File object which doesn’t load the entire dataset into memory. You could then “build up” a table by iterating over the rows and filtering as you’d like, something like:

using CSV  # the snippet assumes the CSV.jl package is installed

function buildtable(filter::Function, file)
    f = CSV.File(file)
    # create a NamedTuple of Vectors to push! to
    table = (colA=Int[], colB=Float64[], colC=String[])
    for row in f
        if filter(row)
            push!(table.colA, row.colA)
            push!(table.colB, row.colB)
            push!(table.colC, row.colC)
        end
    end
    return table
end
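The same push!-based pattern can be tried without a 9 GB file by making it generic over any iterator of NamedTuple rows; here a small in-memory Vector of NamedTuples stands in for CSV.File, and the column names colA/colB/colC match the snippet above. The function name and sample data are made up for the sketch.

```julia
# Same pattern as buildtable above, but generic over any iterator of
# NamedTuple rows, so it runs without a real CSV file.
function buildtable_generic(keep::Function, rows)
    # NamedTuple of Vectors to push! matching rows into
    table = (colA = Int[], colB = Float64[], colC = String[])
    for row in rows
        if keep(row)
            push!(table.colA, row.colA)
            push!(table.colB, row.colB)
            push!(table.colC, row.colC)
        end
    end
    return table
end

# Stand-in for CSV.File(file): three in-memory rows.
rows = [(colA = 1, colB = 1.5, colC = "yes"),
        (colA = 2, colB = 2.5, colC = "no"),
        (colA = 3, colB = 3.5, colC = "yes")]
t = buildtable_generic(r -> r.colC == "yes", rows)
```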

#6

This doesn’t seem to be true:

function readtest()
    i = 1
    for row in CSV.File("abc.csv"; allowmissing=:none)   
        i += 1
        if i > 5
            break
        end
    end  
end
readtest()

where abc.csv is ~10 GB, doesn’t work. If I use

CSV.File("abc.csv"; allowmissing=:none, limit=5)

it works.
It seems to be trying to read the whole dataset in the first case, or am I wrong?
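Whether `CSV.File` buffers the whole file eagerly depends on the CSV.jl version, but the general distinction the post is probing can be shown in Base Julia: an eager `collect` pulls every row from the source, while lazy iteration with an early `break` touches only the first few. The counting iterator below is a made-up stand-in for a file reader.

```julia
# Illustrates eager vs lazy consumption of a row source in Base Julia.
# A counting iterator stands in for the file reader; it records how
# many "rows" were actually pulled from it.
mutable struct CountingRows
    n::Int          # total rows available
    pulled::Int     # rows actually read so far
end
Base.iterate(c::CountingRows, i = 1) =
    i > c.n ? nothing : (c.pulled = i; (i, i + 1))
Base.length(c::CountingRows) = c.n

eager = CountingRows(1_000_000, 0)
collect(eager)                      # eager: materializes every row

lazy = CountingRows(1_000_000, 0)
for row in lazy
    row >= 5 && break               # lazy: stop after the first 5 rows
end
```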