CSVFiles line by line

Hi everyone,

I want to process a tab-delimited file “text.txt”. So far I am able to parse the whole file into memory with

using TextParse
csvread("text.txt", '\t'; header_exists=false) 

But I don’t know how to use the load() function from CSVFiles.

using CSVFiles
load("text.txt",'\t'; header_exists=false)
load("text.txt"; delim='\t', header_exists=false) 
...
# error:
# FileIO.File{FileIO.DataFormat{:UNKNOWN}}("text.txt") couldn’t be recognized by FileIO.

Eventually I want to process the file line by line. Here is a version I implemented using CSV.jl:

using CSV
f=CSV.File(file, delim='\t', header=false)
for (index,row) in enumerate(f)
   l_chr, l_coo, r_chr, r_coo = Symbol(row.Column2), row.Column3, Symbol(row.Column5), row.Column6
end

I wonder whether I can do something similar using CSVFiles, since the file is big compared to my machine’s memory and I also don’t need all of the columns.

Thanks a lot for helping.

To load a CSV file that doesn’t have the *.csv or *.tsv file extension, you need to tell the load function explicitly that the file should be treated as a CSV file (load can’t infer that from a .txt extension):

using CSVFiles, FileIO

x = load(File(format"CSV", "text.txt"), '\t', header_exists=false)

You can call IteratorInterfaceExtensions.getiterator(x) to get something that can be iterated row by row. So your code might then look like:

using CSVFiles, FileIO, IteratorInterfaceExtensions

f = load(File(format"CSV", "text.txt"), '\t', header_exists=false)

for (index,row) in enumerate(getiterator(f))
  l_chr, l_coo, r_chr, r_coo = Symbol(row.Column2), row.Column3, Symbol(row.Column5), row.Column6
end

But be warned: right now this will still first load all rows into memory and then iterate over them, i.e. this is not a streaming implementation. I plan to eventually add support for a proper streaming implementation; the pieces in TextParse.jl and CSVFiles.jl are all there, but they are not hooked up at this point. In my mind, streaming implementations have their place, but I think for most use cases one is actually better off loading things into memory first (if there is enough memory, of course) and then processing them from there.
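If you do need row-by-row streaming today, here is a minimal sketch of one alternative; it relies on CSV.Rows from newer CSV.jl versions (not on CSVFiles), and note that values come back as strings unless you pass explicit types:

using CSV

# CSV.Rows parses lazily, one row at a time, keeping only a small buffer
# in memory. With header=false the columns are named Column1, Column2, ...
for row in CSV.Rows("text.txt"; delim='\t', header=false)
    l_chr = Symbol(row.Column2)
    l_coo = parse(Int64, row.Column3)
end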

Loading only a subset of columns is another case, and support for that is almost done; see this PR. If you are adventurous, you can try it out and report back how it went; I’d love to hear some feedback on that! The implementation in that PR is efficient, i.e. if you exclude a column that way, it won’t take up any memory.
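I can’t reproduce the exact API from that PR here, but as a sketch of the same idea, newer CSV.jl releases expose column skipping through a select keyword (column names below assume header=false):

using CSV

# Only the selected columns are parsed and stored; all others are skipped.
f = CSV.File("text.txt"; delim='\t', header=false,
             select=[:Column2, :Column3, :Column5, :Column6])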

In general, I’ve put a fair bit of work into TextParse.jl lately to make sure it works with very large files, and in particular files that are larger than your main memory. So for example, the column skipping PR should right now work really well with a file that is way, way, way larger than your main memory, from which you only want to load say three columns that would fit comfortably into your main memory.

The eventual goal is to hook this column skipping stuff up with some new work we are doing over in Query.jl that makes it easier to select a subset of columns. I hope to get something like load("foo.csv") |> @select(startswith("bar")) |> DataFrame to work such that it would actually only load those columns from the CSV file whose names start with bar. The design for that is complete, but we’ll need a bit more time to finish all the necessary steps to implement it.
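For comparison, here is a sketch of a version that already works today; it selects the columns only after the whole file has been parsed, which is exactly the step the planned integration would push down into the reader:

using CSVFiles, FileIO, Query, DataFrames

df = load(File(format"CSV", "text.txt"), '\t', header_exists=false) |>
    @select(:Column2, :Column3, :Column5, :Column6) |>
    DataFrame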

Do note that CSV.File has been optimized for this exact purpose: a form of “lazy loading” of a CSV file, so that rows are streamed and only the columns you access are parsed. Currently on Windows, use_mmap=false is the default, so it’s not quite as memory efficient (but it should still only buffer about 8 KB of the file into memory at a time); you can set use_mmap=true in CSV.File and it should work fine.
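For example (a sketch; use_mmap was a CSV.File keyword argument at the time of this thread):

using CSV

# Memory-map the file instead of buffering it, which is more memory
# efficient when streaming very large inputs.
f = CSV.File("text.txt"; delim='\t', header=false, use_mmap=true)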

Also note that you can do:

using CSV, Tables
rows = Tables.datavaluerows(CSV.File(file))

and rows can be used with the entire Query.jl data-processing framework. I’m working on making that the default integration as well, so that plain CSV.File will “just work” with all of Queryverse.
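A minimal sketch of the kind of pipeline that enables (column names again assume header=false):

using CSV, Tables, Query, DataFrames

rows = Tables.datavaluerows(CSV.File("text.txt"; delim='\t', header=false))

# Standard Query.jl operators work on the DataValue-based row iterator.
df = rows |>
    @map({chr = _.Column2, coo = _.Column3}) |>
    DataFrame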

Tables.datavaluerows will trigger parsing of all columns, though? Or not?

Thanks for the help, @davidanthoff and @quinnj. I am quite new to Julia, and honestly I could not follow some of the nice features you mentioned.

I like the streaming feature of CSV.File (@quinnj), because the file I am trying to parse is huge (90 GB). But it takes very long to read all the data in (70 s for a 3 GB sample file of about 40M rows × 13 columns).

using CSV
function csv_parse(file)
    f=CSV.File(file, delim='\t', header=false)
    for (index,row) in enumerate(f)
       l_chr, l_coo, r_chr, r_coo = row.Column2, row.Column3, row.Column5, row.Column6
    end
end

I could not figure out how to use the skip-column feature of TextParse (@davidanthoff), but it gave me the idea that skipping full parsing, combined with TextParse, could provide a workaround. I first split the fields of each line by finding each '\t':

function field_spliter(row)
    l_chr, l_coo, r_chr, r_coo = "", "", "", ""
    field_count = 1
    start_p, end_p = 1, -1
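    # NOTE: this assumes single-byte (ASCII) content, so the character
    # positions from enumerate below match the string indices used for slicing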
    for (pos, char) in enumerate(row)
        if char == '\t'
            start_p = end_p + 2 #end_p stands for last field's end position
            end_p = pos - 1
            #split field 2,3,5,6
            if field_count == 2
                l_chr = row[start_p:end_p]
            elseif field_count == 3
                l_coo = row[start_p:end_p]
            elseif field_count == 5
                r_chr = row[start_p:end_p]
            elseif field_count == 6
                r_coo = row[start_p:end_p]
                break
            end
            field_count += 1
        end
    end
    return l_chr, l_coo, r_chr, r_coo
end

Then I read the file line by line:

function customize_parse(file)
    open(file) do f
        for (index,row) in enumerate(eachline(f))
           #parsing fields
           l_chr, l_coo, r_chr, r_coo = field_spliter(row)
           l_coo = TextParse.tryparsenext_base10(Int64, l_coo, 1, length(l_coo))[1].value
           r_coo = TextParse.tryparsenext_base10(Int64, r_coo, 1, length(r_coo))[1].value
        end
    end
end

It takes 16 s; using parse() from Base instead takes 21 s.
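For reference, the Base variant mentioned above just swaps the two TextParse calls for parse(), e.g.:

# Base alternative to TextParse.tryparsenext_base10:
l_coo = parse(Int64, l_coo)
r_coo = parse(Int64, r_coo)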

I think you should try passing the result of CSV.File(...) to the function instead of the file name. That will allow the compiler to specialize on the column types.
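A minimal sketch of that suggestion (the function name is just illustrative): construct the CSV.File once at top level and pass it through a function barrier, so the loop body is compiled for the file’s concrete column types.

using CSV

# `f` arrives with a concrete type, so the row-field accesses below
# compile to type-stable, specialized code.
function process_rows(f)
    for row in f
        l_chr, l_coo = Symbol(row.Column2), row.Column3
        r_chr, r_coo = Symbol(row.Column5), row.Column6
    end
end

f = CSV.File("text.txt", delim='\t', header=false)
process_rows(f)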