To load a CSV file that doesn’t have the *.csv or *.tsv file extension, you need to tell the load function explicitly that you want to load the given file as a CSV file (load can’t detect this by itself from the file extension):
using CSVFiles, FileIO
x = load(File(format"CSV", "text.txt"), '\t', header_exists=false)
You can call IteratorInterfaceExtensions.getiterator(x) to get something that can be iterated row by row. So your code might then look like:
using CSVFiles, FileIO, IteratorInterfaceExtensions
f = load(File(format"CSV", "text.txt"), '\t', header_exists=false)
# With header_exists=false the columns are referred to as Column1, Column2, ...
for (index, row) in enumerate(getiterator(f))
    l_chr, l_coo, r_chr, r_coo = Symbol(row.Column2), row.Column3, Symbol(row.Column5), row.Column6
end
But be warned: right now this will still load all rows into memory first and then iterate over them, i.e. this is not a streaming implementation. I plan to eventually add support for a proper streaming implementation. The pieces in TextParse.jl and CSVFiles.jl are all there, but they are not hooked up at this point. In my mind, streaming implementations have their place, but I think for most use cases one is actually better off loading things into memory first (if there is enough memory, of course) and then processing things from there.
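For comparison, here is a rough sketch of what row-by-row streaming looks like using only Base Julia. This is not how CSVFiles.jl or TextParse.jl work internally; it is a naive illustration that assumes tab-separated fields with no quoting or escaping. The helper name foreach_row, the sample data, and the assumption that columns 3 and 6 hold integers are all made up for the example.

```julia
# Naive streaming reader: calls `f` once per line with the split fields.
# No support for quoted or escaped delimiters.
function foreach_row(f, path::AbstractString; delim::Char='\t')
    for line in eachline(path)      # eachline reads lazily, one line at a time
        isempty(line) && continue   # skip blank lines
        f(split(line, delim))
    end
    return nothing
end

# Hypothetical sample data: six tab-separated columns, no header row.
path = tempname()
write(path, "a\tchr1\t100\tb\tchr2\t200\nc\tchr3\t300\td\tchrX\t400\n")

results = Tuple{Symbol,Int,Symbol,Int}[]
foreach_row(path) do fields
    # Assumes columns 3 and 6 are integer coordinates.
    l_chr, l_coo = Symbol(fields[2]), parse(Int, fields[3])
    r_chr, r_coo = Symbol(fields[5]), parse(Int, fields[6])
    push!(results, (l_chr, l_coo, r_chr, r_coo))
end
rm(path)
```

Only the current line is held by the reader itself; whatever the callback decides to keep (here, the results vector) is the only thing that accumulates.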
Loading only a subset of columns is another case, and support for that is almost done, see this PR. If you are adventurous, you can try it out and report back how that went, I’d love to hear some feedback on that! The implementation in that PR is efficient, i.e. if you exclude some column that way, it won’t take up any memory etc.
In general, I’ve put a fair bit of work into TextParse.jl lately to make sure it works with very large files, and in particular files that are larger than your main memory. So for example, the column skipping PR should right now work really well with a file that is way, way, way larger than your main memory, from which you only want to load say three columns that would fit comfortably into your main memory.
The eventual goal is to hook this column skipping stuff up with some new work we are doing over in Query.jl that makes it easier to select a subset of columns. I hope to get something like load("foo.csv") |> @select(startswith("bar")) |> DataFrame to work such that it would actually only load those columns from the CSV file where the column name starts with bar. The design for that is complete, but we’ll need a bit more time to finish all the necessary steps to implement it.
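To make the idea of loading only matching columns concrete, here is a minimal sketch in plain Base Julia. It is not the TextParse.jl or Query.jl implementation, just an illustration under simple assumptions: a comma-separated file with a header row and no quoted fields. The function name load_columns and the sample data are hypothetical.

```julia
# Keep only the columns whose header starts with `prefix`. Values from all
# other columns are never stored. Naive: no quoting or escaping support.
function load_columns(path::AbstractString, prefix::AbstractString; delim::Char=',')
    open(path) do io
        header = split(readline(io), delim)
        # Indices of the columns we want to keep.
        keep = findall(name -> startswith(name, prefix), header)
        columns = Dict(String(header[i]) => String[] for i in keep)
        for line in eachline(io)
            isempty(line) && continue
            fields = split(line, delim)
            for i in keep
                push!(columns[String(header[i])], String(fields[i]))
            end
        end
        return columns
    end
end

# Hypothetical sample file with one column that should be skipped.
path = tempname()
write(path, "bar_a,foo,bar_b\n1,x,2\n3,y,4\n")
cols = load_columns(path, "bar")
rm(path)
```

Because the file is read line by line and only the selected fields are copied, memory use here scales with the kept columns rather than with the size of the whole file.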