How to use a CSVFile as an iterable?

Hello again. I’m testing reading a large CSV file with CSVFiles.jl instead of CSV.jl, because the file is too large for CSV.jl on Windows. But I cannot use the CSVFile as an iterable. I’m trying:

function mylength(iter)
    n = 0
    for i in iter
        n += 1
    end
    return n
end

using CSVFiles, Dates  # load/File/format"..." come from CSVFiles; Date comes from Dates

function test(src::String)
    table = load(File(format"CSV", src);
        colnames=[:day, :glnprovider, :glnretailerlocation, :gtin, :inventory, :cost, :sales, :price],
        colparsers=[Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32],
        header_exists=false
    )
    return table |> mylength
end

test("D:\\Data\\03012019_03312019_17440.csv.gz")

I’m hoping to use the table as an iterable so I can calculate different things in a streaming fashion. But I’m getting the error:

MethodError: no method matching iterate(::CSVFiles.CSVFile)

So I found the developer guide for iterable tables. In that document, the author says that we can call the getiterator function. But when I tried that, I got this error:

UndefVarError: getiterator not defined

So, how can I use an iterable table (the CSV file) as an iterator? Do I need to get the iterator somehow?

getiterator is defined in IteratorInterfaceExtensions.jl, so you need to load that package.
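As a minimal sketch of what that could look like (the sample file, its column names, and the helper count_rows are made up for illustration, and this assumes both CSVFiles.jl and IteratorInterfaceExtensions.jl are installed):

```julia
using CSVFiles, IteratorInterfaceExtensions

# Write a tiny sample file so the sketch is self-contained (hypothetical data).
path = joinpath(mktempdir(), "sample.csv")
write(path, "colA,colB\n1,2.5\n3,4.5\n")

# `load` returns a CSVFile, which is not itself iterable.
table = load(path)

# Ask the iterable-tables interface for an actual row iterator.
rows = IteratorInterfaceExtensions.getiterator(table)

# Hypothetical helper: count the rows by iterating once.
function count_rows(iter)
    n = 0
    for _ in iter
        n += 1
    end
    return n
end

nrows = count_rows(rows)
```

Each element produced by the iterator should be a named tuple, so inside the loop a field can be read as, for example, row.colA.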

But be warned: CSVFiles.jl currently reads everything into memory, and then iterates from that. So if load("foo.csv") |> DataFrame doesn’t work because of memory limitations, then using getiterator will probably also not work (still worth a try, of course).

If you don’t need all of the columns of the file, you can try the new skip column feature that I’ve added to TextParse#master: make sure you are using that (pkg> add TextParse#master), and then something like load("foo.csv", colparsers=Dict(:colA=>nothing, :colC=>nothing)) |> DataFrame should work. In that case, colA and colC are not loaded into memory at all.

My next project is to integrate the skip column feature with Query.jl, so that something like load("foo.csv") |> @select(-:colA, -:colC) |> DataFrame automatically skips those columns during load. No promise on timing, though :slight_smile:

EDIT: Oh, and I also plan to add a fully streaming mode at some point. Almost all the pieces for that exist already, so it actually shouldn’t be too difficult, but again, no timeline right now.


Thanks for your time. I’ll check just in case.

Great, any feedback on whether it works would be most welcome :slight_smile: