Is it possible to iterate over a very large CSV in Windows?

gabomgp · August 22, 2019, 2:23am

I’m using CSV.jl to iterate over a very large CSV, but in Windows it’s throwing the error:

Error: could not create file mapping: The operation completed successfully

This is the same error described in this issue:

https://github.com/JuliaData/CSV.jl/issues/424

The error appears with both CSV.File and CSV.Rows, so I suspect it is not possible to iterate over very large CSVs on Windows this way. Is there another option for this task?

I don’t need type inference; I can specify the name and type of every column, and the separator if necessary.

gabomgp · August 22, 2019, 2:29am

My current code is:

using Pkg

Pkg.add("CodecZlib")
Pkg.add("CSV")
Pkg.add("Queryverse")

using Query, CSV, CodecZlib, Dates

function mylength(iter)
    n = 0
    for i in iter
        n += 1
    end
    return n
end

function test(src::String)
    open(GzipDecompressorStream, src) do stream
        table = CSV.Rows(stream;
            header=[:day, :glnprovider, :glnretailerlocation, :gtin, :inventory, :cost, :sales, :price],
            dateformat="yyyy-mm-dd",
            types=[Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32],
            strict=true)
        return table |> mylength
    end
end

test("D:\\Data\\03012019_03312019_17440.csv.gz")

bernhard · August 22, 2019, 11:25am

Can you post a complete example?

gabomgp · August 22, 2019, 2:06pm

I updated the question with the complete code that fails. I can’t share the data, but it is a CSV of this size:

Compressed: 2.4 GB
Uncompressed: Almost 50 GB

Can I use another library for this task in Julia?

haberdashPI · August 23, 2019, 1:00am

You could try CSVFiles and see if it helps.

greg_plowman · August 23, 2019, 3:08am

Iterating with CSV.Rows should have a low memory footprint and handle large files.

However, if you are having issues (limited memory?), you could try streaming with basic primitives. In your case, the CSV file has known column types and does not look complex to parse.

So perhaps you could try something like:

using Dates
const types = [Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32]
const testfile = "testfile.csv"

function testwrite(filename::AbstractString)
    open(filename, "w") do io
        println(io, "2019-08-23, 5, 123, 17, 13.5, 1200.34, 1500.80, 22.30")
        println(io, "2019-08-22, 4, 122, 16, 12.5, 1100.34, 1400.80, 21.30")
    end
end

function testread(filename::AbstractString)
    nfields = length(types)
    for line in eachline(filename)
        strings = split(line, ",")
        fields = ntuple(i -> parse(types[i], strings[i]), nfields)
        println(fields)
        # do stuff with fields
    end
end

testwrite(testfile)
testread(testfile)
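
Since the file in the original post is gzip-compressed, the same line-by-line idea can be sketched on top of a decompression stream instead of a plain file. This is only a minimal sketch under assumptions: the helper names (`COLTYPES`, `parseline`, `countrows`, `writesample`) and the two sample rows are made up for illustration, the column types are the ones listed in the question, and the parser assumes plain comma-separated fields with no quoting or embedded commas. It reads through `GzipDecompressorStream` with `eachline`, so it does not memory-map the file.

```julia
using CodecZlib, Dates

# Column types as listed in the question; everything else here is hypothetical.
const COLTYPES = (Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32)

# Parse one comma-separated line into a tuple of typed values.
# Assumes no quoted fields and exactly length(COLTYPES) columns.
function parseline(line::AbstractString)
    strs = split(line, ',')
    return ntuple(i -> parse(COLTYPES[i], strip(strs[i])), length(COLTYPES))
end

# Stream the gzip file line by line and count the rows that parse.
function countrows(path::AbstractString)
    return open(GzipDecompressorStream, path) do stream
        n = 0
        for line in eachline(stream)
            isempty(line) && continue   # skip blank lines
            fields = parseline(line)    # do stuff with fields here
            n += 1
        end
        n
    end
end

# Write a tiny gzip-compressed sample file (made-up data) for trying it out.
function writesample(path::AbstractString)
    open(GzipCompressorStream, path, "w") do io
        println(io, "2019-03-01,5,123,17,13.5,1200.34,1500.80,22.30")
        println(io, "2019-03-02,4,122,16,12.5,1100.34,1400.80,21.30")
    end
end
```

To try it, call `writesample` on a temporary path and then `countrows` on the same path. Only one line is held in memory at a time, so memory use should stay roughly constant regardless of file size, at the cost of giving up CSV.jl's handling of quoting, missing values, and malformed rows.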

xiaodai · August 23, 2019, 11:03am

See this issue: “Iterating over chunks efficiently” (Issue #482, JuliaData/CSV.jl on GitHub).

Perhaps setting reusebuffer=true will help.

gabomgp · August 23, 2019, 2:38pm

Setting reusebuffer=true didn’t change the result.

gabomgp · August 23, 2019, 2:44pm

Thanks. I can try this, but I will probably use another solution (the idea is to teach others the same tool for this type of task, and being very easy to use is a plus).

I had hoped this task would be as easy as using a library. I’m really considering Julia as an ETL tool, because it looked easy to read, process, and write data in a streaming fashion with high performance. I suppose the ecosystem is not ready yet.

xiaodai · August 23, 2019, 2:56pm

Yeah. JuliaDB is not ready for general use, IMHO. I think I want to try my hand at a disk.frame-like or dask-like package for Julia at some point.
