Is it possible to iterate over a very large CSV on Windows?

I’m using CSV.jl to iterate over a very large CSV, but on Windows it throws this error:

Error: could not create file mapping: The operation completed successfully

This is the same error described in this issue:

https://github.com/JuliaData/CSV.jl/issues/424

The error appears with both CSV.File and CSV.Rows, so I suspect it is not possible to iterate over very large CSVs on Windows. Is there another way to do this task?

I don’t need type inference. I can specify the name and type of every column, and the separator if necessary.


My current code is:

using Pkg

Pkg.add("CodecZlib")
Pkg.add("CSV")
Pkg.add("Queryverse")

using Query, CSV, CodecZlib, Dates

# Count the elements of any iterator without materializing it.
function mylength(iter)
    n = 0
    for i in iter
        n += 1
    end
    return n
end

function test(src::String)
    # Decompress the gzip file on the fly and iterate over its rows.
    open(GzipDecompressorStream, src) do stream
        table = CSV.Rows(stream;
            header=[:day, :glnprovider, :glnretailerlocation, :gtin, :inventory, :cost, :sales, :price],
            dateformat="yyyy-mm-dd",
            types=[Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32],
            strict=true)
        return table |> mylength
    end
end

test("D:\\Data\\03012019_03312019_17440.csv.gz")

Can you post a complete example?

I updated the question with the complete code that fails. I can’t share the data, but it is a CSV with the following size:

  • Compressed: 2.4 GB
  • Uncompressed: Almost 50 GB

Is there another Julia library I could use for this task?

You could try CSVFiles and see if it helps.

Iterating with CSV.Rows should have a low memory footprint and handle large files.

However, if you are having issues (limited memory?), you could try streaming the file using basic primitives. In your case, it seems your CSV file has known types and is not complex to parse.

So perhaps you could try something like:

using Dates
const types = [Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32]
const testfile = "testfile.csv"

function testwrite(filename::AbstractString)
    open(filename, "w") do io
        println(io, "2019-08-23, 5, 123, 17, 13.5, 1200.34, 1500.80, 22.30")
        println(io, "2019-08-22, 4, 122, 16, 12.5, 1100.34, 1400.80, 21.30")
    end
end

function testread(filename::AbstractString)
    nfields = length(types)
    for line in eachline(filename)
        strings = split(line, ",")
        fields = ntuple(i -> parse(types[i], strings[i]), nfields)
        println(fields)
        # do stuff with fields
    end
end

testwrite(testfile)
testread(testfile)
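
The same idea can be written against any IO instead of a file name, so that it can also consume a decompression stream. This is only a minimal sketch, not a replacement for a real CSV parser: the names `COLTYPES`, `parserow`, and `summarize` are made up for illustration, the column types are the ones from the question, and it assumes no quoted fields or embedded delimiters.

```julia
using Dates

# Column types from the question; assumed fixed and known in advance.
const COLTYPES = (Date, UInt64, UInt64, UInt64, Float32, Float32, Float32, Float32)

# Parse one delimited line into a tuple of typed values.
# Assumes no quoting and no delimiter inside a field.
function parserow(line::AbstractString; delim=',')
    strings = split(line, delim)
    length(strings) == length(COLTYPES) ||
        error("expected $(length(COLTYPES)) fields, got $(length(strings))")
    return ntuple(i -> parse(COLTYPES[i], strip(strings[i])), length(COLTYPES))
end

# Count rows and sum the sales column (7th) from any IO, one line at a time,
# so only the current line is held in memory.
function summarize(io::IO)
    nrows = 0
    total_sales = 0.0
    for line in eachline(io)
        isempty(strip(line)) && continue  # skip blank lines
        row = parserow(line)
        nrows += 1
        total_sales += row[7]
    end
    return (nrows = nrows, total_sales = total_sales)
end

# Small in-memory demo using the two sample rows from above.
demo = IOBuffer("""
2019-08-23, 5, 123, 17, 13.5, 1200.34, 1500.80, 22.30
2019-08-22, 4, 122, 16, 12.5, 1100.34, 1400.80, 21.30
""")
result = summarize(demo)
println(result)
```

For the gzip file in the question, the stream could presumably be passed in the same way the original code opens it, that is `open(GzipDecompressorStream, src) do stream; summarize(stream); end` with CodecZlib loaded. I have not tested that on a 50 GB file.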

See this https://github.com/JuliaData/CSV.jl/issues/482

Perhaps setting reusebuffer=true will be helpful
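
For reference, a minimal sketch of how that keyword would be passed, using a tiny in-memory stand-in with made-up values rather than the real file. The column names here are just a shortened, hypothetical subset of the ones in the question. Note that with `reusebuffer=true` a row is only valid until the next iteration, so anything you need has to be extracted inside the loop.

```julia
using CSV

# Tiny in-memory stand-in for the real file (made-up values).
data = IOBuffer("2019-08-23,5,13.5\n2019-08-22,4,12.5\n")

# reusebuffer=true asks CSV.Rows to reuse a single row buffer while iterating,
# so do not collect the rows or keep references to earlier ones.
rows = CSV.Rows(data; header=[:day, :gtin, :sales], reusebuffer=true)

# Count rows by iterating once, touching each row only while it is current.
function countrows(rows)
    n = 0
    for row in rows
        n += 1
    end
    return n
end

nrows = countrows(rows)
println(nrows)
```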

Setting reusebuffer=true didn’t change the result.

Thanks. I can try this, but I will probably use another solution (the idea is to teach others the same tool for this type of task, and being very easy to use is a plus).

I had hoped this task would be as easy as using a library. I was really considering Julia as an ETL tool, because it looked easy to read, process, and write streaming data with high performance. I suppose the ecosystem is not ready yet.

Yeah. JuliaDB is not ready for general use, IMHO. I think I want to try my hand at a Julia disk.frame-like or dask-like package at some point.