CSV mmap error when parsing large file

I’m trying to read a large CSV file one row at a time.

I want to do something like:

using CSV
csvFile = CSV.File(infile)
for row in csvFile
    # do stuff with row
end

However, on Windows 10, Julia v1.1.1, I’m getting this error during CSV.File:

ERROR: could not create file mapping: The operation completed successfully.
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] #mmap#1(::Bool, ::Bool, ::Function, ::Mmap.Anonymous, ::Type{Array{UInt64,1}}, ::Tuple{Int64}, ::Int64) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Mmap\src\Mmap.jl:218
 [3] #mmap at .\none:0 [inlined]
 [4] #mmap#14 at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Mmap\src\Mmap.jl:251 [inlined]
 [5] mmap at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Mmap\src\Mmap.jl:251 [inlined]
 [6] file(::String, ::Int64, ::Bool, ::Int64, ::Nothing, ::Int64, ::Int64, ::Bool, ::Nothing, ::Bool, ::Array{String,1}, ::String, ::Nothing, ::Bool, ::Char, ::Nothing, ::Nothing, ::Char, ::Nothing, ::UInt8, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Dict{Int8,Int8}, ::Bool, ::Float64, ::Bool, ::Bool, ::Bool, ::Bool, ::Nothing) at \\W43237\C$\Users\plowman\.julia\packages\CSV\IwqOm\src\CSV.jl:278
 [7] #File#20 at \\W43237\C$\Users\plowman\.julia\packages\CSV\IwqOm\src\CSV.jl:158 [inlined]
 [8] Type at \\W43237\C$\Users\plowman\.julia\packages\CSV\IwqOm\src\CSV.jl:158 [inlined]
 [9] top-level scope at .\REPL[5]:6 [inlined]
 [10] top-level scope at .\none:0

This is interesting because:

  • The error message says the operation completed successfully.
  • On Windows, the CSV default is to not use mmap.

A little further investigation shows that the error occurs for large files only. The code below is successful for rows = 10^6, 10^7, 10^8, but fails for rows = 10^9.

using CSV, DataFrames

cols = 3
for rows in (10^6, 10^7, 10^8, 10^9)
    filename = string("test_", rows, "x", cols, ".csv")
    df = DataFrame(rand(rows, cols))
    println("Writing ", filename)
    CSV.write(filename, df)
    CSV.File(filename)
end

Yes, so here’s the rundown:

  • An issue has already been raised about this.
  • On Windows, the default is indeed use_mmap=false, but that only means CSV.jl won’t mmap the input file directly; it still mmaps an anonymous buffer and copies the file contents into it (see the sketch after this list). This is quite a bit faster than reading the entire file into a Vector{UInt8}.
  • As noted in the issue above, there are indeed problems on Windows with very large files (though with the 0.5.4 release the memory usage is split up and Windows copes better with that, which is why the referenced issue was closed).

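For illustration, here’s a rough sketch of that anonymous-buffer mechanism (this is not CSV.jl’s exact internals, and the helper name is made up for the example):

using Mmap

# Sketch only: mmap an anonymous byte buffer the size of the file,
# then copy the file contents into it, rather than mmapping the file itself.
function read_via_anonymous_mmap(path::AbstractString)
    buf = Mmap.mmap(Vector{UInt8}, filesize(path))  # anonymous mapping, no backing file
    open(path, "r") do io
        read!(io, buf)                              # fill the buffer from disk
    end
    return buf
end
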
I spent a night or two digging in and trying to find a better solution, but only came up with https://github.com/JuliaData/CSV.jl/pull/436, which definitely helps, but you can still run into scenarios that fail. We could try to allocate plain Vector{UInt64}s, but I also ran into situations where memory was still getting exhausted. My next attempt was going to avoid mmap on Windows and use something more like VirtualAlloc directly, hoping it behaves a little better (more like Unix) in these large-memory scenarios.

OK thanks for the explanation. (I apologize for not looking at GitHub issues first).

But since we’re here, can I just clarify whether the GitHub issue and your comment above apply to parsing a file row by row? The GitHub issue is about CSV.read, which presumably tries to read the whole file into memory rather than processing it line by line.

I was hoping there was a way to stream arbitrarily large files without running into memory limitations, and that the way to achieve this with CSV is by iterating the rows of CSV.File. Is that not the case?

In any case, I think I’ll try somewhat more primitive code customised to my use-case. Hopefully this will allow “streaming” without memory limitations.

Replacing:

using CSV
for row in CSV.File(filename)
    # do stuff with row
end

with

const column_types = ...
for line in eachline(filename)
    fields = parse.(column_types, split(line, ","))
    # do stuff with fields
end

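For concreteness, a minimal version of that idea might look like the following, assuming the three Float64 columns of the test files written above (the filename and parse calls are placeholders to adapt to the real data):

for line in Iterators.drop(eachline("test_1000000x3.csv"), 1)  # skip the header row
    fields = parse.(Float64, split(line, ','))                 # all columns assumed Float64
    # do stuff with fields
end
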
The API for CSV.File has evolved a bit in this respect (i.e. iterating rows). The problem is that without parsing the entire file, you can’t fully know the types of each column; so while there are more obvious ways to parse a CSV file row by row with minimal memory usage, there’s no way to do that with reliable type information. Previously, CSV.File did try to be more minimal when iterating rows vs. parsing entire columns, but that led to lots of bugs and essentially two separate large code paths that were supposed to produce the same results.

All of that is just to justify the current approach, which is much simpler implementation-wise and completely accurate: do a full pass over the file, produce the CSV.File object, and then allow easy iteration or column access (see the example below).

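For example, both access patterns work off that single pass (the filename and the x1/x2/x3 column names are assumed from the test files written earlier):

using CSV, DataFrames

f = CSV.File("test_1000000x3.csv")  # one full pass over the file

for row in f                        # row iteration with known column types
    row.x1                          # typed field access by name
end

df = f |> DataFrame                 # or materialize for column access
df.x1
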
Now, I do think there’s still API room here to come up with a good hybrid solution. Two ideas I’ve been noodling on are:

  • A CSV.View object: we’d do a very fast pass over the file to just parse the location of each cell, but disregard type information altogether. We’d then provide the same operations as CSV.File, but each column would just be Strings. It would be blazing-fast “parsing”, and the user could manually convert to a more specific type as needed.
  • A CSV.Rows object: we’d still do the initial file/delimiter detection of CSV.File, but, like CSV.View, ignore all type information. The CSV.Rows type would allow iterating the rows of a CSV file with a very minimal memory footprint, treating each cell as a String, and the user could manually specify types (sketched below).

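Usage of the second option might look something like this (names and signatures are just a sketch of the proposal, nothing final):

using CSV

# Hypothetical CSV.Rows: one lazy pass, every cell surfaced as a String,
# with the user converting only the columns they need.
for row in CSV.Rows("data.csv")
    x = parse(Float64, row.x1)  # manual conversion of a single column
    # do stuff with x
end
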
Neither of those two options would be very difficult, because we’d be able to re-use most of the existing functionality. If people are interested, I could dive in right away, since I’ve already been thinking about these ideas for a while. Would people be interested in exposing those kinds of access patterns to CSV files?

Very much interested! I’ve personally implemented something simple and similar using some methods from a previous version of CSV.jl: parsing everything to strings in file chunks in parallel, converting to specific types with vectorized operations in each chunk, and writing to a database (MongoDB). It proved to be very fast, with a low memory footprint; a rough sketch is below.

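Roughly, the shape of it was something like this (heavily simplified: all columns assumed Float64, per-cell parsing instead of truly vectorized conversion, and the database write replaced by a generic sink function):

using Base.Threads

# Read the file in chunks of lines, convert each chunk's cells to concrete
# types across threads, then hand the columns to a sink (e.g. a DB writer).
function process_in_chunks(path::AbstractString, sink; chunksize::Int = 100_000)
    chunk = String[]
    for line in Iterators.drop(eachline(path), 1)  # skip the header row
        push!(chunk, line)
        if length(chunk) == chunksize
            convert_chunk(chunk, sink)
            empty!(chunk)
        end
    end
    isempty(chunk) || convert_chunk(chunk, sink)
end

function convert_chunk(lines::Vector{String}, sink)
    fields = split.(lines, ',')
    ncols  = length(first(fields))
    cols   = [Vector{Float64}(undef, length(lines)) for _ in 1:ncols]
    @threads for i in eachindex(lines)             # type conversion in parallel
        for j in 1:ncols
            cols[j][i] = parse(Float64, fields[i][j])
        end
    end
    sink(cols)                                     # e.g. write the columns to the DB
end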

I think there are two mostly distinct use cases for reading CSV data:

  1. Small files, where you basically don’t care about the number of passes, and in exchange you get benefits like type information etc. “Small” is a relative term, of course, and on today’s computers even a 100GB file is “small”.

  2. Large files, most of the time compressed. In this case you really don’t want to do two passes, and the goal is usually to ingest the data into some sane format for further analysis. This involves some necessary trade-offs about type or even size information before reading the data.

I think that in practice it is fine if these two use cases are served by different libraries, and that CSV.jl is for the first one.

Yes, absolutely!

Yep, this seems to check a lot of boxes for me: chunking, parallel, rewriting, fast, low memory.