I recently had to read an 80 GB file of text-formatted numbers on a machine with 2 TB of RAM. So I just loaded the full text into memory (about 600 s) and then had to convert it into actual numbers. In my case, it was a mix of ints and floats, but all the ints were small, so they fit exactly in a float with no truncation.
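To make the "fits exactly" point concrete: a Float64 has a 53-bit significand, so every integer with magnitude up to 2^53 is represented exactly. Here is a small check, written for current Julia 1.x (the helper name `fits_exactly` is only for illustration and is not part of the reader):

```julia
# Integers with magnitude up to maxintfloat(Float64) == 2^53 are exactly
# representable as Float64, so parsing them as floats loses nothing.
# `fits_exactly` is a hypothetical helper name used only for this illustration.
fits_exactly(n::Integer) = abs(n) <= maxintfloat(Float64)

x = parse(Float64, "123456")      # an integer token parsed as a float
n = Int32(x)                      # exact conversion back; throws InexactError if x is not a whole number

println(fits_exactly(123456))     # true
println(fits_exactly(2^53 + 1))   # false: smallest positive integer a Float64 cannot represent
```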
Consequently, I wrote the following reader to directly call Julia’s float parser. This bypasses all sorts of overhead with strings.
In my quick tests, it’s about 1.5-2x faster than any other way I’ve found to read numbers-in-text in Julia. It’s actually even faster than a C++ implementation with fscanf("%ld")!
- I have not tested this extensively. It works for me at the moment, but there could be some off-by-one errors in reloading the buffers. Do not use this for production code unless you verify it more extensively!!
- At the moment, it assumes one space between numbers. (This should be easy to remedy with the next_non_space function, but I haven’t done it yet.)
- In theory, the second buffer (buf2) is only there to avoid creating a view on the current buffer when reloading it with new bytes. There could be other ways of doing this.
- In case you are wondering, I didn’t end up using this one; I simply wrote it while the other codes were running. CSV.jl took a little over an hour to read the 80 GB file once it was loaded into memory. This code took about 25 minutes.
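The single-space assumption mentioned in the list above could be lifted by skipping a whole run of separators before each token. Here is a minimal standalone sketch for Julia 1.x. The names `is_ws` and `token_ranges` are hypothetical, and the sketch only finds token boundaries in a byte buffer; it does not parse numbers or refill buffers:

```julia
# Same separator set as the scanning helpers in this post: space, newline, tab.
is_ws(b::UInt8) = b == UInt8(' ') || b == UInt8('\n') || b == UInt8('\t')

# Return the index ranges of all whitespace-delimited tokens in buf[1:len].
# `token_ranges` is a hypothetical name used only for this illustration.
function token_ranges(buf::Vector{UInt8}, len::Int = length(buf))
    ranges = UnitRange{Int}[]
    i = 1
    while i <= len
        # Skip any run of separators (the job next_non_space would do).
        while i <= len && is_ws(buf[i])
            i += 1
        end
        i > len && break
        # Extend j to the last byte of the token (the job next_space does).
        j = i
        while j < len && !is_ws(buf[j + 1])
            j += 1
        end
        push!(ranges, i:j)
        i = j + 1
    end
    return ranges
end

buf = Vector{UInt8}("  1.5   2\n\n3 ")
tokens = [String(buf[r]) for r in token_ranges(buf)]
```

Each range could then be handed to the float parser as a (position, length) pair, so repeated spaces or blank lines never reach it.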
I’m posting it here in case anyone sees any value in developing it further. I don’t have time for that right now, but it could probably be adapted to something designed for pure numbers-as-text reading, which would be so nice to have!
Read a text file consisting entirely of floating point numbers separated
by single spaces or newlines.
I saw that this was about 1.5-2x faster than CSV.jl, which was the fastest
of the existing read methods I found.
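# Note: Nullable, isnull, and unsafe_get are Base APIs from Julia 0.6 and earlier.
# pos is a 0-based byte offset into s and len is the number of bytes to parse.
@inline mytryparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, len::Int64) =
    ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s, pos, len)

# pos and last are 1-based, inclusive byte indices into s.
@inline function myparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, last::Int64)
    result = mytryparse(Float64, s, pos-1, last-pos+1)
    if isnull(result)
        throw(ArgumentError("cannot parse $(repr(s)) as $Float64"))
    end
    return unsafe_get(result)
end

# Parse as Float64 first, then convert; convert throws if the value is not a whole number.
@inline function myparse(::Type{Int32}, s::Vector{UInt8}, pos::Int64, last::Int64)
    val = myparse(Float64, s, pos, last)
    return convert(Int32, val)
end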
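# Index of the first space, newline, or tab in a[pos:len], or -1 if there is none.
@inline function next_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] == UInt8(' ') || a[i] == UInt8('\n') || a[i] == UInt8('\t')
            return i
        end
    end
    return -1
end

# Index of the first byte in a[pos:len] that is not a space, newline, or tab, or -1.
@inline function next_non_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] != UInt8(' ') && a[i] != UInt8('\n') && a[i] != UInt8('\t')
            return i
        end
    end
    return -1
end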
function tseq(io::IO, a; maxbuf::Int=2^10)
    buf = Vector{UInt8}(maxbuf)  # main parse buffer (default 2^10 = 1024 bytes)
    buf2 = Vector{UInt8}(maxbuf) # scratch buffer used when refilling buf
    nb = readbytes!(io, buf)
    if nb == 0
        return a # empty input, nothing to parse
    end
    curspace = next_non_space(buf, 1, nb)
    @inbounds while nb >= 0 && curspace >= 0
        nextspace = next_space(buf, curspace+1, nb)
        if nextspace >= 0
            #@show "parsing", String(buf[curspace:nextspace])
            push!(a, myparse(Float64, buf, curspace, nextspace))
        else
            # we didn't see a space, that means we need to read more buffer
            # move the unparsed tail to the beginning
            #print("Refilling: ", replace(String(buf),"\n","@"), " ", curspace, " ", String(buf[curspace:nb]), "\n")
            copy!(buf, 1, buf, curspace, nb - curspace + 1)
            #print(" Move: ", replace(String(buf),"\n","@"), "\n")
            bufstart = nb - curspace + 2 # first free position after the tail
            #buffree = maxbuf - bufstart + 1
            curspace = 1
            # Try reading
            #nread = readbytes!(io, @view(buf[bufstart:end]))
            nread = readbytes!(io, buf2, maxbuf-bufstart+1)
            copy!(buf, bufstart, buf2, 1, maxbuf-bufstart+1)
            #print(" Reload: ", replace(String(buf),"\n","@"), "\n")
            #@show nread
            if nread == 0
                # We couldn't read any more, that means we are at the end!
                #print(" Final: ", replace(String(buf),"\n","@"), "\n")
                #@show String(buf[curspace:bufstart-1])
                if bufstart > 2
                    # then there is stuff to process!
                    push!(a, myparse(Float64, buf, curspace, bufstart-1))
                end
                return a
            end
            nb = bufstart + nread - 1
            #@show String(buf[curspace:nb])
            continue # rescan the refilled buffer starting from curspace = 1
        end
        curspace = nextspace
    end
    return a
end
function read_file_to_float64_array(filename::AbstractString)
    a = zeros(0)
    open(filename, "r") do fh
        return tseq(fh, a)
    end
end
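For anyone who wants to experiment with the buffer-refill idea on current Julia 1.x, here is a rough sketch of the same chunk-and-carry-over pattern. It is not the code above: it swaps the direct call into the C float parser for Base's documented `parse(Float64, ...)` on strings, so it keeps the string overhead this post tries to avoid and should be slower. The name `read_floats_chunked` and the `chunk` keyword are made up for this illustration.

```julia
# Read whitespace-separated floats from io in fixed-size chunks.
# A token cut off at a chunk boundary is carried over into the next chunk.
# `read_floats_chunked` and `chunk` are hypothetical names for this sketch.
function read_floats_chunked(io::IO; chunk::Int = 2^16)
    out = Float64[]
    carry = UInt8[]                          # bytes of a possibly incomplete token
    while !eof(io)
        data = vcat(carry, read(io, chunk))  # read(io, n) returns at most n bytes
        # Everything after the last whitespace byte may be an incomplete token.
        cut = findlast(b -> isspace(Char(b)), data)
        if cut === nothing
            carry = data                     # no separator yet, keep accumulating
            continue
        end
        for tok in split(String(data[1:cut]))  # split() drops empty tokens
            push!(out, parse(Float64, tok))
        end
        carry = data[cut+1:end]
    end
    # Whatever is left after the last separator is the final token.
    isempty(carry) || push!(out, parse(Float64, String(carry)))
    return out
end

vals = read_floats_chunked(IOBuffer("1 2.5\n-3e2\t4\n"); chunk = 4)
```

The tiny `chunk = 4` in the usage line is only there to force tokens to straddle chunk boundaries; for a real file, something like `open(read_floats_chunked, filename)` with the default chunk size would be the natural call.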