Somewhat faster text/numeric io


I recently had to read an 80GB file of text-formatted numbers on a machine with 2TB of RAM. So I just loaded the full text into memory (about 600s) and then had to convert it into actual numbers. In my case, it was a mix of ints and floats, but all the ints were small, so they would fit exactly in a float with no truncation.
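The "fits exactly in a float" claim is easy to check. This is a minimal sketch, assuming the float type is Float64 and using only Base functions; the sample value `n` is made up for illustration.

```julia
# Float64 has a 53-bit significand, so every integer with magnitude
# up to 2^53 is represented exactly.
limit = maxintfloat(Float64)
@assert limit == 2.0^53

n = 123_456_789                      # hypothetical "small" integer
@assert Int(Float64(n)) == n         # exact round trip

# Past the limit, neighbouring integers collapse onto the same float.
@assert Float64(2^53 + 1) == Float64(2^53)

# Converting back to an integer type is checked: a non-integral value throws.
@assert convert(Int32, 3.0) == 3
caught = try
    convert(Int32, 3.5)
    false
catch err
    err isa InexactError
end
@assert caught
```

So parsing everything as Float64 and converting afterwards is lossless as long as the integers stay below 2^53 in magnitude.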

Consequently, I wrote the following reader to directly call Julia’s float parser. This bypasses all sorts of overhead with strings.

In my quick tests, it’s about 1.5-2x faster than any other way I’ve found to read numbers-in-text in Julia. It’s actually even faster than a C++ implementation with fscanf("%ld")!


  • I have not tested this extensively. It works for me at the moment, but there could be some off-by-one errors in reloading the buffers. Do not use this for production code unless you verify it more extensively!!
  • At the moment, it assumes one space between numbers. (This should be easy to remedy with the next_non_space function, but I haven’t done it yet.)
  • In theory, the second buffer (buf2) is only there to avoid creating a view on the current buffer when reloading it with bytes. There could be other ways of doing this.
  • In case you are wondering, I didn’t end up using this one; I simply wrote it while the other code was running. CSV.jl took a little over an hour to read the 80GB file once it was loaded into memory. This code took about 25 minutes.
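The single-space limitation from the second bullet could be lifted by alternating two scans: skip a run of separators, then find the end of the token. Here is a minimal sketch on an in-memory byte buffer, assuming Julia 1.x. The names `isspacebyte` and `parse_tokens` are hypothetical, and Base's `parse` on a temporary `String` stands in for the direct float-parser call, so this shows the scanning logic and not the speed-up.

```julia
# Hypothetical helper: the same three separators the post's scanners test for.
isspacebyte(b::UInt8) = b == UInt8(' ') || b == UInt8('\n') || b == UInt8('\t')

function parse_tokens(buf::Vector{UInt8})
    out = Float64[]
    i, n = 1, length(buf)
    while i <= n
        # Skip any run of separators (one or many).
        while i <= n && isspacebyte(buf[i])
            i += 1
        end
        i > n && break
        # Find the end of the token that starts at i.
        j = i
        while j <= n && !isspacebyte(buf[j])
            j += 1
        end
        # Allocating stand-in for the in-place float parser.
        push!(out, parse(Float64, String(buf[i:j-1])))
        i = j
    end
    return out
end

parse_tokens(Vector{UInt8}("1  2.5\n\n-3\t4e2 "))  # [1.0, 2.5, -3.0, 400.0]
```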

I’m posting it here in case anyone sees any value in developing it further. I don’t have time for that right now, but it could probably be adapted to something designed for pure numbers-as-text reading, which would be so nice to have!

"""
Read a text file consisting entirely of floating point numbers separated
by single spaces or newlines.

I saw that this was about 1.5-2x faster than CSV.jl, which was the fastest
of the existing read methods I found.
"""

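# Note: Nullable, isnull, and unsafe_get are Julia 0.6-era APIs that are no longer in Base on Julia 1.x.
@inline mytryparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, len::Int64) =
    ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s, pos, len)

# Parse the bytes s[pos:last] as a Float64. `pos` is 1-based; the C routine takes a 0-based offset.
@inline function myparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, last::Int64)
    result = mytryparse(Float64, s, pos-1, last-pos+1)
    if isnull(result)
        # Report only the offending token, not the whole buffer
        throw(ArgumentError("cannot parse $(repr(String(s[pos:last]))) as $Float64"))
    end
    return unsafe_get(result)
end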

@inline function myparse(::Type{Int32}, s::Vector{UInt8}, pos::Int64, last::Int64)
    val = myparse(Float64, s, pos, last)
    return convert(Int32, val)
end

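# Index of the first space, newline, or tab in a[pos:len], or -1 if there is none.
@inline function next_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] == UInt8(' ') || a[i] == UInt8('\n') || a[i] == UInt8('\t')
            return i
        end
    end
    return -1
end

# Index of the first byte in a[pos:len] that is not a space, newline, or tab, or -1 if there is none.
@inline function next_non_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] != UInt8(' ') && a[i] != UInt8('\n') && a[i] != UInt8('\t')
            return i
        end
    end
    return -1
end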

function tseq(io::IO, a; maxbuf::Int=2^10)
    # maxbuf is the buffer size in bytes (2^10 = 1 KiB by default)
    buf = Vector{UInt8}(maxbuf)  # main parse buffer
    buf2 = Vector{UInt8}(maxbuf) # scratch buffer used when refilling buf
    nb = readbytes!(io, buf)
    if nb == 0
        return a # nothing to read
    end
    curspace = next_non_space(buf, 1, nb)
    @inbounds while nb >= 0 && curspace >= 0
        nextspace = next_space(buf, curspace+1, nb)
        if nextspace >= 0
            push!(a, myparse(Float64, buf, curspace, nextspace))
        else
            # we didn't see a space, that means we need to read more buffer
            # move the unparsed tail to the beginning
            copy!(buf, 1, buf, curspace, nb - curspace + 1)
            bufstart = nb - curspace + 2
            curspace = 1
            # Try reading into the free space after the tail
            #nread = readbytes!(io, @view(buf[bufstart:end]))
            nread = readbytes!(io, buf2, maxbuf-bufstart+1)
            copy!(buf, bufstart, buf2, 1, nread) # copy only the bytes actually read
            if nread == 0
                # We couldn't read any more, that means we are at the end!
                # (A token longer than maxbuf also ends up here.)
                if bufstart > 2
                    # then there is stuff to process!
                    push!(a, myparse(Float64, buf, curspace, bufstart-1))
                end
                return a
            end
            nb = bufstart + nread - 1
            continue # rescan the refilled buffer from position 1
        end
        curspace = nextspace
    end
    return a
end

function read_file_to_float64_array(filename::AbstractString)
    a = zeros(0)
    open(filename, "r") do fh
        return tseq(fh, a)
    end
end

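Since the refill step is where off-by-one errors are most likely, one way to gain confidence is to compare a chunked reader against a simple reference at several buffer sizes. The sketch below assumes Julia 1.x and is not the code above. The names `read_floats`, `isspacebyte`, and `chunk` are hypothetical, and it carries a cut-off token over with `vcat` instead of shuffling bytes in place, which is simpler but allocates more.

```julia
isspacebyte(b::UInt8) = b == UInt8(' ') || b == UInt8('\n') || b == UInt8('\t')

function read_floats(io::IO; chunk::Int=1024)
    out = Float64[]
    carry = UInt8[]                      # bytes of a token cut off at a chunk boundary
    while !eof(io)
        data = vcat(carry, read(io, chunk))   # read returns at most `chunk` bytes
        carry = UInt8[]
        n = length(data)
        i = 1
        while i <= n
            if isspacebyte(data[i])
                i += 1
                continue
            end
            j = i
            while j <= n && !isspacebyte(data[j])
                j += 1
            end
            if j > n && !eof(io)
                # The token touches the end of the chunk and more input exists:
                # it may continue, so defer it to the next round.
                carry = data[i:n]
                break
            end
            push!(out, parse(Float64, String(data[i:j-1])))
            i = j
        end
    end
    return out
end

# Cross-check against a straightforward reference for several chunk sizes,
# including sizes that split tokens in awkward places.
text = "1 2.5\n-3 4e2 123456.789"
expected = parse.(Float64, split(text))
for chunk in (1, 2, 3, 4, 7, 64)
    @assert read_floats(IOBuffer(text); chunk=chunk) == expected
end
```

The same kind of loop over buffer sizes could be pointed at `tseq` through its `maxbuf` keyword.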

I’m going to announce a package soon based on these ideas. But for a preview, see

Here is some sample code.

using NumbersFromText
M = readmatrix("myfile.txt") # reads a matrix of data
M = readmatrix(Int, "myfile.txt") # reads a matrix of Ints
m = readarray("myfile.txt") # just reads a list of Float64s from myfile.txt
m = readarray(Int, "myfile.txt") # just reads a list of Ints from myfile.txt
m = readarray!("myfile.txt", rand(Int, 5)) # read Ints into an existing array
aint, afloat = readarrays("myfile.txt", Int, Float64) # reads alternating Ints and Floats
aint, afloat = readarrays!("myfile.txt", rand(Int,5), rand(Float64,5)) # read into existing arrays

Everything works with IO streams as well.

In my in-memory processing tests, this is about 2x faster than CSV.jl (which is the fastest I’ve seen otherwise).

I can read about 32 million integers in about 2.7-2.9 seconds on my computer (so about 11M integers/sec). Note that reading from disk is still not the limit, as this data is about 700MB, so we need about 250MB/sec, which isn’t hard from an SSD. (These were done quickly, so I apologize if I made a mistake.)
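The rates follow directly from the quoted figures. Here is a quick check, treating 700MB as 700e6 bytes (an assumption) and using both ends of the timing range:

```julia
# Figures quoted in the post; the byte count is the assumed 700e6.
n_ints = 32e6
t_fast, t_slow = 2.7, 2.9            # seconds
nbytes = 700e6

ints_per_sec = (n_ints / t_slow, n_ints / t_fast)              # slowest, fastest run
mb_per_sec   = (nbytes / t_slow / 1e6, nbytes / t_fast / 1e6)

println("million integers/sec: ", round.(ints_per_sec ./ 1e6; digits=1))
println("MB/sec:               ", round.(mb_per_sec; digits=0))
```

That works out to roughly 11-12 million integers/sec and 240-260MB/sec.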

I’m still hunting for bugs, so be warned.