Somewhat faster text/numeric io


#1

I recently had to read an 80GB file of text-formatted numbers on a machine with 2TB of RAM. So I just loaded the full text into memory (about 600s), and then had to convert it into actual numbers. In my case, it was a mix of ints and floats. But all the ints were small, so they would fit exactly in a float with no truncation.
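To make the "fits exactly in a float" point concrete: a Float64 has a 53-bit significand, so every integer with magnitude up to 2^53 converts without loss. Here is a minimal check. The values are illustrations only, not taken from the file discussed here.

```julia
# Every integer with magnitude up to 2^53 has an exact Float64 representation.
limit = Int64(2)^53                      # 9007199254740992

# Round-tripping a small integer through Float64 loses nothing.
exact_small = Int64(Float64(123456789)) == 123456789
exact_limit = Int64(Float64(limit)) == limit

# One past the limit is not representable, so the conversion rounds.
# Julia compares Int64 and Float64 values exactly, so this is true.
rounded = Float64(limit + 1) != limit + 1
```

So parsing everything as Float64 and converting afterwards is safe as long as the integers stay well below 2^53.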

Consequently, I wrote the following reader to directly call Julia’s float parser. This bypasses all sorts of overhead with strings.

In my quick tests, it’s about 1.5-2x faster than any other way I’ve found to read numbers-in-text in Julia. It’s even faster than a C++ implementation using fscanf("%ld")!

Notes

  • I have not tested this extensively. It works for me at the moment, but there could be some off-by-one errors in reloading the buffers. Do not use this for production code unless you verify it more extensively!!
  • At the moment, it assumes one space between numbers. (This should be easy to remedy with the next_non_space function, but I haven’t done it yet.)
  • The second buffer (buf2) exists only so that the tail of the current buffer can be refilled with new bytes without creating a view on it. There could be other ways of doing this.
  • In case you are wondering, I didn’t end up using this one; I simply wrote it while the other code was running. CSV.jl took a little over an hour to read the 80GB file, once it was loaded into memory. This code took about 25 minutes.
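
As a rough sketch of how the single-space assumption and the second buffer might both be avoided, the reader below skips any run of whitespace and keeps a token that was cut off at a chunk boundary in a small carry vector. This is not the code from this post: read_floats_chunked, is_ws, and carry are hypothetical names, it assumes Julia 1.x, and it parses through String, so it gives up the speed of calling the float parser directly on the bytes.

```julia
# Hypothetical sketch (Julia 1.x assumed): chunked reading with a carry
# vector instead of a second buffer. Tolerates runs of spaces/tabs/newlines.
is_ws(b::UInt8) = b == UInt8(' ') || b == UInt8('\n') || b == UInt8('\t')

function read_floats_chunked(io::IO; maxbuf::Int = 1024)
    out = Float64[]
    buf = Vector{UInt8}(undef, maxbuf)
    carry = UInt8[]                       # bytes of a token cut off by a chunk boundary
    while !eof(io)
        nb = readbytes!(io, buf, maxbuf)  # number of bytes actually read
        start = 0                         # first byte of the current token, 0 if none
        for i in 1:nb
            if is_ws(buf[i])
                if start > 0 || !isempty(carry)
                    # A token ends here; join it with any carried-over prefix.
                    start > 0 && append!(carry, view(buf, start:i-1))
                    push!(out, parse(Float64, String(copy(carry))))
                    empty!(carry)
                    start = 0
                end
            elseif start == 0
                start = i                 # a new token begins in this chunk
            end
        end
        # The chunk ended mid-token: keep the partial token for the next round.
        start > 0 && append!(carry, view(buf, start:nb))
    end
    # A final token without trailing whitespace is still pending in carry.
    isempty(carry) || push!(out, parse(Float64, String(copy(carry))))
    return out
end

# A tiny buffer forces tokens to straddle chunk boundaries.
vals = read_floats_chunked(IOBuffer("1.5 22\n333.25  4e1"); maxbuf = 4)
```

Routing every token through carry and String allocates a lot, so this only illustrates the bookkeeping. A faster version would parse complete tokens straight from buf and use carry only for the one token that straddles a boundary.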

I’m posting it here in case anyone sees any value in developing it further. I don’t have time for that right now, but it could probably be adapted to something designed for pure numbers-as-text reading, which would be so nice to have!

"""
Read a text file consisting entirely of floating point numbers separated
by single spaces or newlines.

I saw that this was about 1.5-2x faster than CSV.jl, which was the fastest
of the existing read methods I found.
"""


@inline mytryparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, len::Int64) = 
            ccall(:jl_try_substrtod, Nullable{Float64}, (Ptr{UInt8},Csize_t,Csize_t), s, pos, len)
            
@inline function myparse(::Type{Float64}, s::Vector{UInt8}, pos::Int64, last::Int64) 
    result = mytryparse(Float64, s, pos-1, last-pos+1)
    if isnull(result)
        # report only the offending token, not the whole buffer
        throw(ArgumentError("cannot parse $(repr(String(s[pos:last]))) as $Float64"))
    end
    return unsafe_get(result)
end    


@inline function myparse(::Type{Int32}, s::Vector{UInt8}, pos::Int64, last::Int64)
    val = myparse(Float64, s, pos, last)
    return convert(Int32, val)
end

@inline function next_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] == UInt8(' ') || a[i] == UInt8('\n') || a[i] == UInt8('\t')
            return i
        end
    end
    return -1
end

@inline function next_non_space(a::Vector{UInt8}, pos, len)
    @inbounds for i = pos:len
        if a[i] != UInt8(' ') && a[i] != UInt8('\n') && a[i] != UInt8('\t')
            return i
        end
    end
    return -1
end
    

function tseq(io::IO, a; maxbuf::Int=2^10)
    # maxbuf is the buffer size in bytes (the default 2^10 is 1 KiB, not 64k)
    buf = Vector{UInt8}(maxbuf) # main buffer that gets parsed
    buf2 = Vector{UInt8}(maxbuf) # scratch buffer used when refilling buf
    nb = readbytes!(io, buf)
    if nb == 0
        return a # just return 
    end
    
    curspace = next_non_space(buf, 1, nb)
    
    @inbounds while nb >= 0 && curspace >= 0
        nextspace = next_space(buf, curspace+1, nb)
        if nextspace >= 0
            #@show "parsing", String(buf[curspace:nextspace])
            push!(a, myparse(Float64, buf, curspace, nextspace))
        else
            # we didn't see a space, that means we need to read more buffer
            # move things to beginning
            #print("Refilling: ", replace(String(buf),"\n","@"), " ", curspace, " ", String(buf[curspace:nb]), "\n")
            copy!(buf, 1, buf, curspace, nb - curspace + 1)
            #print("     Move: ", replace(String(buf),"\n","@"), "\n")
            bufstart = nb - curspace + 2
            #buffree = maxbuf - bufstart + 1
            curspace = 1
            
            # Try reading
            #nread = readbytes!(io, @view(buf[bufstart:end]))
            nread = readbytes!(io, buf2, maxbuf-bufstart+1)
            copy!(buf, bufstart, buf2, 1, maxbuf-bufstart+1)
            
            #print("   Reload: ", replace(String(buf),"\n","@"), "\n")
            
            #@show nread            
            
            if nread == 0 
                # We couldn't read any more, that means we are at the end!
                #print("    Final: ", replace(String(buf),"\n","@"), "\n")
                #@show String(buf[curspace:bufstart-1])
                if bufstart > 2
                    # then there is stuff to process!
                    push!(a, myparse(Float64, buf, curspace, bufstart-1))
                    return a
                end
                
            else
                nb = bufstart + nread - 1
                #@show String(buf[curspace:nb])
                continue 
            end
                
        end
        curspace = nextspace
    end
    return a
end

function read_file_to_float64_array(filename::AbstractString)
	a = zeros(0)
	open(filename, "r") do fh
		return tseq(fh, a)
	end
end


#2

I’m going to announce a package soon based on these ideas. But for a preview, see

Here is some sample code.

using NumbersFromText
M = readmatrix("myfile.txt") # reads a matrix of data
M = readmatrix(Int, "myfile.txt") # reads a matrix of Ints
m = readarray("myfile.txt") # just reads a list of Float64s from myfile.txt
m = readarray(Int, "myfile.txt") # just reads a list of Ints from myfile.txt
m = readarray!("myfile.txt", rand(Int, 5)) # read Ints into an existing array
aint, afloat = readarrays("myfile.txt", Int, Float64) # reads alternating Ints and Floats
aint, afloat = readarrays!("myfile.txt", rand(Int,5), rand(Float64,5)) # read into existing arrays

Everything works with IO streams as well.

In my in-memory processing tests, this is about 2x faster than CSV.jl (which is the fastest I’ve seen otherwise).

I read about 32 million integers in about 2.7-2.9 seconds on my computer (so about 10M integers/sec). Note that reading from disk is still not the limit: this data is about 700MB, so we need about 200MB/sec, which isn’t hard from an SSD. (These tests were done quickly, so I apologize if I made a mistake.)
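
For anyone who wants to redo the arithmetic, here is a small check using only the figures quoted above (32 million integers, about 700MB, 2.7 to 2.9 seconds). The variable names are made up for illustration, and decimal megabytes are assumed.

```julia
# Throughput implied by the quoted figures (decimal MB assumed).
n_ints  = 32e6           # integers read
n_bytes = 700e6          # approximate file size in bytes
times   = (2.7, 2.9)     # measured range in seconds

ints_per_sec  = [n_ints / t for t in times]
bytes_per_sec = [n_bytes / t for t in times]

for (t, r, b) in zip(times, ints_per_sec, bytes_per_sec)
    println(t, " s: ", round(r / 1e6, digits = 1), " M ints/s, ",
            round(b / 1e6, digits = 0), " MB/s")
end
```

This works out to roughly 11 to 12 million integers per second and 240 to 260 MB/s, a little above the rounded figures in the text but still within what an SSD can sustain.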

I’m still hunting for bugs, so be warned.