Parallel Processing File

Hi Julia Community,

What I am trying to do is very simple and generic. Read a file (usually very large files) line by line and process every line independently. Lets say split every line of a tab-delim numeric table to construct a matrix. I found most of the time is spending on processing rather than IO. So I wonder are there ways to leverage the parallel feature to improve performance. Conceptually, one might use multiple cores to split so that it could make good use of the IO resources. Following codes are one of the things I routinely do.

input file: test.csv (It has 50000 lines)
10,3017090,3017138,1
10,3017138,3017140,2
10,3017140,3017188,1
10,3083687,3083737,1

Parse it into Array{Int64}(5000,4)

I count the lines and times it as an indicator of IO performance

function count_file(fn)
    num_line = 0
    open(fn) do f
        for line in eachline(f)
            num_line += 1
        end
    end
    return num_line
end


It takes roughly 3.5ms to iterate the whole file.

When I try to do something to every line, parsing it to 4 numbers

Define function for split and parse each line.

function spliter(line::String, delim::Char)
    return map(x->parse(Int64,x), split(line, delim))
end

Read file and process each line

function single_spliter(fn)
    result = zeros(Int64, 50000,4)
    index = 1
    open(fn) do f
        for line in eachline(f)
            result[index, :] = spliter(line,',')
            index+=1
        end
    end
    return result
end

Any ideas? Thanks

Note: simply splitting without spending time on array access also take 35ms, 10X longer than reading through (count_lines)

function split_only(fn)
    open(fn) do f
        for line in eachline(f)
            split(line,',')
        end
    end
end

Splitting requires Julia to actually read the array, look for commas, then create the appropriate array of substrings. This is much more intensive than simply adding 1 to an Int64.

Since you already know the size of your output, it’s probably safe to use a threaded loop like: Threads.@threads (line,index) for line in enumerate(eachline(f)).

Since I don’t have experiences on threads, I am trying to write an distributed version of what you suggested. But I could not get it to work.

result = SharedArray{Int64,2}((50000,4))
function parallel_spliter!(fn, result::SharedArray)
    open(fn) do f
        @distributed for (index,line) in enumerate(eachline(f))
            result[index, :] = spliter(line,',')
        end
    end
end

image

I think maybe eachline() has to work sequentially while @distributed does random iteration. I am not quit sure.

I suspect that this is because eachline(f) doesn’t know the length of the file ex-ante. To do it all in one loop requires a pmap or tmap style parallelism in which the workers come back after each line and check if there is a new line. Unfortunately, I don’t really use either, so I am not sure about the appropriate syntax. @distributed and Threads.@threads want to know the length of the task ex-ante to split the whole job up beforehand (which is most efficient if the number of tasks is known, the length of the tasks is independent of the order of the tasks).