Parallel Processing File

Donut_Meepo · August 27, 2018, 11:28pm

Hi Julia Community,

What I am trying to do is simple and generic: read a file (usually a very large one) line by line and process every line independently. Let's say, split every line of a tab-delimited numeric table to construct a matrix. I found that most of the time is spent on processing rather than IO, so I wonder whether there are ways to use Julia's parallel features to improve performance. Conceptually, one might use multiple cores for the splitting so that the IO resources are put to good use. The following code is one of the things I routinely do.

input file: test.csv (It has 50000 lines)
10,3017090,3017138,1
10,3017138,3017140,2
10,3017140,3017188,1
10,3083687,3083737,1

Parse it into an Array{Int64} of size (50000,4)

I count the lines and time it as an indicator of IO performance

function count_file(fn)
    num_line = 0
    open(fn) do f
        for line in eachline(f)
            num_line += 1
        end
    end
    return num_line
end

It takes roughly 3.5ms to iterate the whole file.
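
A minimal sketch of how such a timing could be taken with Base's @elapsed. The temporary file and its synthetic rows are assumptions made only to keep the example self-contained; count_file is repeated from above, and the first call is a warm-up so that compilation time is not part of the measurement.

```julia
# count_file repeated from the post so the sketch runs on its own.
function count_file(fn)
    num_line = 0
    open(fn) do f
        for line in eachline(f)
            num_line += 1
        end
    end
    return num_line
end

# Assumption: a synthetic 50000-line file standing in for test.csv.
path, io = mktemp()
for i in 1:50_000
    println(io, "10,", 3017090 + i, ",", 3017138 + i, ",1")
end
close(io)

count_file(path)                     # warm-up call (triggers compilation)
elapsed = @elapsed count_file(path)  # seconds for one pass over the file
println("lines: ", count_file(path), ", time: ", round(elapsed * 1000, digits = 2), " ms")
```

The absolute number depends on the machine and on whether the file is already in the OS cache, so it is only useful for comparing the variants against each other.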

Next, I do some work on every line: parsing it into 4 numbers.

Define a function to split and parse each line.

function spliter(line::String, delim::Char)
    return map(x->parse(Int64,x), split(line, delim))
end

Read the file and process each line

function single_spliter(fn)
    result = zeros(Int64, 50000,4)
    index = 1
    open(fn) do f
        for line in eachline(f)
            result[index, :] = spliter(line,',')
            index+=1
        end
    end
    return result
end

Any ideas? Thanks

Note: simply splitting, without spending time on array access, also takes 35ms, 10X longer than just reading through the file (count_file)

function split_only(fn)
    open(fn) do f
        for line in eachline(f)
            split(line,',')
        end
    end
end

Juser · August 27, 2018, 11:47pm

Splitting requires Julia to actually read the string, look for commas, and then create the corresponding array of substrings. This is much more work than simply adding 1 to an Int64.

Since you already know the size of your output, it’s probably safe to use a threaded loop along the lines of: Threads.@threads for (index, line) in enumerate(eachline(f)).
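
A minimal sketch of a threaded version, under a few assumptions. Threads.@threads needs a collection it can index into, so this sketch first reads all lines with readlines and then loops over a plain index range. The names parse_row and threaded_spliter and the small temporary file are hypothetical, introduced only for illustration. Julia has to be started with more than one thread (for example via the JULIA_NUM_THREADS environment variable) for the loop to actually run in parallel.

```julia
# Hypothetical helper: parse one delimited line into a vector of Int64.
parse_row(line, delim) = parse.(Int64, split(line, delim))

# Hypothetical threaded variant of single_spliter.
function threaded_spliter(fn; ncols = 4, delim = ',')
    lines = readlines(fn)                        # sequential IO, gives a known length
    result = zeros(Int64, length(lines), ncols)  # preallocated output
    Threads.@threads for i in 1:length(lines)
        # Each iteration writes only to row i, so threads do not overlap.
        result[i, :] = parse_row(lines[i], delim)
    end
    return result
end

# Assumption: a tiny temporary file standing in for test.csv.
path, io = mktemp()
write(io, "10,3017090,3017138,1\n10,3017138,3017140,2\n",
          "10,3017140,3017188,1\n10,3083687,3083737,1\n")
close(io)

m = threaded_spliter(path)
println(size(m), " parsed with ", Threads.nthreads(), " thread(s)")
```

Holding all lines in memory is the price for knowing the length up front; for files that do not fit in memory, the same loop could be applied to one block of lines at a time.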

Donut_Meepo · August 28, 2018, 2:13am

Since I don’t have experience with threads, I am trying to write a distributed version of what you suggested. But I could not get it to work.

using Distributed, SharedArrays

result = SharedArray{Int64,2}((50000,4))
function parallel_spliter!(fn, result::SharedArray)
    open(fn) do f
        @distributed for (index,line) in enumerate(eachline(f))
            result[index, :] = spliter(line,',')
        end
    end
end

I think maybe eachline() has to work sequentially, while @distributed does not iterate in order. I am not quite sure.
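
One thing that can be checked directly is whether the iterator reports a length at all. The following sketch uses an in-memory IOBuffer (an assumption, standing in for the file) and Base.IteratorSize to show that enumerate(eachline(...)) has no known size, while collecting the lines first yields an ordinary vector with a length.

```julia
# Assumption: an in-memory buffer standing in for the opened file.
io = IOBuffer("10,3017090,3017138,1\n10,3017138,3017140,2\n")

itr = enumerate(eachline(io))
size_trait = Base.IteratorSize(itr)   # Base.SizeUnknown(): no length is known up front
println(size_trait)

# Collecting consumes the stream once and gives a vector with a known length.
lines = collect(eachline(io))
println(length(lines))                # 2
```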

Juser · August 29, 2018, 3:25am

I suspect that this is because eachline(f) doesn’t know the length of the file ex-ante. To do it all in one loop requires a pmap or tmap style parallelism, in which the workers come back after each line and check whether there is a new line. Unfortunately, I don’t really use either, so I am not sure about the appropriate syntax. @distributed and Threads.@threads want to know the length of the task ex-ante so they can split the whole job up beforehand (which is most efficient if the number of tasks is known and the length of each task is independent of the order of the tasks).
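
A minimal sketch of the pmap style with the Distributed standard library, under stated assumptions. The names parse_row and pmap_spliter, the two local workers, the batch size, and the temporary file are all hypothetical choices for illustration. For simplicity the lines are read up front with readlines; pmap then hands batches of lines to whichever worker is free.

```julia
using Distributed
addprocs(2)   # assumption: two local worker processes

# The helper must be defined on every process, hence @everywhere.
@everywhere parse_row(line) = parse.(Int64, split(line, ','))

# Hypothetical pmap-based variant of single_spliter.
function pmap_spliter(fn; batch_size = 1_000)
    lines = readlines(fn)
    # batch_size groups many lines per message to limit communication overhead.
    rows = pmap(parse_row, lines; batch_size = batch_size)
    # rows is a vector of 4-element vectors; stack them into an N x 4 matrix.
    return permutedims(reduce(hcat, rows))
end

# Assumption: a tiny temporary file standing in for test.csv.
path, io = mktemp()
write(io, "10,3017090,3017138,1\n10,3017138,3017140,2\n",
          "10,3017140,3017188,1\n10,3083687,3083737,1\n")
close(io)

m = pmap_spliter(path)
println(size(m))
```

Each line is cheap to parse, so sending lines between processes may well cost more than the parsing itself; whether this beats the single-process version would need to be measured on the real file.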
