Hi Julia Community,
What I am trying to do is simple and generic: read a file (usually a very large one) line by line and process every line independently. Say, split every line of a delimited numeric table to construct a matrix. I found that most of the time is spent on processing rather than on IO, so I wonder whether there are ways to leverage Julia's parallel features to improve performance. Conceptually, one could use multiple cores for the splitting so that the IO resources are put to good use (a sketch of this idea is at the bottom of the post). The code below is one of the things I routinely do.
Input file: test.csv (it has 50000 lines); the first few lines look like this:
10,3017090,3017138,1
10,3017138,3017140,2
10,3017140,3017188,1
10,3083687,3083737,1
The goal is to parse it into an Array{Int64} of size (50000, 4).
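(For reference, the same matrix can be built in one call with readdlm, which in recent Julia versions lives in the DelimitedFiles stdlib. I mention it only as a sanity check on the target shape; my question is about the per-line approach.)

using DelimitedFiles
m = readdlm("test.csv", ',', Int64)   # a 50000-by-4 Matrix{Int64}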
As an indicator of pure IO performance, I count the lines and time it:
function count_file(fn)
    num_line = 0
    open(fn) do f
        for line in eachline(f)
            num_line += 1
        end
    end
    return num_line
end
It takes roughly 3.5ms to iterate the whole file.
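For the record, this is how I measure the timings (a minimal sketch assuming the BenchmarkTools package; Base's @time works too, but its first call includes compilation time):

using BenchmarkTools
@btime count_file("test.csv")   # roughly 3.5 ms on my machine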
When I try to actually do something with every line, namely parse it into 4 numbers, things slow down considerably. First, define a function to split and parse a single line:
function spliter(line::String, delim::Char)
    # split the line on the delimiter and parse every field as an Int64
    return map(x -> parse(Int64, x), split(line, delim))
end
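For example:

spliter("10,3017090,3017138,1", ',')   # returns [10, 3017090, 3017138, 1]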
Then read the file and process each line:
function single_spliter(fn)
    result = zeros(Int64, 50000, 4)   # row count hardcoded for this particular file
    index = 1
    open(fn) do f
        for line in eachline(f)
            result[index, :] = spliter(line, ',')
            index += 1
        end
    end
    return result
end
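As an aside, a variant that avoids hardcoding the row count, using Base's countlines at the cost of a second pass over the file, could look like this sketch (single_spliter2 is just a name I made up here):

function single_spliter2(fn)
    n = countlines(fn)                # first pass: count rows to size the matrix
    result = zeros(Int64, n, 4)
    open(fn) do f
        for (i, line) in enumerate(eachline(f))
            result[i, :] = spliter(line, ',')
        end
    end
    return result
end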
Any ideas? Thanks
Note: simply splitting, without spending any time on array access, also takes about 35 ms, 10x longer than just reading through the file (count_file above):
function split_only(fn)
    open(fn) do f
        for line in eachline(f)
            split(line, ',')   # split only; the result is discarded
        end
    end
end
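Finally, here is the sketch of the multi-core idea mentioned at the top. It assumes Julia was started with multiple threads (e.g. JULIA_NUM_THREADS=4); the IO stays serial, and since every iteration writes to a distinct row, no locking should be needed. I have not verified that this actually helps, which is part of what I am asking:

function threaded_spliter(fn)
    lines = readlines(fn)                 # serial IO: read all lines into memory first
    result = zeros(Int64, length(lines), 4)
    Threads.@threads for i in eachindex(lines)
        result[i, :] = spliter(lines[i], ',')   # each task writes only its own rows
    end
    return result
end

Is something along these lines the recommended pattern, or is there a better approach (e.g. reading and parsing the file in chunks)?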