To use Julia for all the things, and to learn more about threads, I’m trying to parse a webserver log file using threads. In theory, each line is sent to a thread, a bunch of parsing functions run, the result is saved to a DataFrame, and the DataFrame is then written out to a Parquet file. countlines(logfile) shows 247 million lines. Parts of the logfile contain binary data, so I try to handle that before calling the parsing functions. Julia is run with -t auto and correctly sees the 16 cores.
According to BenchmarkTools, the slowest part of the entire process is reading from the logfile (on a NAS over a 1 Gbps network). I thought that threads would make the I/O wait less relevant.
Here’s what didn’t work:
    Threads.@threads for rawline::String in eachline(logfile)
        line::String = try
            Unicode.normalize(rawline, :NFKC)
        catch
            return nothing
        end
        parser_function_one(line)
        parser_function_two(line)
    end
Obviously that didn’t work, because @threads needs an indexable collection with a known length so it can split the work up front, not a lazy iterator like Base.EachLine. So, I tried this, UPDATED from original post:
    Threads.@threads for rawline::String in collect(eachline(logfile))
        line::String = try
            Unicode.normalize(rawline, :NFKC)
        catch
            return nothing
        end
        parser_function_one(line)
        parser_function_two(line)
    end
I now get DataFrame errors, so there is more spelunking to do on my end. I thought DataFrames.jl was thread-safe, but maybe not the way I’m using it.
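For reference, here is a minimal sketch of a pattern that might sidestep both problems: one task reads lines into a bounded Channel, so nothing like collect has to hold all 247 million lines, and each worker task pushes into its own vector, so no shared container is mutated from several threads. The names parse_log, parse_line, and Row are hypothetical stand-ins, since the real parser functions aren't shown here. It also assumes the binary junk shows up as invalid UTF-8, so it uses isvalid in place of the try/catch above. I haven't run this against the real log.

```julia
using Base.Threads: @spawn, nthreads
using Unicode

# Hypothetical row type and parser; the real parser_function_one/two are not shown.
const Row = @NamedTuple{nchars::Int, first_field::String}
parse_line(line::AbstractString)::Row =
    (nchars = length(line), first_field = String(first(split(line, ' '; limit = 2))))

function parse_log(io::IO; ntasks::Int = nthreads())
    lines = Channel{String}(1024)  # bounded buffer between the reader and the workers

    # A single task does all the I/O and closes the channel when it is done.
    reader = @spawn begin
        try
            for rawline in eachline(io)
                put!(lines, rawline)
            end
        finally
            close(lines)
        end
    end

    # Each worker owns its own result vector, so nothing is shared while parsing.
    workers = map(1:ntasks) do _
        @spawn begin
            rows = Row[]
            for rawline in lines              # ends once the channel is closed and drained
                isvalid(rawline) || continue  # skip lines that are not valid UTF-8
                line = Unicode.normalize(rawline, :NFKC)
                push!(rows, parse_line(line))
            end
            rows
        end
    end

    wait(reader)
    return reduce(vcat, fetch.(workers))  # combine once, on the calling task
end
```

Called as open(parse_log, logfile), this returns one vector of rows that could then be turned into a DataFrame in a single place. Row order is not preserved, so a line number would have to be carried along if order matters. If reading over the network really is the bottleneck, more parsing threads probably won't change the total time much either way.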
Pointers? What’s a better way to do this?
Thanks!