To use Julia for all the things, and to learn more about threads, I’m trying to parse a webserver log file using threads. In theory, each line is sent to a thread, a bunch of parsing functions run, the result is saved to a DataFrame, and the DataFrame is then written out to a Parquet file.
countlines(logfile) shows 247 million lines. Parts of the logfile contain binary data, so I try to handle that before calling a bunch of functions for data parsing. Julia is run with -t auto and correctly sees the 16 cores.
According to BenchmarkTools, the slowest part of the entire process is reading from the logfile (on a NAS over a 1 Gbps network). I thought that threads would make the I/O wait less relevant.
Here’s what didn’t work:
    Threads.@threads for rawline::String in eachline(logfile)
        line::String = try
            Unicode.normalize(rawline, :NFKC)
        catch
            return nothing
        end
        parser_function_one(line)
        parser_function_two(line)
    end
Obviously that didn’t work because @threads wants an indexable collection with a known length, not a lazy iterator like Base.EachLine. So, I tried this, UPDATED from original post:
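    using Unicode

    Threads.@threads for rawline::String in collect(eachline(logfile))
        line::String = try
            Unicode.normalize(rawline, :NFKC)
        catch
            # `continue` skips only this line; `return` would leave the
            # function that @threads wraps around the loop body and drop
            # the rest of that thread's chunk
            continue
        end
        parser_function_one(line)
        parser_function_two(line)
    end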
I now have DataFrame errors, so more spelunking on my end. I thought DataFrames.jl was thread-safe, but maybe not the way I’m using it.
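One pattern that sidesteps concurrent writes to a shared table is to let each task fill its own private vector of rows and merge them once all tasks have finished. This is only a sketch: parse_row, parse_chunks, and the host/status columns are hypothetical stand-ins for the real parser functions and schema, and it assumes the DataFrame errors come from several threads calling push! on the same DataFrame.

```julia
using Base.Threads: @spawn

# Hypothetical row type; the real columns depend on the log format.
const Row = NamedTuple{(:host, :status), Tuple{String, String}}

# Hypothetical parser: returns a Row, or nothing for lines it cannot parse.
function parse_row(line::AbstractString)
    parts = split(line)
    length(parts) == 2 || return nothing
    return (host = String(parts[1]), status = String(parts[2]))
end

function parse_chunks(lines::Vector{String}; nchunks::Int = Threads.nthreads())
    chunksize = max(1, cld(length(lines), nchunks))
    # One task per chunk; each task only touches its own `rows` vector.
    tasks = map(Iterators.partition(lines, chunksize)) do chunk
        @spawn begin
            rows = Row[]
            for line in chunk
                row = parse_row(line)
                row === nothing || push!(rows, row)
            end
            rows
        end
    end
    # Merge on the calling task, in chunk order, after every task is done.
    return reduce(vcat, fetch.(tasks); init = Row[])
end
```

With DataFrames.jl loaded, DataFrame(parse_chunks(lines)) should then build the table in a single step on one thread. If rows have to go into a shared DataFrame as they are produced, wrapping each push! in lock(lk) do ... end with a ReentrantLock is the other common option, at the cost of serializing the writes.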
Pointers? What’s a better way to do this?
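For comparison, here is a minimal sketch of a streaming layout that avoids collecting all 247 million lines up front: one task reads sequentially and puts batches of lines on a bounded Channel, while worker tasks take batches and parse them. The names process_log and parse_line, the batch size, and the summed return value are all assumptions made for illustration. Since the read itself is limited by the network link, this can at best overlap parsing with the I/O wait; it will not make the read any faster.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the real parsing functions.
parse_line(line::AbstractString) = length(line)

function process_log(io::IO; nworkers::Int = Threads.nthreads(), batchsize::Int = 10_000)
    # Bounded channel: the reader blocks when the workers fall behind,
    # so only a limited number of batches is in memory at any time.
    batches = Channel{Vector{String}}(2 * nworkers)

    # A single task reads the file sequentially and hands out batches.
    reader = @spawn begin
        try
            batch = String[]
            for line in eachline(io)
                push!(batch, line)
                if length(batch) == batchsize
                    put!(batches, batch)
                    batch = String[]
                end
            end
            isempty(batch) || put!(batches, batch)
        finally
            close(batches)  # lets the workers' loops terminate
        end
    end

    # Each worker pulls batches until the channel is closed and drained.
    workers = map(1:nworkers) do _
        @spawn begin
            total = 0
            for batch in batches
                for line in batch
                    total += parse_line(line)
                end
            end
            total
        end
    end

    wait(reader)
    return sum(fetch, workers)
end
```

A call would look like open(io -> process_log(io), logfile). Batching matters here because handing lines over one at a time would likely spend more time on channel synchronization than on parsing.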