Dear All,
I would like to ask a question about the behavior of multi-threading. My use case is the following: I have 800 000 files on disk, each containing a compressed JSON, and I want to find those containing some predefined set of keys (I call this matching). This problem can be trivially parallelized as follows:
using Folds, JSON3, CodecLz4   # CodecLz4 provides LZ4FrameDecompressor

@elapsed Folds.map(files) do f
    # each task reads, decompresses, parses, and matches one file
    js = JSON3.read(IOBuffer(transcode(LZ4FrameDecompressor, read(f))), Dict{String,Any})
    match(rs, js)
end
This takes 603 s on a machine with 32 physical cores, using 32 threads. Note that match(rs, js) is the matching function, which matches the pattern rs against the parsed JSON js.
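For the record, a minimal stand-in for match, assuming it only checks that the predefined keys are present (the real matcher does term matching), could look like this (note that it shadows Base.match):

# hypothetical simplification: rs is the set of required keys, js the parsed JSON
match(rs::Set{String}, js::Dict{String,Any}) = all(k -> haskey(js, k), rs)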
What is surprising to me is that if I read the files sequentially in chunks of 5000 and match them in parallel, as follows,
@elapsed map(enumerate(Iterators.partition(files, 5000))) do (chunkid, chunk_files)
    # read the whole chunk sequentially ...
    buffers = map(read, chunk_files)
    # ... then decompress, parse, and match in parallel
    Folds.map(buffers) do buf
        js = JSON3.read(IOBuffer(transcode(LZ4FrameDecompressor, buf)), Dict{String,Any})
        match(rs, js)
    end
end
the execution takes 540 s, which is about 10% faster. Fine, maybe there are some delays caused by IO locks (though I thought that had been resolved by now).
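To separate raw IO from decompression and parsing, one could time the plain reads alone, both sequentially and from all threads (a sketch using Folds.foreach; these numbers are not included above):

# lower bound on the pure IO cost, no decompression or parsing
@elapsed foreach(f -> read(f), files)
# the same reads issued from all threads, to expose possible IO lock contention
@elapsed Folds.foreach(f -> read(f), files)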
But what is even more confusing is that if I concatenate all the compressed buffers into one large array and slice that array inside the threads (which makes the code ugly),
using DataFrames   # for the per-chunk index of offsets into the shared buffer

@elapsed map(enumerate(Iterators.partition(files, 5000))) do (chunkid, chunk_files)
    index = DataFrame(key=String[], chunk=Int[], offset=Int[], size=Int[])
    buffer = Vector{UInt8}()
    # read the chunk sequentially, appending all compressed bytes to one large
    # buffer and recording where each file starts and how long it is
    for f in chunk_files
        compressed = read(f)
        push!(index, (f, chunkid, length(buffer), length(compressed)))
        append!(buffer, compressed)
    end
    # decompress, parse, and match in parallel, slicing each file back out
    Folds.map(eachrow(index)) do row
        buf = buffer[row.offset+1:row.offset+row.size]
        js = JSON3.read(IOBuffer(transcode(LZ4FrameDecompressor, buf)), Dict{String,Any})
        match(rs, js)
    end
end
the execution time is 460 s. Wow. So doing more work is faster than doing less. What? I apologize that I cannot provide an MWE, because without those files it would not make much sense. I am primarily interested in whether there is some known phenomenon that I am not aware of.
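All the timings above come from @elapsed; a sketch of how the variants' allocation and GC behavior could be compared with Base's @timed (shown for the first variant) would be:

# @timed reports elapsed time together with allocated bytes and GC time,
# which might show whether the three variants differ in GC pressure
stats = @timed Folds.map(files) do f
    js = JSON3.read(IOBuffer(transcode(LZ4FrameDecompressor, read(f))), Dict{String,Any})
    match(rs, js)
end
(stats.time, stats.bytes, stats.gctime)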
Thanks in advance for any help.