Multi-thread parsing json

Hi, I am loading lots of gzipped jsons, parsing and further processing them.
TL;DR: Why is map much slower than list comprehension/dot notation when using in tmap on multiple threads?

I am trying to use multiple threads for it, using library https://github.com/baggepinnen/ThreadTools.jl to make the processing faster, but I am seeing weird performance inconsistency and I just don’t get why this happens.
Note: I am using external process to unzip the json because it is faster than unzipping using https://github.com/bicycle1885/TranscodingStreams.jl

I am doing this benchmark on Julia 1.3.0, Windows 10, CPU Core 7i-9850H.
I am benchmarking both only loading the zipped json to string and loading zipped json+parsing it

using ThreadTools, BenchmarkTools
function load_shas(shas)
	[read(`7z e -so $(jsondir(sha)).json.gz`, String) for sha in shas]
end

function load_parse_shas(shas)
    [JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String)) for sha in shas]
end

load_parse_sha(sha) = JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String))

println("foreach read")
@btime foreach(sha->read(`7z e -so $(jsondir(sha)).json.gz`, String), shas[1:1000])
# 16.029 s (1086474 allocations: 2.37 GiB)
println("map read")
@btime map(sha->read(`7z e -so $(jsondir(sha)).json.gz`, String), shas[1:1000])
# 15.986 s (1084171 allocations: 2.37 GiB)
println("tmap function read")
@btime collect(Iterators.flatten(tmap(load_shas, Iterators.partition(shas[1:1000], 50))))
# 5.872 s (876144 allocations: 2.35 GiB)
println("tmap map read")
@btime collect(Iterators.flatten(tmap(x->map(sha->read(`7z e -so $(jsondir(sha)).json.gz`, String), x), Iterators.partition(shas[1:1000], 50))))
# 6.361 s (874188 allocations: 2.31 GiB)
println("tmap list comprehensions read")
@btime collect(Iterators.flatten(tmap(x->[read(`7z e -so $(jsondir(sha)).json.gz`, String) for sha in x], Iterators.partition(shas[1:1000], 50))))
# 6.289 s (877305 allocations: 2.37 GiB)

println("foreach read parse")
@btime foreach(sha->JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String)), shas[1:1000])
# 22.340 s (76891104 allocations: 7.38 GiB)
println("map read parse")
@btime map(sha->JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String)), shas[1:1000])
# 42.694 s (76889028 allocations: 7.37 GiB)
println("tmap function read parse")
@btime collect(Iterators.flatten(tmap(load_parse_shas, Iterators.partition(shas[1:1000], 50))))
# 19.709 s (76697131 allocations: 7.34 GiB)
println("tmap map read parse")
@btime collect(Iterators.flatten(tmap(x->map(sha->JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String)), x), Iterators.partition(shas[1:1000], 50))))
# 19.942 s (76695967 allocations: 7.31 GiB)
println("tmap list comprehensions read parse")
@btime collect(Iterators.flatten(tmap(x->[JSON.parse(read(`7z e -so $(jsondir(sha)).json.gz`, String)) for sha in x], Iterators.partition(shas[1:1000], 50))))
# 22.213 s (76689232 allocations: 7.29 GiB)
println("tmap function read_parse dot")
@btime collect(Iterators.flatten(tmap(x->load_parse_sha.(x), Iterators.partition(shas[1:1000], 50))))
# 22.235 s (76688233 allocations: 7.33 GiB)

and I can not get my head around this:
using map in tmap takes ~19 sec., and the list comprehension and dot notation take 22 seconds, which is slower, although it should do same thing.

I am using partition to lists of size 50 because I am then calculating some aggregated statistics and them merging them for those blocks.

EDIT: my bad, because of typo I used different variable sometimes, after rerunning the code again, correctly, the results are bit different than in original post.

1 Like

Have you tried @code_warntype? Perhaps the compiler is correctly inferring types for some versions and not others.

Have you also tested equality of all the outputs you’re generating?

Also looks like shas is a global? You should substitute that in with $ or (sometimes better) make it a function argument.

Check out this for more: https://docs.julialang.org/en/v1/manual/performance-tips/

This shouldn’t be an issue here.

Since you are just doing IO, why not use asyncmap?

2 Likes

You’ll probably get a reasonable speedup by using JSON3 rather than JSON as well.

2 Likes

There are many places this expression is used. But I don’t see the variable s defined. I think you have it in the global scope in your REPL session. Maybe the examples with this bug is faster as the OS keeps the file in cache or something.

Yeah, that was my bad, I forgot that variable, so it was loading same json again and again.
After fixing this and running all those benchmarks again (now the post contains times after rerunning it correctly), I obtained much smaller time gap between different ways of iteration, but the gap of 3 seconds is still noticable.

I tried the JSON3, but since I am always crunching the whole json file, the partial lazy loading shown no speedup at all on multiple threads when compared to JSON.
And when I deployed the script on server, it caused segfault, looks like JSON3 does not handle well the thread based parallelization on 24 threads. :frowning:

What happened to your old benchmarks?

JSON3 is sometimes (often?) faster than JSON even when you’re not lazy loading, too, so that’s interesting.

@quinnj might be interested in how JSON3 failed for you when multithreading.

I realized that I was benchmarking only the loading+parsing and of course JSON3 was almost as fast as just loading without parsing, because all the lazy-loaded things didn’t actually load.
I’ll post here benchmarks with the processing of jsons once it will finish.

The whole segfault exception is here:

ERROR: LoadError: 
signal (11): Segmentation fault
in expression starting at none:0
__memmove_ssse3_back at /lib64/libc.so.6 (unknown line)
_ZN4llvm15RuntimeDyldImpl11emitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEb at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/bin/../lib/julia/libLLVM-6.0.so (unknown line)
_ZN4llvm15RuntimeDyldImpl17findOrEmitSectionERKNS_6object10ObjectFileERKNS1_10SectionRefEbRSt3mapIS5_jSt4lessIS5_ESaISt4pairIS6_jEEE at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/bin/../lib/julia/libLLVM-6.0.so (unknown line)
_ZN4llvm15RuntimeDyldImpl14loadObjectImplERKNS_6object10ObjectFileE at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/bin/../lib/julia/libLLVM-6.0.so (unknown line)
_ZN4llvm14RuntimeDyldELF10loadObjectERKNS_6object10ObjectFileE at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/bin/../lib/julia/libLLVM-6.0.so (unknown line)
_ZN4llvm11RuntimeDyld10loadObjectERKNS_6object10ObjectFileE at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/bin/../lib/julia/libLLVM-6.0.so (unknown line)
operator() at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:267 [inlined]
finalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:144 [inlined]
emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:343 [inlined]
emitAndFinalize at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:91 [inlined]
addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:512
jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:760 [inlined]
jl_finalize_function at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:768
getAddressForFunction at /buildworker/worker/package_linux64/build/src/codegen.cpp:1351
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/codegen.cpp:1445
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1899
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2135 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
showerror at ./task.jl:76
#661 at ./errorshow.jl:76
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
#with_output_color#711 at ./util.jl:365
with_output_color at ./util.jl:363
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
#showerror#660 at ./errorshow.jl:75
#showerror at ./none:0
unknown function (ip: 0x7effac178c01)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
#showerror#663 at ./errorshow.jl:85
#showerror at ./none:0
unknown function (ip: 0x7effac178991)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
show_exception_stack at ./errorshow.jl:649
display_error at ./client.jl:110
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
display_error at ./client.jl:112
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2136 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1631 [inlined]
jl_f__apply at /buildworker/worker/package_linux64/build/src/builtins.c:627
jl_f__apply_latest at /buildworker/worker/package_linux64/build/src/builtins.c:665
#invokelatest#1 at ./essentials.jl:709 [inlined]
invokelatest at ./essentials.jl:708 [inlined]
exec_options at ./client.jl:297
_start at ./client.jl:468
jfptr__start_2083.clone_1 at /home/matej.racinsky/projects/SkunkAV/julia-1.3.0-rc4/lib/julia/sys.so (unknown line)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2130 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2300
unknown function (ip: 0x401931)
unknown function (ip: 0x401533)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4015d4)
Allocations: 329204263 (Pool: 329004740; Big: 199523); GC: 413
Segmentation fault (core dumped)

and the code was

using ThreadTools, JSON3
function shas2schema(x)
    JsonGrinder.schema([tweakjson!(load_parse3_sha(s)) for s in x])
end
load_parse3_sha(sha) = JSON3.read(read(`gunzip -c $(jsondir(sha)).json.gz`, String))
schemas = tmap(shas2schema, Iterators.partition(shas, 10_000))

On small amount of data, the asyncmap performed better, thanks for the tip, but when I run it on server, it throws

IOError: pipe: too many open files (EMFILE)

Did you set the batch_size argument to asyncmap?

oh, I forgot to. From the documentation I didn’t get how exactly this related to the number of files opened at the same time.
Any suggestion to what the batch size should be given how many threads I have and how big list I’m sending to the asyncmap?

asyncmap creates 100 tasks by default unless you set ntasks. So “too many open files” sounds strange. But setting ntasks to something like Sys.CPU_THREADS may be better for performance anyway.

I was calling asyncmap in each of the 24 threads, so it was 2400 open files at a time. I’ll se the ntasks to the number of cores then.

If you could provide the full reproduction of the error, I’m happy to take a look (primary author of JSON3 here); I don’t think there’s been much testing of JSON3 in multi-threaded environments, but I also am not aware of anything that would cause problems; the code is pretty straightforward. The segfault unfortunately looks like something in the Julia compiler, which tends to not be the fault of anything the package is doing per-se. But again, only with full reproduction code could we take a look and get to the bottom of what’s going on.

Hi, sorry for long delay, I was not able to reproduce the segfault, so it was probably some black magic.
In the end, there turned out to be more bottle-necks in my data processing pipelines than this, so I live happily ever-after with the standard JSON library.