HTTP.jl doesn't seem to be good at handling over 1k concurrent requests, in comparison to an alternative in Python?

I’m implementing a simple API server that is supposed to handle over 1k concurrent requests at a time.

I have code like the following:

using Sockets
using HTTP: HTTP, @register, Response, handle

const HEADERS = ["Content-Type" => "application/json"]

bench_start_handler(req) = nothing
hoge_handler(req, body) = ... # return JSON string
... # another handler body
bench_end_handler(req) = nothing

const router = HTTP.Router()
@register(router, "GET", "/bench_start", bench_start_handler)
@register(router, "POST", "/hoge", hoge_handler)
... # another handlers
@register(router, "GET", "/bench_end", bench_end_handler)

function handler(req)
    body = isempty(req.body) ? handle(router, req) : handle(router, req, String(req.body))
    return body === nothing ? Response(200) : Response(200, HEADERS; body = body)
end

# entry point
# -----------

function init_server(host::IPAddr = ip"0.0.0.0", port = 3000; async = true, verbose = true, kwargs...)
    if async
        server = Sockets.listen(host, port)
        @async HTTP.serve(handler, host, port; server = server, verbose = verbose, kwargs...)
        return server # supposed to be `close`d afterwards in an interactive session, etc
    else
        return HTTP.serve(handler, host, port; verbose = verbose, kwargs...)
    end
end

Where the handlers only do quite simple tasks, so I’m sure they can’t be the source of the performance problem.

Our benchmark starts with the /bench_start request as a notification, then over 1k requests at a time keep coming to the server, which handles them with various handlers (say, there are ~5 handlers). The server ends up handling approximately 300,000 requests, and finally the benchmark ends with the /bench_end request.

When I benchmarked this HTTP.jl server, the code itself works, but it turned out that this implementation is too slow in comparison to alternative implementations in other languages, namely bjoern and falcon in Python.
I can’t provide details of the benchmark since it isn’t public, but I would say HTTP.jl does seem to be slow at running the “handle request -> send response” loop concurrently, and so the benchmark result was >100 times worse than the alternative Python implementations.

My questions are:

  • Is HTTP.jl supposed to be good at handling concurrent requests in comparison to those HTTP server implementations in other languages?
  • Am I missing something? Does my code contain some mistakes?

Any help or insight is very much appreciated!


This is not an especially well-educated guess, but it may be that the library is making a new thread for each connection, which would be very slow if the connections aren’t open for long.

Is a “connection” each request? I don’t think each single “request” -> “response” task is supposed to take a long time (i.e. be open for long) in an HTTP server, no?
(I’m not familiar with the web stack, so please correct me if my wording doesn’t make sense…)

I seem to remember some reports on Slack that the new package server (which also uses HTTP.jl) has (had?) some problems lately? Maybe a similar issue? CC @StefanKarpinski.

It might be good to set up a benchmark suite and run some bisects. Are there standard HTTP server benchmarks?

First, are clients reporting that they couldn’t make a connection to the server? The default backlog is 511 when you call listen(); on Linux I think the max value you can use is around 4096, but you might be able to raise that. If you have more than “backlog” connections trying to open at the same time, the OS will have to drop the others.
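For reference, here is a minimal sketch (my own, not from the original post) of raising the backlog via the keyword argument on Sockets.listen; note the OS may still cap the effective value (e.g. at net.core.somaxconn on Linux):

```julia
using Sockets

# Request a larger accept queue than the default of 511.
# Binding to port 0 picks an ephemeral port, just for illustration.
server = Sockets.listen(Sockets.InetAddr(ip"127.0.0.1", 0); backlog = 4096)
close(server)
```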

Next, how much CPU do you see being used? I think HTTP.jl is only single-threaded, so you shouldn’t see much more than one core maxed out, plus a little extra for garbage collection. You might need to run N instances of Julia, one for each core on the machine, and put a proxy in front that round-robins connections to the various Julia instances.

Thinking about this some more you might be able to do something like:

using Sockets
using HTTP

function init_server(host::IPAddr = ip"0.0.0.0", port = 3000)
    servers = Task[]
    for i in 1:Threads.nthreads()
        # each server task listens on its own port, starting at port + 1
        push!(servers, Threads.@spawn HTTP.serve(handler, host, port + i))
    end
    return servers
end

This should start a task on each of the available threads, each listening on a different port starting at 3001. I’m not sure if there would be any funniness with the router and multiple threads, however. That keeps everything in one process space.

You would still need a reverse proxy (apache http/nginx) to load balance between the ports.

@aviatesk did you run with multiple threads? Something like export JULIA_NUM_THREADS=`nproc` && julia in Bash.

Edit: In Julia 1.5 you can just do julia -t auto
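As a quick sanity check (my addition, not from the thread), you can confirm the setting took effect from inside the session:

```julia
# Prints the number of threads Julia was started with;
# if this prints 1, the environment variable / flag didn't take effect.
println(Threads.nthreads())
```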


As far as I can tell the benchmarker doesn’t complain about anything like that. All the requests are processed correctly, but very slowly.

EDIT: server = Sockets.listen(host, port; backlog = whateverbigint) didn’t help.

I found the Julia process only uses 2% of CPU while benchmarking, so I guess there is something bad in my setup or in HTTP.jl.

I also think HTTP.jl is single-threaded, so I don’t think spawning servers in threads would help much.

It was unrelated to any HTTP stuff.


I think there are very few people doing web development in Julia, especially high-load web servers. I first tested HTTP servers in Julia around 2016; the results were not very promising, and as far as I can see little has changed. Here’s one benchmark using my favorite wrk2.

The code (straight from the README):

using HTTP

HTTP.serve() do request::HTTP.Request
    try
        return HTTP.Response("Hello")
    catch e
        return HTTP.Response(404, "Error: $e")
    end
end

Results (-c = number of connections, -d = duration of test, -t = number of wrk threads, -R = number of requests per second)

$ wrk2 -c1000 -d30s -t4 -R3000 http://127.0.0.1:8081
Running 30s test @ http://127.0.0.1:8081
  4 threads and 1000 connections
  Thread calibration: mean lat.: 72.000ms, rate sampling interval: 150ms
  Thread calibration: mean lat.: 103.433ms, rate sampling interval: 223ms
  Thread calibration: mean lat.: 85.888ms, rate sampling interval: 204ms
  Thread calibration: mean lat.: 81.967ms, rate sampling interval: 167ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    51.95ms   22.45ms 110.78ms   76.08%
    Req/Sec   747.98    499.35     1.60k    56.17%
  87004 requests in 30.07s, 5.17MB read
Requests/sec:   2893.49
Transfer/sec:    175.92KB

With higher request rates HTTP.jl starts dropping requests. But even at 3k rps a latency of ~50ms isn’t very impressive: for comparison, Python’s aiohttp gives ~10ms at 6-7k rps (approximately, from memory).

(1000 connections seems optimal for HTTP.jl; both higher and lower numbers decrease the maximum RPS rate.)

I believe Julia has great potential to beat most web servers out there (I’d bet on ~10k rps on a Core i7, with latency below 20ms), but for this to happen we need someone with both the interest and the spare time to work on the performance of the HTTP stack.

I can’t find Julia in this benchmark https://www.techempower.com/benchmarks/

This is something quick and dirty, but seems to work fairly well and I can max out all my cores:

using Sockets
using HTTP
using DataStructures

struct WebRequest
    http::HTTP.Stream
    done::Threads.Event
end

struct Handler
    queue::CircularDeque{WebRequest}
    lock::ReentrantLock
    notify::Threads.Condition
    shutdown::Threads.Atomic{Bool}
    Handler(size = 512) = begin
        lock = ReentrantLock()
        cond = Threads.Condition(lock)
        new(CircularDeque{WebRequest}(size), lock, cond, Threads.Atomic{Bool}(false))
    end
end

function respond(h::Handler)
    @info "Started $(Threads.threadid())"
    while h.shutdown[] == false
        local request = nothing

        lock(h.lock)
            if isempty(h.queue)
                wait(h.notify)
            end
            if isempty(h.queue) == false
                request = pop!(h.queue)
            end
        unlock(h.lock)

        if request !== nothing
            while !eof(request.http)
                readavailable(request.http)
            end
            HTTP.setstatus(request.http, 200)
            write(request.http, "Request received and acknowledged.")
            notify(request.done)
        end
    end
    @info "Stopped $(Threads.threadid())"
end

function start(port = 3000, size = 512)
    local server = Sockets.listen(Sockets.InetAddr(parse(IPAddr, "0.0.0.0"), port))
    local handler = Handler(size)

    for i in 1:Threads.nthreads()-1
        @Threads.spawn respond(handler)
    end

    try
        HTTP.serve(;server = server, stream = true) do stream::HTTP.Stream
            local request = WebRequest(stream, Threads.Event())
            local overflow = false

            lock(handler.lock)
                if length(handler.queue) < size
                    push!(handler.queue, request)
                    notify(handler.notify)
                else
                    overflow = true
                end
            unlock(handler.lock)

            if overflow == false
                wait(request.done)
            else
                @warn "Dropping connection..."
                HTTP.setstatus(request.http, 500)
                write(request.http, "Server overloaded.")
            end
        end
    finally
        close(server)
        handler.shutdown[] = true
    end

end
start(3000, 3000)

I’m running Julia and wrk2 on the same machine, so I only have 8 threads to play with. Giving Julia five threads and wrk2 three threads I can get:

Running 30s test @ http://localhost:3000
  3 threads and 1000 connections
  Thread calibration: mean lat.: 900.854ms, rate sampling interval: 5754ms
  Thread calibration: mean lat.: 1050.845ms, rate sampling interval: 5976ms
  Thread calibration: mean lat.: 945.753ms, rate sampling interval: 5906ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.50s     1.72s    8.42s    80.55%
    Req/Sec     2.89k    61.60     2.99k    66.67%
  246676 requests in 30.00s, 21.66MB read
  Socket errors: connect 0, read 70, write 0, timeout 624
Requests/sec:   8222.45
Transfer/sec:    739.44KB

But the latency is suffering. At around 3700 requests/sec I can get the average latency down to 5.39ms:

Running 30s test @ http://localhost:3000
  3 threads and 1000 connections
  Thread calibration: mean lat.: 99.693ms, rate sampling interval: 801ms
  Thread calibration: mean lat.: 100.208ms, rate sampling interval: 804ms
  Thread calibration: mean lat.: 99.810ms, rate sampling interval: 800ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.39ms   11.12ms 193.66ms   94.97%
    Req/Sec     1.28k    47.43     1.46k    71.21%
  111728 requests in 30.00s, 9.80MB read
  Socket errors: connect 0, read 46, write 0, timeout 572
Requests/sec:   3724.13
Transfer/sec:    334.64KB

Because I can’t leave well enough alone, I tried having the threads spawn async tasks to handle the request, under the assumption that those tasks would be handled by that thread. So doing this:

if request !== nothing
    @async begin
        while !eof(request.http)
            readavailable(request.http)
        end
        HTTP.setstatus(request.http, 200)
        write(request.http, "Request received and acknowledged.")
        notify(request.done)
    end
end

And I ended up with the timing:

Running 30s test @ http://localhost:3000
  3 threads and 1000 connections
  Thread calibration: mean lat.: 19.789ms, rate sampling interval: 65ms
  Thread calibration: mean lat.: 22.288ms, rate sampling interval: 68ms
  Thread calibration: mean lat.: 22.270ms, rate sampling interval: 69ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.49ms   12.12ms  95.36ms   77.74%
    Req/Sec     3.22k   749.02     4.76k    63.97%
  276617 requests in 30.00s, 24.28MB read
  Socket errors: connect 0, read 129, write 0, timeout 559
Requests/sec:   9220.78
Transfer/sec:    828.63KB

Which seems pretty good: 20ms latency and 9k requests per second on 5 threads. Granted, I seem to be getting around 500 timed-out requests… not sure what that is about…


The repo for submission to techempower benchmarks is here: https://github.com/TechEmpower/FrameworkBenchmarks

I modified the code a bit to see if the same could be done with the base Channel type:

using Sockets
using HTTP

struct WebRequest
    http::HTTP.Stream
    done::Threads.Event
end

struct Handler
    queue::Channel{WebRequest}
    count::Threads.Atomic{Int}
    shutdown::Threads.Atomic{Bool}
    Handler( size = 512 ) = begin
        new(Channel{WebRequest}(size), Threads.Atomic{Int}(0), Threads.Atomic{Bool}(false))
    end
end

function respond(h::Handler)
    @info "Started $(Threads.threadid())"
    while h.shutdown[] == false
        request = take!(h.queue)
        Threads.atomic_sub!(h.count, 1)
        @async begin
            while !eof(request.http)
                readavailable(request.http)
            end
            HTTP.setstatus(request.http, 200)
            write(request.http, "Request received and acknowledged.")
            notify(request.done)
        end
    end
    @info "Stopped $(Threads.threadid())"
end

function start(port = 3000, size = 512)
    local server = Sockets.listen(Sockets.InetAddr(parse(IPAddr, "0.0.0.0"), port))
    local handler = Handler(size)

    for i in 1:Threads.nthreads()-1
        @Threads.spawn respond(handler)
    end

    try
        HTTP.serve(;server = server, stream = true) do stream::HTTP.Stream

            if handler.count[] < size
                Threads.atomic_add!(handler.count, 1)
                local request = WebRequest(stream, Threads.Event())
                put!(handler.queue, request)
                wait(request.done)
            else
                @warn "Dropping connection..."
                HTTP.setstatus(stream, 500)
                write(stream, "Server overloaded.")
            end
        end
    finally
        close(server)
        handler.shutdown[] = true
    end

end
println("starting server")
start(3000, 3000)

# benchmark with
# ./wrk -c1000 -d30s -t4 -R3000 http://127.0.0.1:3000

The code is a little shorter and seems to have slightly lower latency in my tests. However, I think there might be a slight race condition in checking/updating the Channel length. A lock might need to be added, which would defeat the purpose of a Channel in this case.


That’s a good test. I didn’t try it because there have been some posts saying Channels have some sort of inherent latency, but that usage model might be different from this one.

In theory your code does have a small race condition where the number of queued requests may exceed size. However, I believe HTTP.serve is single-threaded, meaning there is only one thread filling the queue, so that race condition won’t be hit. If the library is updated to be multi-threaded then yes, it could be hit, but the count would only go over by the number of threads, which is probably acceptable. :slight_smile:

You could handle it without a lock by doing:

current = handler.count[]
if current < size
    while Threads.atomic_cas!(handler.count, current, current + 1) != current
        current = handler.count[] 
        if current >= size
            break
        end
    end
end

if current < size
   # Add request
end

Which is probably overkill.
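For what it’s worth, a simpler lock-free variant (my own sketch, with a hypothetical `try_reserve!` helper) is to increment first and roll back on overflow, since `atomic_add!` returns the pre-increment value:

```julia
using Base.Threads: Atomic, atomic_add!, atomic_sub!

# Optimistically claim a queue slot; undo the claim if it overflowed.
# atomic_add! returns the counter's value *before* the addition.
function try_reserve!(count::Atomic{Int}, size::Int)
    old = atomic_add!(count, 1)
    if old >= size
        atomic_sub!(count, 1)  # roll back; the queue was full
        return false
    end
    return true
end
```

The counter may transiently exceed size by the number of racing threads, but it never stays there.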


I’m curious (coming from the Python world) whether using a Unix socket in combination with a WSGI equivalent could be a solution?

It’s a quick guess, so apologies if I’m not aware of something similar already being planned.

I don’t think there is a WSGI equivalent.
FWIW, I instead ended up hosting a WSGI server as is :joy:


Interesting that by using PyCall your solution is still faster than native Julia.

Have you heard of Vibora?
Their async architecture makes it particularly interesting, as it doesn’t even have to rely on WSGI.

I’m considering porting this functionality to the Julia ecosystem, since a Julia solution would surely be faster, helping gain interest in the language from outside the Julia community.
