Julia can be better at doing web: A benchmark

Disclaimer: all of this holds only if I haven’t made any mistakes while authoring the benchmark below (which is possible, because I’m not sure what Server · HTTP.jl means, and whether it’s “recommended” to do this another way). If the benchmark is correct, Julia is currently very bad at doing web.

TechEmpower’s Framework benchmarks

(GitHub - TechEmpower/FrameworkBenchmarks: Source for the TechEmpower Framework Benchmarks project) is a framework for easily and homogeneously evaluating different languages and frameworks on simple, yet non-trivial, web scenarios. It provides a set of tools, build scripts and Dockerfiles to make reproducibility and development easy. I’ve updated the Julia entry with the latest and greatest Julia and toolset and made a PR out of it: feat(benchmark): updated Julia/HTTP.jl benchmark by pankgeorg · Pull Request #8370 · TechEmpower/FrameworkBenchmarks · GitHub

Previous work

There was already some previous work on HTTP.jl and Julia: (http-jl performance updates by mcmcgrath13 · Pull Request #6215 · TechEmpower/FrameworkBenchmarks · GitHub “HTTP.jl”) and (Jewelia : Plaintext, JSON Serialization, Single Database Query, Multiple Database Queries by donavindebartolo · Pull Request #6829 · TechEmpower/FrameworkBenchmarks · GitHub “Jewelia”). This work is basically a more idiomatic rewrite of the Jewelia entry on top of the (now defunct, in the sense that it doesn’t ‘verify’ for me) HTTP.jl entry.

The benchmark

The packages used are pretty straightforward for Julia:

  • HTTP.jl for the web server
  • JSON3.jl for JSON
  • LibPQ.jl for PostgreSQL communication
  • HypertextLiteral.jl for HTML interpolation

And that’s it.

Important

There is also a “hack” that I kept around: the run.sh script launches one process per core, each accepting connections via the Linux kernel’s esoteric load balancing across sockets that have SO_REUSEPORT enabled (reuseaddr=true in HTTP.jl). I don’t think anyone deploys services like that, and I would really like to get rid of this in the future. Unfortunately, without this hack the server doesn’t verify (./tfb --mode verify --test http-jl), which means it doesn’t serve enough requests to count as a pass.
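A minimal sketch of what each process does (not the PR’s actual code; server.jl and the shell loop are illustrative):

using HTTP

# Every one of the (# cores) processes binds the same port; with
# reuseaddr=true (SO_REUSEPORT), the kernel load-balances incoming
# connections across them.
HTTP.serve("0.0.0.0", 8080; reuseaddr=true) do req
    return HTTP.Response(200, "Hello, World!")
end

# run.sh then launches one process per core, roughly:
#   for i in $(seq $(nproc)); do julia --project server.jl & done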

The rivals

We run this against:

  • Python’s fastapi
  • Python’s flask
  • JavaScript’s fastify (the default benchmark uses MongoDB; check fastify-postgres where applicable)

All of them set up with the PostgreSQL database.

Run ./tfb --mode benchmark --test http-jl fastapi flask fastify-postgres if you want to follow along.

The hardware

I ran this on an Ampere Altra Q80-30 with 80 CPUs and 256GB of RAM, and the results are kind of disappointing (verified on other hardware too, but every number I refer to below is from this machine). Julia is 3-10 times slower than any of the other frameworks we benchmark against, and only faster in JSON serialization under medium load.

Result Summary

  • fortune (query the DB, sort, and output HTML): 4-7 times slower
  • plaintext (respond with a String ASAP): 4-60 times slower
  • db (perform a single DB query): 4-8 times slower
  • update (perform DB updates): 5-7 times slower
  • query (perform a variable number of DB reads): 5-7 times slower
  • json (serialize JSON and return it): 0.5-5 times slower (0.5 = faster)


+------------------------------------------------------------------------------+
|                      Type: fortune, Result: latencyAvg                       |
+-------------------+---------+---------+------------------+---------+---------+
| concurrencyLevels | fastapi | fastify | fastify-postgres |   flask | http-jl |
+-------------------+---------+---------+------------------+---------+---------+
|                16 |  0.89ms |  0.88ms |         477.13us |  1.38ms |  3.83ms |
|                32 |  0.92ms |  0.93ms |         497.69us |  1.64ms |  4.66ms |
|                64 |  1.09ms |  1.52ms |         626.20us |  2.30ms |  6.09ms |
|               128 |  1.21ms |  1.68ms |         752.12us |  2.33ms |  6.64ms |
|               256 |  4.05ms |  4.87ms |           2.89ms |  6.17ms | 28.84ms |
|               512 | 10.37ms |  7.61ms |           7.19ms | 12.51ms | 47.33ms |
+-------------------+---------+---------+------------------+---------+---------+

+-----------------------------------------------------------------------+
|                  Type: plaintext, Result: latencyAvg                  |
+---------------------------+----------+----------+----------+----------+
| pipelineConcurrencyLevels |  fastapi |  fastify |    flask |  http-jl |
+---------------------------+----------+----------+----------+----------+
|                       256 |   3.73ms |   3.20ms |   6.35ms | 244.96ms |
|                      1024 |  11.68ms |   8.22ms |  22.60ms | 246.80ms |
|                      4096 |  52.84ms |  30.18ms |  83.89ms | 360.46ms |
|                     16384 | 222.26ms | 338.33ms | 287.10ms | 837.97ms |
+---------------------------+----------+----------+----------+----------+

+-------------------------------------------------------------------------------+
|                          Type: db, Result: latencyAvg                         |
+-------------------+----------+----------+------------------+--------+---------+
| concurrencyLevels |  fastapi |  fastify | fastify-postgres |  flask | http-jl |
+-------------------+----------+----------+------------------+--------+---------+
|                16 | 745.29us | 808.95us |         514.45us | 0.89ms |  3.36ms |
|                32 | 749.84us |   0.91ms |         542.55us | 1.15ms |  3.61ms |
|                64 |   0.89ms |   1.27ms |         769.67us | 1.54ms |  4.62ms |
|               128 |   0.96ms |   1.46ms |           0.89ms | 1.63ms |  4.95ms |
|               256 |   3.37ms |   3.46ms |           3.00ms | 3.96ms | 31.11ms |
|               512 |   8.82ms |   6.16ms |           6.98ms | 8.46ms | 43.71ms |
+-------------------+----------+----------+------------------+--------+---------+

+---------------------------------------------------------------------------------+
|                         Type: update, Result: latencyAvg                        |
+-------------------+---------+----------+------------------+----------+----------+
| concurrencyLevels | fastapi |  fastify | fastify-postgres |    flask |  http-jl |
+-------------------+---------+----------+------------------+----------+----------+
|                16 | 11.38ms |  11.98ms |          11.14ms |  14.51ms |  48.79ms |
|                32 | 17.94ms |  34.00ms |          42.87ms |  34.83ms | 222.12ms |
|                64 | 27.98ms |  61.12ms |          79.74ms |  61.99ms | 440.19ms |
|               128 | 43.71ms |  86.99ms |         120.61ms |  96.97ms | 669.13ms |
|               256 | 70.74ms | 113.50ms |         156.69ms | 155.97ms | 848.30ms |
+-------------------+---------+----------+------------------+----------+----------+

+---------------------------------------------------------------+
|                 Type: json, Result: latencyAvg                |
+-------------------+----------+----------+----------+----------+
| concurrencyLevels |  fastapi |  fastify |    flask |  http-jl |
+-------------------+----------+----------+----------+----------+
|                16 | 224.01us |  85.99us | 381.38us | 152.81us |
|                32 | 265.37us |  86.95us | 433.82us | 199.09us |
|                64 | 268.03us | 132.31us | 437.30us |   1.72ms |
|               128 | 275.18us | 177.15us | 484.25us |   4.78ms |
|               256 |   1.08ms |   0.93ms |   0.87ms |   8.73ms |
|               512 |   1.57ms |   1.47ms |   2.23ms |  10.68ms |
+-------------------+----------+----------+----------+----------+

+----------------------------------------------------------------------------+
|                      Type: query, Result: latencyAvg                       |
+----------------+---------+---------+------------------+---------+----------+
| queryIntervals | fastapi | fastify | fastify-postgres |   flask |  http-jl |
+----------------+---------+---------+------------------+---------+----------+
|              1 | 11.37ms |  6.11ms |           6.95ms |  9.76ms |  44.70ms |
|              5 | 16.46ms | 22.26ms |          29.39ms | 20.85ms | 150.07ms |
|             10 | 22.56ms | 42.61ms |          46.50ms | 33.79ms | 281.09ms |
|             15 | 27.77ms | 62.12ms |          62.88ms | 46.93ms | 403.34ms |
|             20 | 34.98ms | 81.63ms |          88.61ms | 58.67ms | 524.15ms |
+----------------+---------+---------+------------------+---------+----------+

+----------------------------------------------------------------------------------+
|                       Type: fortune, Result: totalRequests                       |
+-------------------+-----------+-----------+------------------+---------+---------+
| concurrencyLevels |   fastapi |   fastify | fastify-postgres |   flask | http-jl |
+-------------------+-----------+-----------+------------------+---------+---------+
|                16 |   271,393 |   274,710 |          503,555 | 174,686 |  62,738 |
|                32 |   523,920 |   518,147 |          976,489 | 295,919 | 104,850 |
|                64 |   892,275 |   709,285 |        1,548,450 | 444,227 | 158,516 |
|               128 | 1,003,488 |   747,625 |        1,605,884 | 518,438 | 181,736 |
|               256 | 1,202,922 |   917,032 |        1,590,039 | 621,660 | 212,334 |
|               512 | 1,115,387 | 1,058,308 |        1,605,699 | 600,512 | 213,201 |
+-------------------+-----------+-----------+------------------+---------+---------+

+------------------------------------------------------------------------------+
|                    Type: plaintext, Result: totalRequests                    |
+---------------------------+------------+------------+------------+-----------+
| pipelineConcurrencyLevels |    fastapi |    fastify |      flask |   http-jl |
+---------------------------+------------+------------+------------+-----------+
|                       256 | 10,932,009 | 17,726,288 |  8,052,246 | 8,526,728 |
|                      1024 | 12,632,536 | 19,756,110 |  8,443,728 | 9,153,410 |
|                      4096 | 12,057,721 | 21,832,874 |  9,120,014 | 6,898,582 |
|                     16384 | 11,377,496 | 22,162,879 | 10,346,123 | 5,799,038 |
+---------------------------+------------+------------+------------+-----------+

+----------------------------------------------------------------------------------+
|                         Type: db, Result: totalRequests                          |
+-------------------+-----------+-----------+------------------+---------+---------+
| concurrencyLevels |   fastapi |   fastify | fastify-postgres |   flask | http-jl |
+-------------------+-----------+-----------+------------------+---------+---------+
|                16 |   323,173 |   297,546 |          469,079 | 274,389 |  71,569 |
|                32 |   644,724 |   529,115 |          899,553 | 426,079 | 133,436 |
|                64 | 1,098,160 |   772,316 |        1,289,557 | 634,178 | 212,017 |
|               128 | 1,253,803 |   824,269 |        1,356,583 | 751,095 | 244,968 |
|               256 | 1,493,079 | 1,130,161 |        1,412,778 | 961,526 | 241,939 |
|               512 | 1,393,675 | 1,263,982 |        1,439,682 | 903,827 | 248,318 |
+-------------------+-----------+-----------+------------------+---------+---------+

+------------------------------------------------------------------------------+
|                     Type: update, Result: totalRequests                      |
+-------------------+---------+---------+------------------+---------+---------+
| concurrencyLevels | fastapi | fastify | fastify-postgres |   flask | http-jl |
+-------------------+---------+---------+------------------+---------+---------+
|                16 | 716,373 | 634,709 |          726,061 | 520,125 | 160,672 |
|                32 | 420,562 | 214,406 |          172,978 | 213,297 |  33,285 |
|                64 | 271,223 | 119,172 |           91,227 | 120,104 |  16,905 |
|               128 | 177,156 |  83,380 |           60,063 |  78,563 |  11,175 |
|               256 | 129,688 |  63,871 |           46,175 |  50,472 |   8,459 |
+-------------------+---------+---------+------------------+---------+---------+

+--------------------------------------------------------------------+
|                 Type: json, Result: totalRequests                  |
+-------------------+-----------+------------+-----------+-----------+
| concurrencyLevels |   fastapi |    fastify |     flask |   http-jl |
+-------------------+-----------+------------+-----------+-----------+
|                16 | 1,083,658 |  2,761,853 |   670,526 | 1,681,566 |
|                32 | 1,825,652 |  5,617,677 | 1,163,340 | 3,167,683 |
|                64 | 3,593,179 |  8,901,280 | 2,261,998 | 4,986,692 |
|               128 | 4,372,007 |  9,130,665 | 2,520,284 | 5,294,742 |
|               256 | 8,197,688 | 10,977,348 | 5,356,670 | 5,931,912 |
|               512 | 9,917,663 | 11,663,041 | 5,696,215 | 6,382,643 |
+-------------------+-----------+------------+-----------+-----------+

+-----------------------------------------------------------------------------+
|                      Type: query, Result: totalRequests                     |
+----------------+---------+-----------+------------------+---------+---------+
| queryIntervals | fastapi |   fastify | fastify-postgres |   flask | http-jl |
+----------------+---------+-----------+------------------+---------+---------+
|              1 | 915,169 | 1,265,911 |        1,418,118 | 796,352 | 247,699 |
|              5 | 518,331 |   327,757 |          330,922 | 363,645 |  55,519 |
|             10 | 355,150 |   170,420 |          176,096 | 222,329 |  28,009 |
|             15 | 290,846 |   116,681 |          124,172 | 162,128 |  19,342 |
|             20 | 230,072 |    88,826 |           85,962 | 126,880 |  14,594 |
+----------------+---------+-----------+------------------+---------+---------+

+----------------------------------------------------------------------------------+
|                        Type: fortune, Result: latencyMax                         |
+-------------------+----------+----------+------------------+----------+----------+
| concurrencyLevels |  fastapi |  fastify | fastify-postgres |    flask |  http-jl |
+-------------------+----------+----------+------------------+----------+----------+
|                16 |  11.94ms |   5.28ms |           3.17ms |  16.36ms |  30.31ms |
|                32 |  14.44ms |   6.89ms |          14.11ms |  22.11ms |  82.40ms |
|                64 |  17.26ms |  24.29ms |          15.03ms |  28.59ms |  95.06ms |
|               128 |  18.03ms |  23.11ms |          19.64ms |  28.62ms | 133.23ms |
|               256 |  70.93ms |  78.35ms |          72.66ms |  78.20ms | 392.05ms |
|               512 | 197.71ms | 117.00ms |         172.46ms | 173.53ms | 531.57ms |
+-------------------+----------+----------+------------------+----------+----------+

+----------------------------------------------------------------------+
|                 Type: plaintext, Result: latencyMax                  |
+---------------------------+----------+----------+----------+---------+
| pipelineConcurrencyLevels |  fastapi |  fastify |    flask | http-jl |
+---------------------------+----------+----------+----------+---------+
|                       256 |  40.80ms |  62.20ms | 102.05ms |   4.50s |
|                      1024 | 124.68ms | 148.06ms | 382.68ms |   3.88s |
|                      4096 | 543.97ms | 660.62ms |    1.20s |   4.85s |
|                     16384 |    1.48s |    8.00s |    2.55s |   8.00s |
+---------------------------+----------+----------+----------+---------+

+---------------------------------------------------------------------------------+
|                           Type: db, Result: latencyMax                          |
+-------------------+----------+----------+------------------+---------+----------+
| concurrencyLevels |  fastapi |  fastify | fastify-postgres |   flask |  http-jl |
+-------------------+----------+----------+------------------+---------+----------+
|                16 |   9.91ms |   6.58ms |          10.52ms | 16.57ms |  17.22ms |
|                32 |  17.21ms |   7.34ms |          14.60ms |  6.56ms |  49.28ms |
|                64 |  19.50ms |  17.32ms |          16.71ms | 13.56ms | 110.67ms |
|               128 |  15.07ms |  23.07ms |          14.88ms | 16.79ms |  92.80ms |
|               256 |  75.14ms |  66.80ms |          66.10ms | 67.44ms | 369.63ms |
|               512 | 142.55ms | 127.03ms |         140.99ms | 70.41ms | 520.01ms |
+-------------------+----------+----------+------------------+---------+----------+

+----------------------------------------------------------------------------------+
|                         Type: update, Result: latencyMax                         |
+-------------------+----------+----------+------------------+----------+----------+
| concurrencyLevels |  fastapi |  fastify | fastify-postgres |    flask |  http-jl |
+-------------------+----------+----------+------------------+----------+----------+
|                16 | 262.21ms | 190.21ms |         149.73ms | 314.95ms | 489.07ms |
|                32 | 197.06ms | 169.57ms |         207.58ms | 289.64ms |    1.36s |
|                64 | 254.84ms | 237.61ms |         294.31ms | 349.07ms |    3.03s |
|               128 | 507.05ms | 261.51ms |         304.46ms | 719.62ms |    4.59s |
|               256 |    1.93s | 347.90ms |         390.00ms |    1.55s |    5.26s |
+-------------------+----------+----------+------------------+----------+----------+

+------------------------------------------------------------+
|               Type: json, Result: latencyMax               |
+-------------------+---------+---------+---------+----------+
| concurrencyLevels | fastapi | fastify |   flask |  http-jl |
+-------------------+---------+---------+---------+----------+
|                16 |  1.76ms |  3.83ms | 13.74ms |   9.81ms |
|                32 | 13.85ms | 13.34ms | 14.34ms | 180.55ms |
|                64 | 10.94ms | 18.60ms | 13.79ms | 248.09ms |
|               128 | 13.82ms | 20.02ms | 16.43ms | 268.00ms |
|               256 | 25.00ms | 48.13ms | 28.01ms | 387.59ms |
|               512 | 31.31ms | 50.61ms | 62.14ms | 384.00ms |
+-------------------+---------+---------+---------+----------+

+-------------------------------------------------------------------------------+
|                        Type: query, Result: latencyMax                        |
+----------------+----------+----------+------------------+----------+----------+
| queryIntervals |  fastapi |  fastify | fastify-postgres |    flask |  http-jl |
+----------------+----------+----------+------------------+----------+----------+
|              1 | 240.92ms | 121.50ms |         121.01ms | 144.19ms | 508.16ms |
|              5 | 196.28ms | 103.58ms |         282.77ms | 330.64ms |    1.02s |
|             10 | 369.35ms | 115.44ms |         388.54ms | 469.51ms |    1.56s |
|             15 | 481.79ms | 148.37ms |         336.69ms | 661.66ms |    2.48s |
|             20 | 566.23ms | 157.20ms |         373.01ms | 601.39ms |    2.67s |
+----------------+----------+----------+------------------+----------+----------+

Help please! Next steps:

So, as I mentioned above, I’m not sure if I’m missing anything huge in how to deploy HTTP.jl services. Maybe there is a super smart way to push all work to non-interactive threads and keep the scheduler responsive, as Remove in threaded region and add a thread that runs the UV loop by gbaraldi · Pull Request #50880 · JuliaLang/julia · GitHub does. So Julia needs your help with:

  1. Making sure the benchmark is as good as it can be (realistically: things an average user with internet access and a love for reading docs would do). That includes looking for different potential culprits:
    • Is the GC the bottleneck? Does the benchmark get better if the GC is off? Does it get worse in low-memory settings?
    • Are there gotchas I’ve missed? Am I using the connection pool correctly? Is there an initialization overhead somewhere?
    • Is Julia performing better on other hardware? This was tested on ARM (and validated by a colleague on arm64), but different hardware may have significantly different results. Please run the tests and let us know if you see something different!
  2. Is there an easy and documented way to keep the “main” thread free to schedule away tasks? (See the sketch after this list.)
  3. Identify the remaining issues
  4. Make a plan to address them
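For point 2 above, here is the kind of thing I mean, as a sketch of one possible direction (an unvalidated assumption on my part; handle_request is a hypothetical handler): start Julia with a dedicated interactive thread pool (e.g. julia --threads=8,1) and push handler work onto the default pool so the scheduler stays responsive.

using HTTP

HTTP.serve("0.0.0.0", 8080) do req
    # Hand the actual work to the default (non-interactive) thread pool,
    # keeping the accepting/scheduling side free:
    task = Threads.@spawn :default handle_request(req)
    return fetch(task)  # yields to the scheduler until the work finishes
end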

Julia can be as good as Node at this, and even better if we consider that we have more tools, primitives and cool things at our disposal! Let’s make it happen!

If enough people feel that this is a priority, we can spin off this discussion to a community call to address these issues :heart::iphone:

Speculation Zone

There are a few reasons I believe these results are the way they are (NOTE: we’re in the **speculation zone** here; nothing from now on is validated/verified/official, I’m just sharing my gut feelings):

(heartbeat poll)

  • This is consistent with my Julia services performance
  • I have not had this much traffic so I don’t know
  • I have had a lot of traffic (hundreds+ requests per second, plus a database) and I didn’t have issues
  • I am not using Julia for web services

Thanks for reading all this :hugs:

– Panagiotis

54 Likes

Fantastic analysis @pankgeorg. It’s been on my list for a while to investigate server-side performance at the HTTP.jl layer, but I also suspect there are libuv-level integration issues that are holding back performance.

Not to get too speculative, but one path forward I’m currently exploring is wrapping the aws-crt suite of libraries as a full-stack solution for network operations (wrapper started here). The wrapper is focused on the client side for now, but the plan (assuming all goes well) is to wrap the server side as well. What’s unique about this situation is that AWS has open-sourced the fully vertically-integrated network stack that powers all their AWS SDKs and language bindings. It starts from the lowest levels: basically their own Base/stdlibs, then task scheduler + IO event loop infrastructure, then native socket + TLS implementations, then HTTP on top of that. Part of the performance challenge I’ve encountered with HTTP.jl is just the sheer number of integrations that have to be managed: libuv + the Julia core task scheduler + the OpenSSL/MbedTLS TLS layer + HTTP + the Julia IO interfaces, etc. There’s a lot of room for issues and bottlenecks.

Anyway, happy to chat more as things make progress, but it’s not a full-time project quite yet and I’m just chipping away at client-side functionality for now. Cheers.

16 Likes

I don’t really have anything to add, but I just wanted to commend you for one of the most constructively critical posts I’ve read.

10 Likes

plaintext (respond with a String ASAP): 4-60 times slower

[None of the others are strictly pure web benchmarks, except maybe json, though all need improving; we also want fast DB access, which usually goes with it.]

These numbers do not make sense: 244.96/3.73 (yes, 66 times slower) down to 837.97/222.26 (3.77 times). But fortune is doing more complex work in 28.84 ms, comparable to the first number, so I think something strange is going on; maybe some startup latency is added?

My first thought was that you were testing Julia startup (over 200 ms), which should be subtracted (leaving something like only 30% slower up to 2.9 times), but that seems odd since it’s not a problem for fortune.

It’s possible you have some GC issue; did you try master (or 1.10-beta2), which I believe may be better?

json (serialize JSON and return it): 0.5-5 times slower (0.5 = faster)

I started with taking a quick look at your code, seeing:

headers = [
    "Content-Type" => "application/json",
    "Server" => "Julia-HTTP",
    "Date" => Dates.format(Dates.now(), Dates.RFC1123Format) * " GMT",
]
3-element Vector{Pair{String, String}}:
 "Content-Type" => "application/json"
       "Server" => "Julia-HTTP"
         "Date" => "Mon, 28 Aug 2023 17:23:18 GMT"

Note: this is not JSON (nor a Dict); Julia has no such datatype here. A Pair is implemented as a struct, and it’s fast in isolation. If you want to index into the whole structure, you can do headers[1] for the first Pair, etc., or iterate. But do you want to index by e.g. “Content-Type”? Then you want a Dict. In your case here, I think this doesn’t matter, since you’re likely just iterating, but it’s something to keep in mind in general.

In fact, unlike in e.g. (current) Python, Julia’s Dict is not ordered. Maybe you want OrderedDict, or even what I think is comparable, DefaultOrderedDict, from DataStructures.jl? I guess I’m looking at the wrong end, though: this is the fastest code, actually 2x faster in places…
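For illustration, a quick hypothetical REPL sketch of the difference:

headers = ["Content-Type" => "application/json", "Server" => "Julia-HTTP"]

headers[1]                 # positional indexing works: "Content-Type" => "application/json"
# headers["Content-Type"]  # ERROR: a Vector{Pair} has no key-based lookup

using OrderedCollections   # (or DataStructures)
d = OrderedDict(headers)   # preserves insertion order, unlike Dict
d["Content-Type"]          # "application/json"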

This seems like a great analysis; things like this tend to provoke a closer look and hopefully substantial optimizations.

I know I could just read the code, but can you clarify one thing: is TTFX (time-to-first-X, i.e. compilation latency) possibly contaminating the results? That’s a problem for some benchmarks, but perhaps this one has been designed in a way that ignores the first run?

5 Likes

I’m very intrigued to see the libuv-level integration issues and possible mitigations!

This is very exciting! During this exercise, ccalls have been giving me goosebumps with how they interact with the scheduler, but I do see value in getting battle-tested code and integrating it, so I look forward to it!

I understand, totally. I’m hoping we can get some nice discussions going to steer our efforts towards a better Julia on the web.

Could you run the benchmark locally and let me know what results you get? It may have something to do with the environment (that computer has 80 cores; maybe another framework parallelizes really well).
To run it:

  1. git clone git@github.com:pankgeorg/FrameworkBenchmarks.git
  2. cd FrameworkBenchmarks
  3. git checkout pg/julia-new
  4. ./tfb --mode verify --test http-jl to verify the benchmark runs locally (note: it needs Docker)
  5. ./tfb --mode benchmark --test http-jl fastapi flask fastify-postgres (this step will create results in the “results” folder)
  6. use GitHub - joeyleeeeeee97/PlainTextResultsParser to convert the results JSON file to a small table (a plaintext results parser for [FrameworkBenchmarks](https://github.com/TechEmpower/FrameworkBenchmarks) JSON, based on Python’s [prettytable](https://pypi.org/project/prettytable/))

The vector you see is serialized into the response’s headers (one of the requirements of the benchmark is to serve the current date in the headers); the serialization of the body to JSON happens a few lines below, here, with the (proven to be) excellent JSON3.

I really hope to get some time from really smart people on this! Julia uses libuv exactly like Node does, so we should be at least on par with it. A path forward, if my hunches are correct, is to make sure that all “scheduler-blocking” work is managed in a way that doesn’t hinder IO performance. Julia gives great power to us programmers, and it’s really exciting to have all the threads, ccalling, JIT and everything at your fingertips; to play well with the async nature of the web, though, we need some “rules”. (Node’s rule is that user code only runs on one thread and can’t block anything. [Note: Node, if I understand correctly, is not single-threaded for I/O, just for “userland”.]) I don’t think we need borders that strict to be good at web, but we do need more awareness of the side effects of things: for example, the excellent LibPQ’s LibPQ.execute function blocks Julia completely, and in a web-first world (which Julia’s isn’t) that would come with a neon STOP sign. (The benchmark uses LibPQ.async_execute, which is non-blocking and works as expected.) Less extreme examples are here and there (see the sleep issue, or my conjecture about accept).
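To make the LibPQ point concrete, a sketch of the difference (the connection string and query are illustrative, not the benchmark’s actual code):

using LibPQ

conn = LibPQ.Connection("host=tfb-database dbname=hello_world")

# Blocking: holds up the calling thread until Postgres replies, so every
# other task scheduled there stalls too.
result = LibPQ.execute(conn, "SELECT id, randomnumber FROM world WHERE id = 1")

# Non-blocking: returns an AsyncResult immediately; fetch() yields to the
# scheduler, letting other requests make progress while we wait.
async_result = LibPQ.async_execute(conn, "SELECT id, randomnumber FROM world WHERE id = 1")
result = fetch(async_result)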

I don’t think it is; this is actually one of the reasons I picked this benchmark suite and didn’t build my own: they have abstracted away these questions:

(from https://www.techempower.com/benchmarks/#section=motivation)

16. *"How is each test run?"* Each test is executed as follows:
  1. Restart the database servers.
  2. Start the platform and framework using their start-up mechanisms.
  3. Run a 5-second **primer** at 8 client-concurrency to verify that the server is in fact running. These results are not captured.
  4. Run a 15-second **warmup** at 256 client-concurrency to allow lazy-initialization to execute and just-in-time compilation to run. These results are not captured.
  5. Run a 15-second **captured test** for each of the concurrency levels (or iteration counts) exercised by the test type. Concurrency-variable test types are tested at 16, 32, 64, 128, 256, and 512 client-side concurrency. The high-concurrency *plaintext* test type is tested at 256, 1,024, 4,096, and 16,384 client-side concurrency.
  6. Stop the platform and framework.

So, the relevant codepaths must have been compiled already.

12 Likes

Was it related to this discussion?

EDIT: I did not look at the right code! The new PR version indeed uses a pool (here).

The Julia benchmark creates a database connection for each request, whereas other frameworks seem to use a pool of global connections, typically via an ORM (knex for fastify), which are reused between requests. Maybe this is one cause of the lower performance on the database benchmarks?

Example:

conn = DBInterface.connect(MySQL.Connection, "tfb-database", "benchmarkdbuser", "benchmarkdbpass", db="hello_world")
results = DBInterface.execute(conn, sqlQuery)
# ...
DBInterface.close!(conn)

versus

async function allFortunes() {
  return knex("Fortune").select("*");
}
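For comparison, a minimal connection-pool sketch in Julia (hypothetical code, not the PR’s implementation; it hands out 25 pre-opened LibPQ connections via a Channel):

using LibPQ

const POOL = Channel{LibPQ.Connection}(25)
for _ in 1:25
    put!(POOL, LibPQ.Connection("host=tfb-database dbname=hello_world"))
end

function with_connection(f)
    conn = take!(POOL)      # blocks until a connection is free
    try
        return f(conn)
    finally
        put!(POOL, conn)    # always return the connection to the pool
    end
end

# usage: with_connection(conn -> LibPQ.execute(conn, "SELECT * FROM fortune"))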
3 Likes

You should check Oxygen.jl; it’s quite similar to FastAPI. The last time I ran a tiny microbenchmark, it looked faster than FastAPI.

Oxygen uses HTTP.jl without much extra configuration, so I’d expect a very similar experience. If I’m not mistaken, your previous benchmarks were 1,000 single-threaded requests; this one looks at hundreds of thousands of requests at high concurrency.

1 Like

The benchmark is not using SSL at all, so it shouldn’t be that. There may be performance bottlenecks in another ccalled library, though.

Indeed; that is one of the things I added! That doesn’t mean the implementation is necessarily correct, though; I may have missed something (comments/suggestions on the PR are very welcome!).

The pool size is 25, which should play well with 80 cores, since the benchmark’s Postgres has a limit of 2,000 connections. Let me know if you have any suggestions!

Oxygen seems to have a very cool parallelization implementation; I think I can hook it up to the benchmark pretty easily (give me a few days). If it can actually share the work between cores efficiently (and help us get rid of the hack I mentioned above), I’ll be super happy.

That is my hunch too (that the experience would be very similar). The benchmark issues as many requests as possible, reaching 16,384 concurrent connections at the “final” stage.

3 Likes

The linked benchmark, GitHub - omcloudinc/c_http_jl: wrapping c lib and benchmarking with http.jl and node, looked easy enough to reproduce locally, so I gave it a shot!
Interestingly enough, Oxygen is the slowest for me on Julia 1.10.0-beta2 (I didn’t specifically pick that Julia version; it’s just what I currently use).

HTTP.jl:
┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 0 ms │ 0.07 ms │ 7.04 ms │ 869 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬──────────┬─────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%  │ Avg      │ Stdev   │ Min    │
├───────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼────────┤
│ Req/Sec   │ 2071   │ 2071   │ 14943  │ 16527  │ 13826.46 │ 3777.82 │ 2071   │
├───────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼────────┤
│ Bytes/Sec │ 122 kB │ 122 kB │ 882 kB │ 975 kB │ 816 kB   │ 223 kB  │ 122 kB │
└───────────┴────────┴────────┴────────┴────────┴──────────┴─────────┴────────┘
Oxygen.jl:

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max   │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼───────┤
│ Latency │ 0 ms │ 0 ms │ 1 ms  │ 1 ms │ 0.37 ms │ 0.6 ms │ 19 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴───────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 10079   │ 10079   │ 10223   │ 10375   │ 10244.8 │ 90.97   │ 10072   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 1.01 MB │ 1.01 MB │ 1.02 MB │ 1.04 MB │ 1.02 MB │ 9.14 kB │ 1.01 MB │


JSServe.jl:

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max  │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 0 ms │ 0.01 ms │ 0.17 ms │ 9 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────┘
┌───────────┬────────┬────────┬────────┬────────┬──────────┬─────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%  │ Avg      │ Stdev   │ Min    │
├───────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼────────┤
│ Req/Sec   │ 14535  │ 14535  │ 15143  │ 16911  │ 15298.19 │ 579.94  │ 14529  │
├───────────┼────────┼────────┼────────┼────────┼──────────┼─────────┼────────┤
│ Bytes/Sec │ 858 kB │ 858 kB │ 893 kB │ 997 kB │ 903 kB   │ 34.2 kB │ 857 kB │
└───────────┴────────┴────────┴────────┴────────┴──────────┴─────────┴────────┘

As comparison a node.js http server:
┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max  │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 0 ms │ 0.01 ms │ 0.04 ms │ 8 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg      │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┤
│ Req/Sec   │ 45311   │ 45311   │ 65503   │ 67263   │ 60474.19 │ 8168.61 │ 45285   │
├───────────┼─────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┤
│ Bytes/Sec │ 7.29 MB │ 7.29 MB │ 10.6 MB │ 10.8 MB │ 9.74 MB  │ 1.32 MB │ 7.29 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴──────────┴─────────┴─────────┘

And somehow JSServe is the fastest, even though it just uses HTTP.jl almost directly.
I do have a very small routing layer, but maybe the benchmark is also noisy: the results change quite a bit across runs (I took the best result for every package after running it a few times).
Code:

using JSServe, HTTP
# JSServe usually works with route => App(...), so we need to overload this method:
JSServe.HTTPServer.apply_handler(x::HTTP.Response, context) = x
server = Server("0.0.0.0", 8083; verbose=0)
route!(server, "/" => HTTP.Response(200, "Hi"));
route!(server, "/bye" => HTTP.Response(200, "Bye!"));

# Oxygen.jl:
using Oxygen
@get "/" () -> "hi"
@get "/bye" () -> "bye!"
serveparallel(port=8082, access_log=nothing)

# Plain HTTP.jl:
using HTTP
const ROUTER = HTTP.Router()
HTTP.register!(ROUTER, "GET", "/", req -> HTTP.Response(200, "Hi"))
HTTP.register!(ROUTER, "GET", "/bye", req -> HTTP.Response(200, "Bye!"))
HTTP.serve(ROUTER; port=8081, verbose=0)

All are run with julia -t auto, which defaults to 24 threads on this machine:

julia> versioninfo()
Julia Version 1.10.0-beta2
Commit a468aa198d (2023-08-17 06:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  Threads: 35 on 24 virtual cores
7 Likes

I was very interested in looking into this because I have worked a bit with HTTP.jl and I see great potential in Julia for the web. Achieving ultimate performance was never a main concern for me, but I would certainly like to see us get there, and at least find out what bottlenecks might exist.

I was pretty surprised with these results, especially for plaintext and json, and I’d like to get to the bottom of this.

I’ve run the benchmark on my laptop for fastapi and http-jl. Here are the results:

+-------------------------------------------------+
|       Type: plaintext, Result: latencyAvg       |
+---------------------------+----------+----------+
| pipelineConcurrencyLevels | fastapi  | http-jl  |
+---------------------------+----------+----------+
|            256            | 46.20ms  | 22.32ms  |
|            1024           | 161.26ms | 129.33ms |
|            4096           | 586.51ms |  1.10s   |
|           16384           |  2.72s   |  2.57s   |
+---------------------------+----------+----------+
+-----------------------------------------+
|      Type: json, Result: latencyAvg     |
+-------------------+----------+----------+
| concurrencyLevels | fastapi  | http-jl  |
+-------------------+----------+----------+
|         16        | 387.85us | 314.87us |
|         32        | 688.57us | 656.56us |
|         64        |  1.37ms  |  1.09ms  |
|        128        |  2.45ms  |  1.89ms  |
|        256        |  4.69ms  |  4.18ms  |
|        512        |  8.47ms  |  7.35ms  |
+-------------------+----------+----------+
+-------------------------------------------------+
|       Type: plaintext, Result: latencyMax       |
+---------------------------+----------+----------+
| pipelineConcurrencyLevels | fastapi  | http-jl  |
+---------------------------+----------+----------+
|            256            | 220.05ms | 103.90ms |
|            1024           | 448.10ms | 780.70ms |
|            4096           |  3.16s   |  6.89s   |
|           16384           |  7.78s   |  8.00s   |
+---------------------------+----------+----------+
+----------------------------------------+
|     Type: json, Result: latencyMax     |
+-------------------+----------+---------+
| concurrencyLevels | fastapi  | http-jl |
+-------------------+----------+---------+
|         16        | 23.35ms  | 21.28ms |
|         32        | 23.79ms  | 25.59ms |
|         64        | 33.39ms  | 28.00ms |
|        128        | 36.93ms  | 30.98ms |
|        256        | 77.85ms  | 87.31ms |
|        512        | 132.47ms | 64.00ms |
+-------------------+----------+---------+
+-----------------------------------------------+
|     Type: plaintext, Result: totalRequests    |
+---------------------------+---------+---------+
| pipelineConcurrencyLevels | fastapi | http-jl |
+---------------------------+---------+---------+
|            256            |  945736 | 1691416 |
|            1024           |  888942 | 1147127 |
|            4096           |  829934 |  545511 |
|           16384           |  728587 |  243836 |
+---------------------------+---------+---------+
+---------------------------------------+
|   Type: json, Result: totalRequests   |
+-------------------+---------+---------+
| concurrencyLevels | fastapi | http-jl |
+-------------------+---------+---------+
|         16        |  667603 |  925204 |
|         32        |  765856 | 1043969 |
|         64        |  816357 | 1135183 |
|        128        |  875290 | 1229839 |
|        256        |  895545 | 1147750 |
|        512        |  918120 | 1117826 |
+-------------------+---------+---------+

I don’t know what to say; apparently it “works for me”? I guess the machine configuration must be playing a part. I see you’ve used ARM, so maybe it’s about some optimization(s) that work much better on x86; maybe we can figure this out first. I don’t know how the architecture could affect the results so much, but looking at this I would be less inclined to hypothesize about systematic GC or scheduling issues.

The one thing I found very curious here is why totalRequests for plaintext decays with concurrency; this seems like something we should definitely look into. How exactly is that statistic measured?

PS: I ran the benchmark from the main branch, 6d92c7b44

For one thing, you’re not testing the PR I’ve submitted, but the existing Julia 1.6 code that only has in-memory datasets (no database operations).

Also, the results you posted show that Julia is collapsing under high concurrency, only being able to serve 15% of the original requests (versus 75% for fastapi).

And you are not testing against JavaScript’s fastify, which uses libuv just as Julia does.

And yes, Julia is fast in the good scenario, no doubt about that. My point is that:

  1. Julia collapses unexpectedly under high load
  2. the service degradation is not smooth (see how many log messages you get)
  3. Julia isn’t rejecting/closing connections actively; it watches them being closed by siege and errors out
  4. the collapse is much, much faster when you use a database

Of course this is also a matter of priorities. If your priority is raw speed in simple cases, then Julia currently shines at that.

The reason why I believe we can be better at this is because I’m looking at the web ecosystem of Julia, having a different set of priorities than raw speed. My priorities are:

  1. Service availability (the service remains available under high load)
  2. Robustness (the service doesn’t error out, and recovers from errors)
  3. Resource utilization (the service uses all available resources (cores, bandwidth) to do work, with minimal management overhead: the overhead to create tasks, migrate tasks, track limits (which we don’t track), and lock resources)
  4. Speed and throughput

If you look at this picture (actual metrics are still pending on my side, WIP), I think Julia can be much, much better: threaded IO, investigating GC issues, weeding blocking ccalls out of the ecosystem, deployment guides, deployment policies, etc. Would you say that it works for you under these requirements?

7 Likes

Thanks for the clarification. Sorry it takes some time for me to catch up with what you’re doing; I’m a slow thinker. I didn’t understand at first that there’s a separate PR. What are the results for you on the main branch: are they similar, or was there some kind of regression in the PR? And which Julia versions are used?

My main concern looking at your results was seeing Julia generally slower than all the other Python or JavaScript alternatives; that’s why I only tried one Python alternative, which seemed the fastest. And like I said, I’m focusing on plaintext and json first, because they seem like the simplest tests, they’re what I’m most familiar with, and they exercise just HTTP.jl.

Is this degradation with concurrency on the plaintext test a good representation of your concerns? I think we should definitely look into it and figure out the source of the error messages. I have seen similar error messages but never got around to looking into them.

How did you get to these 75% and 15% figures?

I’m sorry if I sound snarky; I was just surprised that my numbers didn’t match yours, and I’m trying to find out why. I would love to find out how to improve speed and robustness for everybody.

1 Like

I divided the number for concurrency 16384 by the number for concurrency 256. From your plaintext totalRequests table: 243,836 / 1,691,416 ≈ 15% for http-jl, versus 728,587 / 945,736 ≈ 77% for fastapi.

You can follow these guidelines to run all the benchmarks:

You don’t! No worries; I expect everyone to test the code and validate or reject the results, and thank you for your time running these! I really hope there is a silver bullet that makes Julia not collapse, but most likely we need to do some work on multiple levels!

Yes, it is. I’m looking for ways to fix this, make it better, or at least handle it somehow.

That test didn’t “validate” for me, but maybe that’s because of ARM. I’m using Julia 1.9.3. I’m more interested in the code as it is now; the initial code is too low-level: it doesn’t use the HTTP.Router, for example. My goal is not to be as fast as possible; it is to write an application the way someone reading the docs would, and test that.

Here are the results for the PR on my machine. They are more similar to yours now.

+-------------------------------------------------+
|       Type: plaintext, Result: latencyAvg       |
+---------------------------+----------+----------+
| pipelineConcurrencyLevels | fastapi  | http-jl  |
+---------------------------+----------+----------+
|            256            | 69.64ms  | 110.03ms |
|            1024           | 215.15ms | 415.83ms |
|            4096           | 912.04ms |  1.71s   |
|           16384           |  3.42s   |  2.56s   |
+---------------------------+----------+----------+
+-----------------------------------------+
|      Type: json, Result: latencyAvg     |
+-------------------+----------+----------+
| concurrencyLevels | fastapi  | http-jl  |
+-------------------+----------+----------+
|         16        | 469.42us | 548.09us |
|         32        |  1.09ms  |  3.65ms  |
|         64        |  2.05ms  |  9.52ms  |
|        128        |  3.18ms  | 12.65ms  |
|        256        |  5.41ms  | 13.85ms  |
|        512        | 10.04ms  | 16.42ms  |
+-------------------+----------+----------+
+------------------------------------------------+
|      Type: plaintext, Result: latencyMax       |
+---------------------------+----------+---------+
| pipelineConcurrencyLevels | fastapi  | http-jl |
+---------------------------+----------+---------+
|            256            | 508.80ms |  1.20s  |
|            1024           | 899.86ms |  2.55s  |
|            4096           |  3.65s   |  6.82s  |
|           16384           |  8.00s   |  8.00s  |
+---------------------------+----------+---------+
+----------------------------------------+
|     Type: json, Result: latencyMax     |
+-------------------+---------+----------+
| concurrencyLevels | fastapi | http-jl  |
+-------------------+---------+----------+
|         16        | 19.99ms | 63.20ms  |
|         32        | 27.84ms | 212.20ms |
|         64        | 36.83ms | 316.00ms |
|        128        | 42.33ms | 320.01ms |
|        256        | 73.44ms | 275.95ms |
|        512        | 92.16ms | 303.26ms |
+-------------------+---------+----------+
+-----------------------------------------------+
|     Type: plaintext, Result: totalRequests    |
+---------------------------+---------+---------+
| pipelineConcurrencyLevels | fastapi | http-jl |
+---------------------------+---------+---------+
|            256            |  699697 |  905647 |
|            1024           |  679847 |  751214 |
|            4096           |  649543 |  429557 |
|           16384           |  513590 |  281671 |
+---------------------------+---------+---------+
+---------------------------------------+
|   Type: json, Result: totalRequests   |
+-------------------+---------+---------+
| concurrencyLevels | fastapi | http-jl |
+-------------------+---------+---------+
|         16        |  575246 |  578353 |
|         32        |  591786 |  484725 |
|         64        |  578934 |  563947 |
|        128        |  698312 |  529738 |
|        256        |  765819 |  587659 |
|        512        |  777563 |  610791 |
+-------------------+---------+---------+
1 Like

Thanks for also running this. It tells us that there is probably some measurable performance impact from using the router.

1 Like

I will try to run the tests on my machines.
Could you explain the reasoning behind this setup? Is it imposed by the benchmark?

This provides process-based parallelism and artificially makes Julia seem better at the benchmark than it is. See the note above: I couldn’t make the benchmark validate (i.e. complete enough queries for the db-based tests) without it. And since TechEmpower runs this with only a few cores, it fails on their system too.

I (and several others) have performance ideas to make this better, the first one being feat(server): spawn task sooner in listenloop by pankgeorg · Pull Request #1102 · JuliaWeb/HTTP.jl · GitHub (though this benchmark doesn’t use SSL, so I only expect accept rates to be a little higher).

PRs, comments, questions and challenges are all very welcome (even, ESPECIALLY, if they say I’m very wrong; I want to be!). There is nothing intrinsically wrong with Julia; we can beat this :muscle:t5::muscle:t5::muscle:t5::muscle:t5:

1 Like