HTTP.jl async is slow compared to python+aiohttp

Yeah, this is expected because @async in julia just spawns a task (the same as Threads.@spawn). This is a fair comparison.
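
As a minimal sketch of that point (my own illustration, not code from this thread): both macros produce a Task, and the difference is only in where the scheduler is allowed to run it.

t1 = @async sum(rand(10^6))          # Task that stays on the thread that created it
t2 = Threads.@spawn sum(rand(10^6))  # Task the scheduler may place on any available thread
typeof(t1), typeof(t2)               # both are Task
fetch(t1) + fetch(t2)                # block until each finishes, then use the results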

Sorry, with "fair comparison" I am specifically trying to make sure my results from running the code written by @mkitti represent an equivalent comparison in terms of real resource usage compared to asyncio+aiohttp. As you said it's a little hard for me to gauge at first glance since Julia and python are using very different paradigms for concurrency.

Let me just summarize some of the key points that you have both made to make sure I am properly understanding it:

  1. @async creates tasks on the same thread, @spawn creates tasks on (potentially) many threads, and @threads always creates new threads.
  2. The await/async + explicit event loop asyncio syntax in many modern programming languages achieves concurrency on one thread, while in Julia, concurrency from async is achieved using potentially many threads per CPU core.
  3. Julia 1.9 will include new capabilities to explicitly control the number of CPU cores that can be used even if the maximum number of threads is set much higher.

What I want to make sure is that I have code that will also scale when only a small number of CPU cores is available but plenty of bandwidth is. I know just from looking at CPU usage that asyncio+python would work, since measured CPU usage is < 100%. I am not sure that is the case with this Julia implementation, since I get the same running times with @spawn vs @threads, so I want to make sure it isn't just taking advantage of having multiple real CPU cores available rather than actually optimizing for non-blocking I/O. Does this make sense?

No. Threads.@threads is also Task based. Pre-Julia 1.9, nothing creates threads within Julia. You can set the number of threads from the command line.

C++: std::async potentially runs on another thread (std::async - cppreference.com).
Go: goroutines can run on multiple threads, although by default a goroutine starts on the thread it was called from, etc.
libuv (node.js): sometimes uses a thread pool for asynchronous I/O (Design overview - libuv documentation).

While Pythonā€™s asyncio itself uses a single thread event loop scheduler, it often hands off I/O operations to underlying C routines running on other threads, often abstracted by the operating system.

I'm not sure what you are specifically referring to here. Julia 1.9 gains interactive threads. I would prefer not to discuss this here since you are already quite confused with the pre-1.9 behavior.

Threads.@spawn and Threads.@threads are not very different. The former directly launches a Task. The latter is just a construct to launch Tasks via something that looks like a for loop.

All Julia parallelism is Task based. Tasks in Julia can sometimes be distributed across multiple threads, and the number of threads must be defined at program start. This particular form of Task based parallelism is inspired by prior art in other languages and systems, particularly Cilk, Go, and Node.js.
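
To make that concrete, here is a rough sketch (my own illustration, not code from this thread) of the two constructs doing equivalent work; both simply create Tasks that the scheduler distributes over the threads Julia was started with.

# Threads.@spawn: one Task per call; the scheduler may place each on any available thread.
tasks = [Threads.@spawn sum(rand(10^5)) for _ in 1:8]
results_spawn = fetch.(tasks)

# Threads.@threads: a for-loop front end that splits the iteration range into Tasks.
results_threads = Vector{Float64}(undef, 8)
Threads.@threads for i in 1:8
    results_threads[i] = sum(rand(10^5))   # each iteration runs inside a Task
end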

If you would like to limit CPU usage, I would limit the number of Tasks running concurrently. See the ntasks argument of Threads.foreach, for example. You could also use locks in certain ways.
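
For example, a minimal sketch of that idea (assuming urls is defined as elsewhere in this thread, and fetch_one is a hypothetical function that performs one request): fill a Channel with the work and let Threads.foreach drain it with a bounded number of Tasks.

work = Channel{String}(length(urls))
foreach(url -> put!(work, url), urls)
close(work)

Threads.foreach(work; ntasks=8) do url   # at most 8 Tasks pull from the channel at once
    fetch_one(url)                       # hypothetical request function, e.g. one HTTP GET
end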

We have to be careful in direct comparisons to Python since it is often handing off I/O to C routines that release the GIL and may use threads under the hood. In contrast, much of the same I/O in Julia is being directly managed by Julia with some assistance from libuv to provide operating system level abstractions.

4 Likes

Before I start this reply I just want to be clear that I do appreciate you and @Oscar_Smith taking the time to explain this stuff in detail, and I am sorry if my confusion has caused any frustration on your part. If it's not clear, I am a relative newcomer to Julia with a lot of background in Python doing big-data tasks, but I have been really drawn to some of the many cool features this language has to offer. This led me to trying to come up with further improvements to Zarr.jl to bring it to parity (in terms of performance) with zarr-python, since I'd like to take advantage of some numerical code I have written that runs several times faster in Julia, and the input data will be in zarr format.

Before I made this thread, most examples I had seen demonstrating how to process large numbers of requests with either HTTP.jl or Downloads.jl were mainly using @async or asyncmap. So one thing I hope comes out of this discussion (besides figuring out the best way to proceed with our proposed improvements to Zarr.jl) is to make the best practices for processing many HTTP requests in Julia a bit more clear, particularly for those like me who don't yet fully understand the threading and async paradigm used in Julia.

The origin of my confusion on this point is this statement:

These languages have a feature called tasks (also known as green-threads in some of the research) which allow the language to use dramatically more threads than you have cores.

So reading this, I am now under the impression that I can get great concurrency even if I only have a small number of CPU cores available, and that I can achieve this by starting Julia with a larger number of threads than I have cores. The problem is that when I do this on the machine I have been testing on, the threads are generally being run on multiple cores. This is why I was interested in knowing if I could increase the number of threads in Julia while limiting the number of cores that could be used, so I can test whether the performance gain still holds with a smaller number of cores but a large number of threads. When you stated:

In case it was not clear above, Threads.@spawn does not spawn a thread. It spawns a Task that can run on another thread. The ability to create new threads only arrives in Julia 1.9. Thus, you may be able to limit CPU activity by controlling when you use Threads.@spawn.

My interpretation was that if I need a convenient way to limit the number of CPU cores Julia will use while increasing the number of threads, I'll need to wait until Julia 1.9. Was this what you were referring to when mentioning interactive threads?

We have to be careful in direct comparisons to Python since it is often handing off I/O to C routines that release the GIL and may use threads under the hood. In contrast, much of the same I/O in Julia is being directly managed by Julia with some assistance from libuv to provide operating system level abstractions.

The origin of my concern is how I am measuring CPU usage. Each time I run these examples in both python and Julia I monitor CPU usage through top, and what I am finding is that, without fail, CPU usage is significantly lower with asyncio+python than with Julia when the latter uses more than one thread. If you have better suggestions for how I should monitor or measure this please let me know, but I am thinking it may be best for me to just spin up an EC2 instance with a very small number of CPUs and high bandwidth, then run this again to see if the Julia threaded request code with at least 10 threads still matches my asyncio+aiohttp code. That is the main question I am still curious about.

Again, I appreciate the patience and help!

2 Likes

So in Julia --threads sets the number of OS threads (think cores, although that's technically not quite correct). @spawn or @async in Julia will both create tasks which will be farmed out to available threads. This means that if you have more threads than tasks, spawning more tasks will increase CPU usage (but hopefully reduce time), while if you have more tasks than threads, spawning more tasks will just give more work to each of the threads that you have.

Python doesn't have multi-threading (except when calling out to C) so async code in python will be capped at 1 core (+ whatever C is doing). If you want to limit the CPU usage of Julia you can do that by setting --threads.
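
As a rough illustration of that mapping (my own sketch, not one of the benchmarks in this thread), with a small --threads setting the Tasks are simply multiplexed onto the available OS threads rather than onto extra cores:

# e.g. started with `julia --threads=2`
tasks = [Threads.@spawn begin
             sleep(1)              # stands in for blocking I/O; yields the thread
             Threads.threadid()    # report which OS thread this Task ended up on
         end for _ in 1:8]
unique(fetch.(tasks))              # at most Threads.nthreads() distinct thread ids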

2 Likes

Ok, so I decided to finally set up a test AWS EC2 instance with the latest release of Julia installed on it and ran both my original asyncio+aiohttp example and the proposed threaded HTTP.jl implementation on an r5n.large instance:

OS: Ubuntu 22.04
CPUs: 1
RAM: 16 GB
Bandwidth: 1784.24 Mbit/s

Python asyncio + aiohttp

In [14]: %time cdatas = asyncio.run(get_items(keys))
CPU times: user 5.33 s, sys: 5.84 s, total: 11.2 s
Wall time: 12.2 s

Julia (@mkitti stream buffered HTTP.jl with threads)

@time f();
 37.600126 seconds (1.73 M allocations: 181.642 MiB)

So I think this time the results should leave very little room for doubt as now I can be certain that no more than 1 CPU can be used by Python, Julia, or any external calls to C libraries.

7 Likes

So if I'm understanding right, that's ~2800 MB in 12 s vs 37 s? That works out to about 2800 MB × 8 / 12 s ≈ 1867 Mbit/s, so the python one seems to be saturating the network bandwidth while the Julia one is 3x away from that cap. Pretty impressive on the python side, while it looks like there's a ways to go for Julia.

3 Likes

Honestly speaking, I never expected HTTP.jl to be a highly performant package. It contains too many dynamic dispatches. The way HTTP.jl implements nested layers for HTTP processing is tricky: a lot of higher-order functions are chained together, and Julia's type inference doesn't handle them well, although some type guards are added to prevent propagation of dynamic types. But I don't know whether this significantly contributes to the slowdown here. Also, the functions are overly specialized and produce a huge tuple type. AFAIK, this is more or less unnecessary, since we could use a unified interface (implied for type stability) for each layer of the HTTP server and prevent the over-specialization.

2 Likes

A simple test of connection overhead would be to change the HTTP request from "GET" to "HEAD". This will just get the header and not transfer any actual data:

julia> const urls = map(i->"https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1/analysed_sst/$i.1.0", 0:99);

julia> function f()
           @sync map(urls) do url
               Threads.@spawn HTTP.request("HEAD", url, status_exception=false, connection_limit=25)
           end
       end

...

julia> @time f();
  7.836372 seconds (18.71 k allocations: 2.988 MiB)

julia> const buffers = [IOBuffer(; sizehint = 64*1024*1024, maxsize=64*1024*1024) for x in 1:100];
WARNING: redefinition of constant buffers. This may fail, cause incorrect answers, or produce other errors.

julia> function f()
           seekstart.(buffers)
           @sync map(urls, buffers) do url, buffer
               Threads.@spawn HTTP.request("HEAD", url, status_exception=false, connection_limit=25, response_stream=buffer)
           end
       end
f (generic function with 1 method)

julia> @time f();
  8.094062 seconds (24.58 k allocations: 3.448 MiB, 0.95% compilation time)

julia> @time f();
  7.382402 seconds (18.29 k allocations: 2.908 MiB)

julia> Threads.nthreads()
8

That's exactly right, and the bandwidth numbers I had derived were measured independently through the speedtest CLI. You can readily see that dividing the total amount of data (~2.8 GB) by the running time matches up with the expected bandwidth, implying that asyncio+aiohttp is indeed fully saturating my network connection when processing many requests.

I have read in other threads multiple times that HTTP.jl still has room for better optimization, but from what I can tell, at least when processing single requests one at a time, it's about the same speed as python requests.

So a follow up question I have is this: Does HTTP.jl perform the I/O for writing the response content in a non-blocking fashion? I would think this could be a more likely explanation for the massive differences I am seeing. This is why I brought up the possibility of there needing to be a dedicated API for async HTTP requests.

A dedicated API for async I/O on the julia side won't save you, as that's exactly what you've already done with @async. The issue here seems to lie a little deeper, and while I don't know how python handles its I/O, I do know that julia (currently) funnels all I/O through a single task, which likely becomes your bottleneck here.

You can follow some of that discussion in these issues:

The general consensus is that for us to get much better (async) I/O, ideally libuv (which julia uses for tasks & I/O) would gain support for io_uring, with julia subsequently making use of that to do I/O on a per-task basis.

You'll likely find more discussions about this if you do a little digging here and on the issue tracker. These two are just the ones I know OTOH to be related to the underlying issue.

1 Like

There are too many factors that may impact the benchmark here, especially since we don't know how aiohttp configures its HTTP connections. From aiohttp's documentation, it seems that aiohttp uses a default connection pool size of 100 and places no limit on a single endpoint (so you can have 100 simultaneous connections to a single server). aiohttp also preallocates a buffer internally.

I think out of every possible explanation so far this sounds the most likely to me, since this example is clearly I/O bound. So in other words, if I queue up dozens of these requests as tasks and attempt to run them through @async, they still essentially have to wait in a queue at the point where I/O is done on the response content?

I think it's clear, at least for this example, that stream buffer allocation and connection pool size aren't the main culprits. I have run the implementation given in this thread using just a simple asyncmap and found virtually no difference; nearly all of the performance improvement came from using multiple CPU cores. I should also add that I have tested this with a different set of 100 chunks (replacing $i.1.0 with $i.0.0 in the urls) in this zarr store which are much smaller (by around a factor of 50) and found that just using asyncmap and HTTP.jl with mostly default parameters resulted in around a 1 second difference in speed compared to aiohttp on my connection. (The total times now being around 3-4 s.)

I also went ahead and wrapped each HTTP.jl request in the @time macro to see how long each of the tasks takes to complete, and this is the result I got:

julia> asyncmap(urls) do url
           @time HTTP.request("GET", url, status_exception=false).body;
       end
  1.498204 seconds (71.60 k allocations: 202.762 MiB)
  1.584878 seconds (74.87 k allocations: 226.957 MiB)
  1.683540 seconds (80.07 k allocations: 254.672 MiB)
  3.024995 seconds (138.94 k allocations: 474.744 MiB, 0.16% gc time)
  3.025557 seconds (139.09 k allocations: 474.752 MiB, 0.16% gc time)
  3.024744 seconds (138.36 k allocations: 474.714 MiB, 0.16% gc time)
  3.024852 seconds (138.59 k allocations: 474.729 MiB, 0.16% gc time)
  3.024794 seconds (138.48 k allocations: 474.724 MiB, 0.16% gc time)
  3.024834 seconds (138.58 k allocations: 474.731 MiB, 0.16% gc time)
  3.109919 seconds (141.88 k allocations: 535.341 MiB, 0.15% gc time)
  3.894247 seconds (179.81 k allocations: 673.940 MiB, 0.12% gc time)
  4.105078 seconds (187.39 k allocations: 693.591 MiB, 0.12% gc time)
  5.956837 seconds (270.63 k allocations: 993.055 MiB, 0.08% gc time)
  6.894239 seconds (312.51 k allocations: 1.098 GiB, 0.07% gc time)
  7.672049 seconds (348.05 k allocations: 1.249 GiB, 0.06% gc time)
  8.882742 seconds (400.40 k allocations: 1.415 GiB, 0.05% gc time)
 10.082127 seconds (453.07 k allocations: 1.570 GiB, 0.05% gc time)
 10.105748 seconds (454.01 k allocations: 1.594 GiB, 0.05% gc time)
 10.152388 seconds (457.25 k allocations: 1.616 GiB, 0.05% gc time)
 13.815763 seconds (612.72 k allocations: 2.178 GiB, 0.03% gc time)
 14.774654 seconds (654.03 k allocations: 2.333 GiB, 0.03% gc time)
 16.204886 seconds (710.15 k allocations: 2.525 GiB, 0.19% gc time)
 17.247796 seconds (754.17 k allocations: 2.716 GiB, 0.18% gc time)
 18.236703 seconds (795.82 k allocations: 2.904 GiB, 0.17% gc time)
 19.524913 seconds (848.88 k allocations: 3.050 GiB, 0.16% gc time)
 20.563678 seconds (894.77 k allocations: 3.166 GiB, 0.15% gc time)
 20.576568 seconds (894.52 k allocations: 3.194 GiB, 0.15% gc time)
 22.478309 seconds (975.11 k allocations: 3.539 GiB, 0.14% gc time)
 22.500459 seconds (975.87 k allocations: 3.564 GiB, 0.14% gc time)
 24.344022 seconds (1.06 M allocations: 3.784 GiB, 0.13% gc time)
 24.364351 seconds (1.06 M allocations: 3.808 GiB, 0.13% gc time)
 25.924377 seconds (1.13 M allocations: 4.070 GiB, 0.12% gc time)
 26.861296 seconds (1.18 M allocations: 4.186 GiB, 0.11% gc time)
 26.877835 seconds (1.18 M allocations: 4.207 GiB, 0.11% gc time)
 28.733099 seconds (1.26 M allocations: 4.495 GiB, 0.11% gc time)
 29.448074 seconds (1.29 M allocations: 4.633 GiB, 0.10% gc time)
 29.970573 seconds (1.31 M allocations: 4.703 GiB, 0.10% gc time)
 30.001763 seconds (1.31 M allocations: 4.724 GiB, 0.10% gc time)
 25.983917 seconds (1.13 M allocations: 4.088 GiB, 0.12% gc time)
 24.410209 seconds (1.06 M allocations: 3.832 GiB, 0.12% gc time)
 24.545431 seconds (1.07 M allocations: 3.855 GiB, 0.12% gc time)
 22.541355 seconds (977.00 k allocations: 3.586 GiB, 0.14% gc time)
 20.589363 seconds (895.02 k allocations: 3.224 GiB, 0.15% gc time)
 20.683434 seconds (898.20 k allocations: 3.251 GiB, 0.15% gc time)
 19.666868 seconds (855.40 k allocations: 3.080 GiB, 0.15% gc time)
 17.282256 seconds (755.40 k allocations: 2.748 GiB, 0.18% gc time)
 14.818857 seconds (655.14 k allocations: 2.364 GiB, 0.03% gc time)
  4.372033 seconds (198.08 k allocations: 732.993 MiB, 0.11% gc time)
  4.673303 seconds (210.14 k allocations: 753.284 MiB, 0.10% gc time)
  4.745032 seconds (212.39 k allocations: 772.561 MiB, 0.10% gc time)
  4.865894 seconds (219.56 k allocations: 790.372 MiB, 0.10% gc time)
  4.905643 seconds (220.96 k allocations: 809.802 MiB, 0.10% gc time)
  5.301046 seconds (238.93 k allocations: 881.393 MiB, 0.09% gc time)
  5.423248 seconds (243.31 k allocations: 900.309 MiB, 0.09% gc time)
  6.461030 seconds (291.76 k allocations: 1009.892 MiB, 0.07% gc time)
  6.489308 seconds (293.82 k allocations: 1.005 GiB, 0.07% gc time)
  6.554376 seconds (295.84 k allocations: 1.026 GiB, 0.07% gc time)
  6.562033 seconds (295.86 k allocations: 1.044 GiB, 0.07% gc time)
  6.985161 seconds (315.61 k allocations: 1.115 GiB, 0.07% gc time)
  7.224906 seconds (327.21 k allocations: 1.169 GiB, 0.07% gc time)
  8.045138 seconds (363.88 k allocations: 1.268 GiB, 0.06% gc time)
  8.316703 seconds (375.90 k allocations: 1.341 GiB, 0.06% gc time)
  8.359597 seconds (377.27 k allocations: 1.362 GiB, 0.06% gc time)
  9.585860 seconds (431.03 k allocations: 1.491 GiB, 0.05% gc time)
 10.470259 seconds (468.54 k allocations: 1.641 GiB, 0.05% gc time)
 10.693356 seconds (477.47 k allocations: 1.694 GiB, 0.04% gc time)
 10.797540 seconds (481.09 k allocations: 1.722 GiB, 0.04% gc time)
 11.227947 seconds (500.96 k allocations: 1.805 GiB, 0.04% gc time)
 11.798418 seconds (525.31 k allocations: 1.917 GiB, 0.04% gc time)
 12.838558 seconds (571.34 k allocations: 2.003 GiB, 0.04% gc time)
 12.954962 seconds (575.40 k allocations: 2.032 GiB, 0.04% gc time)
 13.468266 seconds (598.19 k allocations: 2.090 GiB, 0.04% gc time)
 13.619779 seconds (603.86 k allocations: 2.118 GiB, 0.03% gc time)
 14.280381 seconds (632.41 k allocations: 2.272 GiB, 0.03% gc time)
 15.552188 seconds (685.32 k allocations: 2.461 GiB, 0.03% gc time)
 16.665511 seconds (729.84 k allocations: 2.622 GiB, 0.18% gc time)
 17.104919 seconds (748.93 k allocations: 2.654 GiB, 0.18% gc time)
 17.702257 seconds (773.03 k allocations: 2.812 GiB, 0.17% gc time)
 18.827440 seconds (821.01 k allocations: 2.992 GiB, 0.16% gc time)
 20.839953 seconds (904.11 k allocations: 3.279 GiB, 0.15% gc time)
 21.140093 seconds (917.03 k allocations: 3.354 GiB, 0.14% gc time)
 21.448693 seconds (928.68 k allocations: 3.403 GiB, 0.14% gc time)
 23.109302 seconds (1.00 M allocations: 3.629 GiB, 0.13% gc time)
 23.437409 seconds (1.02 M allocations: 3.672 GiB, 0.13% gc time)
 23.716347 seconds (1.03 M allocations: 3.742 GiB, 0.13% gc time)
 24.697471 seconds (1.07 M allocations: 3.879 GiB, 0.12% gc time)
 24.900346 seconds (1.08 M allocations: 3.938 GiB, 0.12% gc time)
 25.121958 seconds (1.09 M allocations: 3.996 GiB, 0.12% gc time)
 26.241008 seconds (1.15 M allocations: 4.128 GiB, 0.12% gc time)
 26.437471 seconds (1.15 M allocations: 4.166 GiB, 0.12% gc time)
 27.156992 seconds (1.19 M allocations: 4.294 GiB, 0.11% gc time)
 27.163067 seconds (1.19 M allocations: 4.313 GiB, 0.11% gc time)
 27.703696 seconds (1.21 M allocations: 4.385 GiB, 0.11% gc time)
 28.238644 seconds (1.24 M allocations: 4.452 GiB, 0.11% gc time)
 28.818694 seconds (1.26 M allocations: 4.521 GiB, 0.11% gc time)
 28.916111 seconds (1.26 M allocations: 4.542 GiB, 0.11% gc time)
 28.961307 seconds (1.27 M allocations: 4.566 GiB, 0.11% gc time)
 30.179781 seconds (1.32 M allocations: 4.748 GiB, 0.10% gc time)
 30.315067 seconds (1.33 M allocations: 4.772 GiB, 0.10% gc time)
 30.818377 seconds (1.33 M allocations: 4.796 GiB, 0.10% gc time)

So looking at this, it seems to me that several tasks are spending time waiting for others to finish.

Can someone tell me, in general: is HTTP.jl more to blame here, or Julia itself?

In other words, without rewriting HTTP.jl, can we fix Julia and somehow get a 10x speedup for free? Is that how bad Julia's (libuv) I/O is?

Or is the blame 50/50 between HTTP.jl and Julia?

2 Likes

You can only open at most connection_limit TCP sockets concurrently to a single host, no matter whether you use asyncmap, because the connection pool is shared globally. This parameter is set to 8 by default, so it's quite natural that they wait for each other. In Python I can open 100 sockets, because there is no limit by default (I checked by using lsof to show all the sockets).

As I have said repeatedly, increasing the connection_limit made little to no difference in my results. Here it is again, but set to 100:

julia> asyncmap(urls) do url
           @time HTTP.request("GET", url, status_exception=false, connection_limit=100).body;
       end
  1.627150 seconds (180.25 k allocations: 239.652 MiB, 7.50% compilation time)
  1.971745 seconds (196.17 k allocations: 305.103 MiB, 6.19% compilation time)
  2.637319 seconds (905.60 k allocations: 370.130 MiB, 21.77% compilation time)
  3.127438 seconds (260.31 k allocations: 415.653 MiB, 3.90% compilation time)
  4.165361 seconds (301.74 k allocations: 616.611 MiB, 2.93% compilation time)
  4.191165 seconds (303.19 k allocations: 653.327 MiB, 2.91% compilation time)
  3.268541 seconds (267.56 k allocations: 459.963 MiB, 3.73% compilation time)
  6.561413 seconds (408.64 k allocations: 1.017 GiB, 1.86% compilation time)
  6.654041 seconds (412.41 k allocations: 1.034 GiB, 1.83% compilation time)
  8.393001 seconds (493.65 k allocations: 1.319 GiB, 1.45% compilation time)
  8.628257 seconds (504.08 k allocations: 1.338 GiB, 1.41% compilation time)
 10.653512 seconds (596.00 k allocations: 1.659 GiB, 1.15% compilation time)
  3.355595 seconds (271.24 k allocations: 485.215 MiB, 3.64% compilation time)
 13.306050 seconds (711.41 k allocations: 2.095 GiB, 0.92% compilation time)
 14.634454 seconds (765.75 k allocations: 2.327 GiB, 0.83% compilation time)
 15.498261 seconds (799.35 k allocations: 2.421 GiB, 0.79% compilation time)
 17.195515 seconds (871.05 k allocations: 2.644 GiB, 0.71% compilation time)
 15.731048 seconds (810.11 k allocations: 2.517 GiB, 0.78% compilation time)
 19.946894 seconds (981.93 k allocations: 3.114 GiB, 0.43% gc time, 0.61% compilation time)
  4.070846 seconds (298.95 k allocations: 551.357 MiB, 3.00% compilation time)
 22.267913 seconds (1.08 M allocations: 3.523 GiB, 0.38% gc time, 0.55% compilation time)
 23.687561 seconds (1.15 M allocations: 3.711 GiB, 0.36% gc time, 0.52% compilation time)
  4.103142 seconds (299.89 k allocations: 574.344 MiB, 2.97% compilation time)
 25.849148 seconds (1.25 M allocations: 4.014 GiB, 0.33% gc time, 0.47% compilation time)
 25.896188 seconds (1.25 M allocations: 4.035 GiB, 0.33% gc time, 0.47% compilation time)
 25.928759 seconds (1.25 M allocations: 4.060 GiB, 0.33% gc time, 0.47% compilation time)
 28.355349 seconds (1.36 M allocations: 4.436 GiB, 0.30% gc time, 0.43% compilation time)
 28.533216 seconds (1.37 M allocations: 4.481 GiB, 0.30% gc time, 0.43% compilation time)
 30.385794 seconds (1.46 M allocations: 4.749 GiB, 0.28% gc time, 0.40% compilation time)
 30.644589 seconds (1.47 M allocations: 4.775 GiB, 0.28% gc time, 0.40% compilation time)
 31.335957 seconds (1.50 M allocations: 4.917 GiB, 0.27% gc time, 0.39% compilation time)
 26.051017 seconds (1.26 M allocations: 4.079 GiB, 0.33% gc time, 0.47% compilation time)
 26.158018 seconds (1.26 M allocations: 4.098 GiB, 0.33% gc time, 0.47% compilation time)
  4.128012 seconds (300.65 k allocations: 596.515 MiB, 2.96% compilation time)
  4.656742 seconds (324.90 k allocations: 730.474 MiB, 2.62% compilation time)
  4.868796 seconds (333.89 k allocations: 767.838 MiB, 2.51% compilation time)
  5.180329 seconds (349.92 k allocations: 853.447 MiB, 2.36% compilation time)
  5.947475 seconds (382.07 k allocations: 893.663 MiB, 2.05% compilation time)
  6.018952 seconds (384.41 k allocations: 911.263 MiB, 2.03% compilation time)
  6.056731 seconds (385.88 k allocations: 930.552 MiB, 2.01% compilation time)
  6.178245 seconds (391.15 k allocations: 968.535 MiB, 1.97% compilation time)
  6.209499 seconds (392.28 k allocations: 988.743 MiB, 1.96% compilation time)
  6.764227 seconds (417.17 k allocations: 1.050 GiB, 1.80% compilation time)
  7.688869 seconds (461.71 k allocations: 1.173 GiB, 1.59% compilation time)
  7.709633 seconds (462.27 k allocations: 1.191 GiB, 1.58% compilation time)
  7.884968 seconds (468.87 k allocations: 1.210 GiB, 1.55% compilation time)
  7.998217 seconds (475.02 k allocations: 1.244 GiB, 1.53% compilation time)
  8.281926 seconds (489.03 k allocations: 1.302 GiB, 1.47% compilation time)
  8.657492 seconds (505.06 k allocations: 1.359 GiB, 1.41% compilation time)
  9.435942 seconds (542.20 k allocations: 1.475 GiB, 1.29% compilation time)
  9.778021 seconds (557.47 k allocations: 1.546 GiB, 1.25% compilation time)
  9.894134 seconds (561.71 k allocations: 1.570 GiB, 1.23% compilation time)
 10.729999 seconds (598.62 k allocations: 1.684 GiB, 1.14% compilation time)
 11.026313 seconds (610.35 k allocations: 1.710 GiB, 1.11% compilation time)
 11.203209 seconds (618.80 k allocations: 1.762 GiB, 1.09% compilation time)
 11.286361 seconds (621.52 k allocations: 1.788 GiB, 1.08% compilation time)
 11.739954 seconds (642.38 k allocations: 1.870 GiB, 1.04% compilation time)
 12.310236 seconds (668.70 k allocations: 1.955 GiB, 0.99% compilation time)
 12.580505 seconds (680.82 k allocations: 2.038 GiB, 0.97% compilation time)
 13.934702 seconds (737.03 k allocations: 2.153 GiB, 0.88% compilation time)
 14.045680 seconds (740.61 k allocations: 2.180 GiB, 0.87% compilation time)
 14.121815 seconds (743.19 k allocations: 2.209 GiB, 0.86% compilation time)
 14.389820 seconds (755.57 k allocations: 2.267 GiB, 0.85% compilation time)
 16.536674 seconds (845.69 k allocations: 2.612 GiB, 0.74% compilation time)
 17.371853 seconds (877.23 k allocations: 2.677 GiB, 0.70% compilation time)
 17.600820 seconds (887.59 k allocations: 2.742 GiB, 0.69% compilation time)
 17.891975 seconds (900.03 k allocations: 2.806 GiB, 0.68% compilation time)
 18.036828 seconds (905.58 k allocations: 2.837 GiB, 0.68% compilation time)
 18.844343 seconds (939.05 k allocations: 2.933 GiB, 0.65% compilation time)
 19.107938 seconds (946.71 k allocations: 2.996 GiB, 0.45% gc time, 0.64% compilation time)
 20.577683 seconds (1.01 M allocations: 3.146 GiB, 0.41% gc time, 0.59% compilation time)
 20.599467 seconds (1.01 M allocations: 3.177 GiB, 0.41% gc time, 0.59% compilation time)
 20.838376 seconds (1.02 M allocations: 3.234 GiB, 0.41% gc time, 0.59% compilation time)
 20.905678 seconds (1.02 M allocations: 3.263 GiB, 0.41% gc time, 0.58% compilation time)
 21.135316 seconds (1.03 M allocations: 3.318 GiB, 0.40% gc time, 0.58% compilation time)
 21.707223 seconds (1.05 M allocations: 3.371 GiB, 0.39% gc time, 0.56% compilation time)
 21.718757 seconds (1.06 M allocations: 3.400 GiB, 0.39% gc time, 0.56% compilation time)
 23.085712 seconds (1.12 M allocations: 3.571 GiB, 0.37% gc time, 0.53% compilation time)
 23.174477 seconds (1.12 M allocations: 3.596 GiB, 0.37% gc time, 0.53% compilation time)
 23.280126 seconds (1.13 M allocations: 3.619 GiB, 0.37% gc time, 0.52% compilation time)
 23.367401 seconds (1.13 M allocations: 3.644 GiB, 0.37% gc time, 0.52% compilation time)
 24.094616 seconds (1.17 M allocations: 3.756 GiB, 0.35% gc time, 0.51% compilation time)
 24.130443 seconds (1.17 M allocations: 3.777 GiB, 0.35% gc time, 0.51% compilation time)
 24.661552 seconds (1.19 M allocations: 3.889 GiB, 0.35% gc time, 0.49% compilation time)
 25.413953 seconds (1.23 M allocations: 3.932 GiB, 0.34% gc time, 0.48% compilation time)
 25.780707 seconds (1.25 M allocations: 3.991 GiB, 0.33% gc time, 0.47% compilation time)
 26.518146 seconds (1.28 M allocations: 4.154 GiB, 0.32% gc time, 0.46% compilation time)
 26.987666 seconds (1.30 M allocations: 4.269 GiB, 0.32% gc time, 0.45% compilation time)
 27.850579 seconds (1.34 M allocations: 4.311 GiB, 0.31% gc time, 0.44% compilation time)
 27.971130 seconds (1.35 M allocations: 4.352 GiB, 0.31% gc time, 0.44% compilation time)
 28.011825 seconds (1.35 M allocations: 4.371 GiB, 0.30% gc time, 0.44% compilation time)
 28.075721 seconds (1.35 M allocations: 4.391 GiB, 0.30% gc time, 0.43% compilation time)
 28.208013 seconds (1.36 M allocations: 4.412 GiB, 0.30% gc time, 0.43% compilation time)
 29.114576 seconds (1.40 M allocations: 4.616 GiB, 0.29% gc time, 0.42% compilation time)
 29.582178 seconds (1.42 M allocations: 4.640 GiB, 0.29% gc time, 0.41% compilation time)
 29.940283 seconds (1.44 M allocations: 4.684 GiB, 0.29% gc time, 0.41% compilation time)
 30.273651 seconds (1.45 M allocations: 4.728 GiB, 0.28% gc time, 0.40% compilation time)
 30.674875 seconds (1.47 M allocations: 4.796 GiB, 0.28% gc time, 0.40% compilation time)
 30.824184 seconds (1.48 M allocations: 4.844 GiB, 0.28% gc time, 0.40% compilation time)
 31.270775 seconds (1.50 M allocations: 4.893 GiB, 0.27% gc time, 0.39% compilation time)

Sorry to double post but I decided to try checking lsof to see open sockets and I see:

lsof -c julia | grep s3-us-west-2-r-w.amazonaws.com:https
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
julia   271901 goodman   20u     IPv4 7561660       0t0          TCP <redacted>:45064->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   21u     IPv4 7561661       0t0          TCP <redacted>:45080->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   22u     IPv4 7561662       0t0          TCP <redacted>:45082->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   23u     IPv4 7561663       0t0          TCP <redacted>:45084->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   24u     IPv4 7561664       0t0          TCP <redacted>:45088->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   25u     IPv4 7561665       0t0          TCP <redacted>:45092->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   26u     IPv4 7561666       0t0          TCP <redacted>:45094->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)
julia   271901 goodman   27u     IPv4 7561667       0t0          TCP <redacted>:45098->s3-us-west-2-r-w.amazonaws.com:https (ESTABLISHED)

I am always seeing 8 sockets no matter what I set connection_limit to. So perhaps a bug in HTTP.jl?

1 Like

Alright, I think I have managed to make some pretty important observations:

  1. HTTP.jl seems to ignore connection_limit after the first time it is set.
  2. Regardless of how high connection_limit gets set, I only see around 20-30 concurrent connections max.
  3. Increasing connection_limit didn't improve the running time to process all 100 requests (if anything, I am seeing slightly worse results) but made more of them finish at the same time (e.g., more tasks starting concurrently but taking proportionally longer to finish). Example:
julia> asyncmap(urls) do url
           @time HTTP.request("GET", url, status_exception=false, connection_limit=100).body;
       end
  0.948380 seconds (33.16 k allocations: 139.280 MiB)
  2.961647 seconds (103.75 k allocations: 457.984 MiB)
  4.462301 seconds (160.76 k allocations: 725.730 MiB)
  4.818988 seconds (176.62 k allocations: 804.385 MiB)
  8.271751 seconds (292.88 k allocations: 1.305 GiB)
  9.714883 seconds (345.51 k allocations: 1.544 GiB)
 11.498342 seconds (410.16 k allocations: 1.869 GiB)
 14.208980 seconds (509.78 k allocations: 2.293 GiB)
 14.333130 seconds (514.10 k allocations: 2.321 GiB)
 14.497057 seconds (520.65 k allocations: 2.350 GiB)
 21.893111 seconds (784.32 k allocations: 3.599 GiB)
 24.284558 seconds (870.61 k allocations: 3.985 GiB)
 24.352982 seconds (872.85 k allocations: 4.010 GiB)
 25.268930 seconds (913.36 k allocations: 4.128 GiB)
 25.305595 seconds (914.71 k allocations: 4.153 GiB)
 25.339386 seconds (915.59 k allocations: 4.177 GiB)
 33.214921 seconds (1.19 M allocations: 5.444 GiB)
 33.968569 seconds (1.22 M allocations: 5.570 GiB)
 34.240627 seconds (1.24 M allocations: 5.621 GiB)
 34.240651 seconds (1.24 M allocations: 5.621 GiB)
 34.685784 seconds (1.25 M allocations: 5.706 GiB)
 34.240617 seconds (1.24 M allocations: 5.621 GiB)
 34.240648 seconds (1.24 M allocations: 5.621 GiB)
 34.240607 seconds (1.24 M allocations: 5.621 GiB)
 34.240508 seconds (1.24 M allocations: 5.621 GiB)
 34.240496 seconds (1.24 M allocations: 5.621 GiB)
 34.240484 seconds (1.24 M allocations: 5.621 GiB)
 34.240468 seconds (1.24 M allocations: 5.621 GiB)
 34.240479 seconds (1.24 M allocations: 5.621 GiB)
 34.240379 seconds (1.24 M allocations: 5.621 GiB)
 34.240421 seconds (1.24 M allocations: 5.621 GiB)
 34.240342 seconds (1.24 M allocations: 5.621 GiB)
 34.240375 seconds (1.24 M allocations: 5.621 GiB)
 34.240338 seconds (1.24 M allocations: 5.621 GiB)
 34.240291 seconds (1.24 M allocations: 5.621 GiB)
 34.240328 seconds (1.24 M allocations: 5.621 GiB)
 34.240290 seconds (1.24 M allocations: 5.621 GiB)
 34.240277 seconds (1.24 M allocations: 5.621 GiB)
 34.240270 seconds (1.24 M allocations: 5.621 GiB)
 34.240261 seconds (1.24 M allocations: 5.621 GiB)
 34.240231 seconds (1.24 M allocations: 5.621 GiB)
 34.240199 seconds (1.24 M allocations: 5.621 GiB)
 34.240186 seconds (1.24 M allocations: 5.621 GiB)
 34.240178 seconds (1.24 M allocations: 5.621 GiB)
 34.240127 seconds (1.24 M allocations: 5.621 GiB)
 34.240153 seconds (1.24 M allocations: 5.621 GiB)
 34.240123 seconds (1.24 M allocations: 5.621 GiB)
 34.240095 seconds (1.24 M allocations: 5.621 GiB)
 34.240074 seconds (1.24 M allocations: 5.621 GiB)
 34.240042 seconds (1.24 M allocations: 5.621 GiB)
 34.240032 seconds (1.24 M allocations: 5.621 GiB)
 34.239964 seconds (1.24 M allocations: 5.621 GiB)
 34.239936 seconds (1.24 M allocations: 5.621 GiB)
 34.239922 seconds (1.24 M allocations: 5.621 GiB)
 34.239971 seconds (1.24 M allocations: 5.621 GiB)
 34.239922 seconds (1.24 M allocations: 5.621 GiB)
 34.240345 seconds (1.24 M allocations: 5.621 GiB)
 34.239917 seconds (1.24 M allocations: 5.621 GiB)
 34.239910 seconds (1.24 M allocations: 5.621 GiB)
 34.239897 seconds (1.24 M allocations: 5.621 GiB)
 34.239884 seconds (1.24 M allocations: 5.621 GiB)
 34.239874 seconds (1.24 M allocations: 5.621 GiB)
 34.239862 seconds (1.24 M allocations: 5.621 GiB)
 34.239845 seconds (1.24 M allocations: 5.621 GiB)
 34.239830 seconds (1.24 M allocations: 5.621 GiB)
 34.239819 seconds (1.24 M allocations: 5.621 GiB)
 34.239808 seconds (1.24 M allocations: 5.621 GiB)
 34.239777 seconds (1.24 M allocations: 5.621 GiB)
 34.239765 seconds (1.24 M allocations: 5.621 GiB)
 34.239754 seconds (1.24 M allocations: 5.621 GiB)
 34.239742 seconds (1.24 M allocations: 5.621 GiB)
 34.239729 seconds (1.24 M allocations: 5.621 GiB)
 34.239557 seconds (1.24 M allocations: 5.621 GiB)
 34.239586 seconds (1.24 M allocations: 5.621 GiB)
 34.239635 seconds (1.24 M allocations: 5.621 GiB)
 34.239624 seconds (1.24 M allocations: 5.621 GiB)
 34.239571 seconds (1.24 M allocations: 5.621 GiB)
 34.239558 seconds (1.24 M allocations: 5.621 GiB)
 34.239546 seconds (1.24 M allocations: 5.621 GiB)
 34.239534 seconds (1.24 M allocations: 5.621 GiB)
 34.239525 seconds (1.24 M allocations: 5.621 GiB)
 34.239516 seconds (1.24 M allocations: 5.621 GiB)
 34.239504 seconds (1.24 M allocations: 5.621 GiB)
 34.239495 seconds (1.24 M allocations: 5.621 GiB)
 34.239483 seconds (1.24 M allocations: 5.621 GiB)
 34.239472 seconds (1.24 M allocations: 5.621 GiB)
 34.239459 seconds (1.24 M allocations: 5.621 GiB)
 34.239450 seconds (1.24 M allocations: 5.621 GiB)
 34.239442 seconds (1.24 M allocations: 5.621 GiB)
 34.239424 seconds (1.24 M allocations: 5.621 GiB)
 34.239410 seconds (1.24 M allocations: 5.621 GiB)
 34.239390 seconds (1.24 M allocations: 5.621 GiB)
 34.239376 seconds (1.24 M allocations: 5.621 GiB)
 34.239362 seconds (1.24 M allocations: 5.621 GiB)
 34.239335 seconds (1.24 M allocations: 5.621 GiB)
 34.239305 seconds (1.24 M allocations: 5.621 GiB)
 34.239266 seconds (1.24 M allocations: 5.621 GiB)
 34.417358 seconds (1.24 M allocations: 5.648 GiB)
 34.957085 seconds (1.27 M allocations: 5.730 GiB)
 39.520428 seconds (1.30 M allocations: 5.754 GiB)

This all suggests that HTTP.jl isn't behaving as you would expect based on the connection_limit that gets set.

On the other hand, I have confirmed my python code is indeed opening 100 connections by default, so I decided to see if reducing the number of allowed open connections would make a difference. Turns out it does not! (And I verified this with lsof of course).

Modified code to change connection limit:

keys = [f'analysed_sst/{i}.1.0' for i in range(100)]

async def get_items(keys, limit=100):
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = []
        for ckey in keys:
            url = f"https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1/{ckey}"
            tasks.append(asyncio.create_task(get_cdata(session, url)))
        cdatas = await asyncio.gather(*tasks)
        return cdatas

async def get_cdata(session, url):
    async with session.get(url) as resp:
        cdata = await resp.read()
        return cdata

In [18]: %time cdatas = asyncio.run(get_items(keys, limit=50))
CPU times: user 3.41 s, sys: 2.25 s, total: 5.65 s
Wall time: 6.29 s

In [19]: %time cdatas = asyncio.run(get_items(keys, limit=25))
CPU times: user 3.12 s, sys: 1.57 s, total: 4.69 s
Wall time: 6.26 s

In [20]: %time cdatas = asyncio.run(get_items(keys, limit=10))
CPU times: user 3.3 s, sys: 1.18 s, total: 4.47 s
Wall time: 5.8 s

In [21]: %time cdatas = asyncio.run(get_items(keys, limit=10))
CPU times: user 3.37 s, sys: 1.22 s, total: 4.59 s
Wall time: 5.43 s
5 Likes

Note that HTTP.jl's default connection limit can be set via

using HTTP
HTTP.ConnectionPool.set_default_connection_limit!(100)
1 Like

Since there are too many possible causes that may lead to this problem (libuv, Julia's I/O, or HTTP.jl), and most of us don't have a high-speed network, can you try some non-SSL tests using only Julia's basic socket programming? For example:

For julia:

using Sockets
raw="""
GET /20MB.zip HTTP/1.0
Host: ipv4.download.thinkbroadband.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 0
Content-Length: 0

"""
const req = replace(raw, '\n' => "\r\n")
struct Config
    ntask::Int
    nbatch::Int
end
function test(c::Config)
    host = Sockets.getaddrinfo("ipv4.download.thinkbroadband.com")
    @sync for i in 1:c.ntask
        @async begin
            for j in 1:c.nbatch
                s = TCPSocket()
                Sockets.connect!(s, host, 80)
                write(s, req)
                println("Send of $i:$j")
                bytes = read(s)
                println("Finish of $i:$j")
            end
            println("Finish of Group:$i")
        end
    end
end
test(Config(100#=ntask=#, 1#=nbatch=#))

For python:

import aiohttp
import asyncio
async def get_items(keys):
    async with aiohttp.ClientSession(auto_decompress=False) as session:
        tasks = []
        for _ in keys:
            url = f"http://ipv4.download.thinkbroadband.com/20MB.zip"
            tasks.append(asyncio.create_task(get_cdata(session, url)))
        cdatas = await asyncio.gather(*tasks)
        return cdatas

async def get_cdata(session, url):
    async with session.get(url) as resp:
        cdata = await resp.read()
        return cdata

cdatas = asyncio.run(get_items([i for i in range(100)]))

These two scripts download 100 20 MB files from ipv4.download.thinkbroadband.com (I picked that site at random; I recommend trying other sites too). I have a relatively slow network (10 MB/s), and I found there's no bandwidth difference between these two scripts.

I suspect the problem is in HTTP.jl. Based on the observations so far, Python can process the requests with fewer sockets, and Julia can achieve the same speed with more threads, so libuv may not be the bottleneck here. It seems that HTTP.jl can't handle all the responses in time, so the sender is forced to transfer packets more slowly to prevent buffer overflow.

Yeah, it seems that HTTP.jl always waits on some tasks to finish instead of spawning all the tasks together. Not quite sure what happened here.

1 Like