I want to make a large number of REST calls to download files. Each call is slow and I can only make a certain number of concurrent downloads (say N).
I would like to max out on the number of concurrent calls. I imagine making a new call as soon as one of the existing calls finishes.
From my limited understanding of Channels, they sound like the way to go, but I cannot figure out how to implement my idea. I have looked at ThreadPools and topics like this one.
Currently it is only downloading and not parsing, so I think coroutines are sufficient.
Any input is appreciated. Let me know if I should provide more info.
What is a ‘large number’? Which OS are you using? If Linux/Unix, check on your limits.
There are limits on the maximum number of open files (connections) on a system. I learnt this the hard way (long story).
You can tune these limits higher, and also alter the values of tcp_tw_recycle and tcp_tw_reuse, which let closed connections be recycled faster.
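For example, on Linux something like this rough sketch (shelling out from Julia; the sysctl write needs root, and note that tcp_tw_recycle no longer exists on newer kernels) lets you inspect and adjust those knobs:

run(`sysctl fs.file-max`)                  # system-wide open-file limit
run(`sysctl net.ipv4.tcp_tw_reuse`)        # current TIME_WAIT reuse setting
# run(`sysctl -w net.ipv4.tcp_tw_reuse=1`) # allow faster reuse of TIME_WAIT sockets (root)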
import Downloads

urls = ["https://julialang.org" for _ in 1:100]
max_concurrent = 4
jobs = Channel(max_concurrent)

@sync begin
    @async begin
        # producer: feed every url into the channel, then close it
        # so the consumer loops below know when to stop
        foreach(url -> put!(jobs, url), urls)
        close(jobs)
    end
    for _ in 1:max_concurrent
        # consumers: spawn max_concurrent tasks that all pull urls
        # from the same channel until it is closed and drained
        @async for url in jobs
            r = Downloads.request(url)
            @show r.status
        end
    end
end
Thanks for your answer – it looks like it sort of does what I want.
But it appears to make chunks of max_concurrent async calls and then wait until they are all done before moving on to the next chunk, at least if I add a sleep:
@async for url in jobs
    sleep(1)
    r = Downloads.request(url)
    @show r.status
end
Is it possible to pick up entry max_concurrent + 1 from jobs as soon as the first download succeeds?
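One way to check what actually happens is a timing experiment like this sketch (the varied sleep times stand in for downloads of different lengths, so tasks free up at different moments):

using Dates

jobs = Channel(4)
@sync begin
    @async begin
        foreach(i -> put!(jobs, i), 1:12)
        close(jobs)
    end
    for t in 1:4
        @async for i in jobs
            println("task $t picked up job $i at $(now())")
            sleep(0.5 * t)   # stand-in for a download; slower for higher t
        end
    end
end

If tasks really pick up new entries as soon as they finish, task 1 should be grabbing jobs 5, 6, … while the slower tasks are still busy.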
Also, I don’t understand why the second loop
for _ in 1:max_concurrent
    # spawn max_concurrent tasks
    @async for url in jobs
        r = Downloads.request(url)
        @show r.status
    end
end
actually behaves the way it does. Omitting the outer loop and only keeping
@async for url in jobs
    r = Downloads.request(url)
    @show r.status
end
it appears to download all the URLs in one go. How does the outer loop change this? It would be great if you could elaborate on this.
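If I understand the pattern correctly, each @async for url in jobs is a single sequential consumer, and the outer loop just starts max_concurrent of them sharing one channel. A toy comparison (sleep standing in for a download) should make the difference visible:

function consume(n)
    ch = Channel(4)
    @async begin
        foreach(i -> put!(ch, i), 1:8)
        close(ch)
    end
    t0 = time()
    @sync for _ in 1:n
        @async for i in ch
            sleep(0.5)   # stand-in for a download
        end
    end
    println("$n consumer(s): $(round(time() - t0, digits=1)) s")
end

consume(1)   # ~4 s expected: one task handles all 8 items sequentially
consume(4)   # ~1 s expected: up to four items in flight at a time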
Edit: Even if asyncmap can solve the problem at hand, I would still love to learn more about tasks/channels if you have time to explain.
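For reference, a minimal asyncmap version (assuming the same urls vector as above; ntasks is the knob that caps concurrency) would look like:

import Downloads

urls = ["https://julialang.org" for _ in 1:100]
# ntasks caps how many tasks run concurrently, mirroring the
# bounded concurrency of the channel version above
responses = asyncmap(url -> Downloads.request(url), urls; ntasks=4)
foreach(r -> @show(r.status), responses)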