Limited parallel downloads

I want to make a large number of REST calls to download files. Each call is slow and I can only make a certain number of concurrent downloads (say N).

I would like to max out on the number of concurrent calls. I imagine making a new call as soon as one of the existing calls finishes.

From my limited understanding of Channels, they sound like the way to go, but I cannot figure out how to implement my idea. I have looked at ThreadPools and topics like this one.
Currently it is only downloading and not parsing, so I think coroutines are sufficient.

Any input is appreciated. Let me know if I should provide more info.

Thanks!


What is a ‘large number’? Which OS are you using? If Linux/Unix, check on your limits.

There are limits on the maximum number of open files (connections) on a system. I learnt this the hard way (long story).
You can tune these limits higher, and also alter the values of tcp_tw_recycle and tcp_tw_reuse so that closed connections can be recycled faster.
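
If it helps, here is one way to inspect the current values from Julia (a rough sketch; the paths are Linux-specific, and tcp_tw_recycle has been removed from recent kernels, so that file may not exist):

# Inspect the open-file limit and TCP TIME_WAIT settings (Linux only)
println("open-file limit: ", readchomp(`sh -c "ulimit -n"`))
for name in ("tcp_tw_reuse", "tcp_tw_recycle")
    path = "/proc/sys/net/ipv4/$name"
    isfile(path) && println(name, " = ", readchomp(path))
end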

Here is one example:

import Downloads

urls = ["https://julialang.org" for _ in 1:100]
max_concurrent = 4
jobs = Channel(max_concurrent)

@sync begin
    @async begin
        # create the jobs
        foreach(url -> put!(jobs, url), urls)
        close(jobs)
    end
    for _ in 1:max_concurrent
        # spawn max_concurrent tasks
        @async for url in jobs
            r = Downloads.request(url)
            @show r.status
        end
    end
end
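
If you also want the responses back instead of just printing the status, one possible variation (a sketch along the same lines) is to have the consumers push results into a second, unbounded channel:

import Downloads

urls = ["https://julialang.org" for _ in 1:100]
max_concurrent = 4
jobs = Channel{String}(max_concurrent)
results = Channel{Any}(Inf)  # unbounded, so consumers never block when storing results

@sync begin
    @async begin
        # producer: feed the urls into the jobs channel, then close it
        foreach(url -> put!(jobs, url), urls)
        close(jobs)
    end
    for _ in 1:max_concurrent
        # consumers: each task takes a new url as soon as its previous download finishes
        @async for url in jobs
            put!(results, Downloads.request(url))
        end
    end
end
close(results)
responses = collect(results)  # completion order, not input order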

Doesn’t asyncmap accomplish something similar?

asyncmap(urls, ntasks=max_concurrent) do u
    r = Downloads.request(u)
    @show r.status
    r
end

Thanks for your answer – it looks like it sort of does what I want 🙂

But it appears to make chunks of max_concurrent async calls and then wait until they are all done before moving on to the next chunk, at least if I add a sleep:

        @async for url in jobs
            sleep(1)
            r = Downloads.request(url)
            @show r.status
        end

Is it possible to pick up entry max_concurrent + 1 from jobs as soon as the first download succeeds?
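
A small experiment to watch the timing might look like this (sleeps of different lengths stand in for the downloads):

jobs = Channel{Int}(4)

@sync begin
    @async begin
        foreach(i -> put!(jobs, i), 1:12)
        close(jobs)
    end
    for t in 1:4
        @async for i in jobs
            println("task $t starts job $i")
            sleep(isodd(i) ? 0.5 : 2.0)  # stand-in for downloads of different lengths
            println("task $t finished job $i")
        end
    end
end

With unequal durations it should be visible whether a task that finishes a short job grabs the next entry from jobs right away, or waits for the slower tasks.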

Also, I don’t understand why the second loop

    for _ in 1:max_concurrent
        # spawn max_concurrent tasks
        @async for url in jobs
            r = Downloads.request(url)
            @show r.status
        end
    end

actually behaves the way it does. Omitting the outer loop and only keeping

        @async for url in jobs
            r = Downloads.request(url)
            @show r.status
        end

it appears to download all the urls in one go. How does the outer loop change this? It would be great if you could elaborate on this 🙂

Edit: Even if asyncmap can solve the problem at hand, I would still love to learn more about tasks/channels if you have time to explain 🙂
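
One way to probe this might be to count how many "downloads" are in flight at once for different numbers of consumer tasks (a sketch, with sleep standing in for a download):

function peak_concurrency(n_tasks)
    jobs = Channel{Int}(n_tasks)
    active = Ref(0)  # how many tasks are "downloading" right now
    peak = Ref(0)    # the highest value `active` ever reaches
    @sync begin
        @async begin
            foreach(i -> put!(jobs, i), 1:20)
            close(jobs)
        end
        for _ in 1:n_tasks
            @async for i in jobs
                active[] += 1
                peak[] = max(peak[], active[])
                sleep(0.1)  # stand-in for a slow download; yields to the other tasks
                active[] -= 1
            end
        end
    end
    return peak[]
end

@show peak_concurrency(1)  # a single consumer still downloads everything, one at a time
@show peak_concurrency(4)  # four consumers keep four downloads in flight

If I read the output right, a single consumer still downloads all the urls, just one at a time, and the outer loop is what keeps several downloads in flight at once.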

Probably 10_000 calls, and my number of max concurrent calls is in the 10s. I don't know if this causes problems with the OS (Win/Linux).

From the docs it appears to do exactly what I want – thanks!

10 concurrent calls will be fine.