I want to make a large number of REST calls to download files. Each call is slow and I can only make a certain number of concurrent downloads (say N).
I would like to max out on the number of concurrent calls. I imagine making a new call as soon as one of the existing calls finishes.
From my limited understanding of Channels, they sound like the way to go, but I cannot figure out how to implement my idea. I have looked at ThreadPools and topics like this one.
Currently it is only downloading and not parsing, so I think coroutines are sufficient.
Any input is appreciated. Let me know if I should provide more info.
What is a ‘large number’? Which OS are you using? If Linux/Unix, check on your limits.
There are limits on the maximum number of open files (connections) on a system. I learnt this the hard way (long story).
You can tune these limits higher, and also alter the values of tcp_tw_recycle and tcp_tw_reuse, which let closed connections be recycled faster.
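For example, on Linux something like this rough sketch (shelling out from Julia; the sysctl write needs root, and note that tcp_tw_recycle no longer exists on newer kernels) lets you inspect and adjust those knobs:

run(`sysctl fs.file-max`)                  # system-wide open-file limit
run(`sysctl net.ipv4.tcp_tw_reuse`)        # current TIME_WAIT reuse setting
# run(`sysctl -w net.ipv4.tcp_tw_reuse=1`) # allow faster reuse of TIME_WAIT sockets (root)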
import Downloads

urls = ["https://julialang.org" for _ in 1:100]
max_concurrent = 4
jobs = Channel(max_concurrent)

@sync begin
    @async begin
        # producer: feed every url into the channel, then close it
        # so the consumer loops below know when to stop
        foreach(url -> put!(jobs, url), urls)
        close(jobs)
    end
    for _ in 1:max_concurrent
        # consumers: spawn max_concurrent tasks that all pull urls
        # from the same channel until it is closed and drained
        @async for url in jobs
            r = Downloads.request(url)
            @show r.status
        end
    end
end
Thanks for your answer – it looks like it sort of does what I want.
But it appears to make chunks of max_concurrent async calls and then wait until they are all done before moving on to the next chunk, at least if I add a sleep:
@async for url in jobs
    sleep(1)
    r = Downloads.request(url)
    @show r.status
end
Is it possible to pick up entry max_concurrent + 1 from jobs as soon as the first download succeeds?
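One way to check what actually happens is a timing experiment like this sketch (the varied sleep times stand in for downloads of different lengths, so tasks free up at different moments):

using Dates

jobs = Channel(4)
@sync begin
    @async begin
        foreach(i -> put!(jobs, i), 1:12)
        close(jobs)
    end
    for t in 1:4
        @async for i in jobs
            println("task $t picked up job $i at $(now())")
            sleep(0.5 * t)   # stand-in for a download; slower for higher t
        end
    end
end

If tasks really pick up new entries as soon as they finish, task 1 should be grabbing jobs 5, 6, … while the slower tasks are still busy.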
Also, I don’t understand why the second loop
for _ in 1:max_concurrent
    # spawn max_concurrent tasks
    @async for url in jobs
        r = Downloads.request(url)
        @show r.status
    end
end
actually behaves the way it does. Omitting the outer loop and only keeping
@async for url in jobs
    r = Downloads.request(url)
    @show r.status
end
it appears to download all the URLs in one go. How does the outer loop change this? It would be great if you could elaborate on this.
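If I understand the pattern correctly, each @async for url in jobs is a single sequential consumer, and the outer loop just starts max_concurrent of them sharing one channel. A toy comparison (sleep standing in for a download) should make the difference visible:

function consume(n)
    ch = Channel(4)
    @async begin
        foreach(i -> put!(ch, i), 1:8)
        close(ch)
    end
    t0 = time()
    @sync for _ in 1:n
        @async for i in ch
            sleep(0.5)   # stand-in for a download
        end
    end
    println("$n consumer(s): $(round(time() - t0, digits=1)) s")
end

consume(1)   # ~4 s expected: one task handles all 8 items sequentially
consume(4)   # ~1 s expected: up to four items in flight at a time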
Edit: Even if asyncmap can solve the problem at hand, I would still love to learn more about tasks/channels if you have time to explain.
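For reference, a minimal asyncmap version (assuming the same urls vector as above; ntasks is the knob that caps concurrency) would look like:

import Downloads

urls = ["https://julialang.org" for _ in 1:100]
# ntasks caps how many tasks run concurrently, mirroring the
# bounded concurrency of the channel version above
responses = asyncmap(url -> Downloads.request(url), urls; ntasks=4)
foreach(r -> @show(r.status), responses)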