Bottleneck when receiving UDP packets?

tamasgal · June 15, 2020, 8:52am

I hope some network experts can help me here; this is probably not very Julia-related.

My problem is that I am unable to receive more than ~600 UDP packets per second on my MacBook (~900/s on my Xeon). However, I am able to send several thousand per second at a fixed rate.

I am wondering: is this an intrinsic limit of my network/hardware or am I overlooking some settings?

To spam UDP packets at a fixed rate, I use this script, which prints a . whenever it fails to send a packet “on time”, just to see if it can keep up with a constant rate. On my Xeon I can send up to 20000 UDP packets per second with only a few dots per second, which should exclude the sender as the bottleneck.

#!/usr/bin/env julia
using Sockets

if length(ARGS) < 1
    println("Usage: ./send.jl PACKETS_PER_SECOND")
    exit(1)
end

function main()
    target = ip"127.0.0.1"
    port = 10000
    data = rand(UInt8, 244)
    packets_per_second = parse(Int, ARGS[1])

    sock = UDPSocket()

    send_data(sock, target, port, data, packets_per_second)
end

function send_data(sock, target, port, data, packets_per_second)
    delta_t = 1 / packets_per_second  # target interval between packets in seconds
    while true
        now = time()
        send(sock, target, port, data)
        # time left in this interval after sending
        pausefor = now + delta_t - time()
        if pausefor < 0
            # error("Can't keep up with UDP packet rate ($pausefor s)")
            print(".")
            continue
        end
        sleep(pausefor)
    end
end

main()

Running this on my Xeon with ./send.jl 20000 (on my MacBook with 2000) and then executing the following script to count the UDP packets for a given time period:

#!/usr/bin/env julia
using Sockets

if length(ARGS) < 1
    println("Usage: udp.jl TIMEOUT")
    exit(1)
end

function main()
    println("Setting up UDP connection")
    sock = UDPSocket()
    bind(sock, ip"0.0.0.0", 10000)

    timeout = parse(Int64, ARGS[1])

    println("Counting UDP packets for $timeout seconds")
    n = countpackets(sock, timeout)
    println("$n UDP packets recieved")
end

function countpackets(sock, timeout)
    stopat = time() + timeout
    n = 0
    while time() < stopat
        # recv blocks until a packet arrives, so the deadline is only
        # checked between packets
        data = recv(sock)
        n += 1
    end
    n
end

main()

yields:

░ tamasgal@greybox.local:~/tmp/udp took 11s
░ 10:49:10 > ./receive.jl 10
Setting up UDP connection
Counting UDP packets for 10 seconds
6879 UDP packets recieved

I also tried increasing the socket buffer size up to the maximum allowed value using the following code but it did not help:

    arg = Ref{Cint}(...) # this value is doubled on Linux
    Base.uv_error("buffer size",ccall(:uv_recv_buffer_size, Cint, (Ptr{Cvoid}, Ptr{Cint}), sock.handle, arg))

Any ideas why it hits this limit? In Python I see similar numbers btw.

tamasgal · June 15, 2020, 11:37am

I also tried a multi-threaded approach with 4 threads, which increased the number of packets per second from 600 to around 800 on my MacBook but gave no increase on my Xeon, using this script:

#!/usr/bin/env julia
using Sockets

if length(ARGS) < 1
    println("Usage: udp.jl NUMBER_OF_PACKETS")
    exit(1)
end


function main()
    println("Setting up UDP connection")
    sock = UDPSocket()
    bind(sock, ip"0.0.0.0", 10000)

    n = parse(Int64, ARGS[1])

    println("Counting $n UDP packets on $(Threads.nthreads()) threads")
    count_udp(sock, 10)  # warm-up
    @time begin
        count_udp(sock, n)
    end
end


function count_udp(sock, n)
    Threads.@threads for i in 1:n
        data = recv(sock)
    end
end

main()

Here is the output running on my Mac:

░ tamasgal@greybox.local:~/tmp/udp took 3s
░ 13:33:22 > JULIA_NUM_THREADS=4 ./receive_multi_threaded.jl 2000
Setting up UDP connection
Counting 2000 UDP packets on 4 threads
  2.395930 seconds (23.97 k allocations: 1.039 MiB)

tamasgal · June 15, 2020, 11:39am

Btw. I repeated the test also using C++/boost and even with Wireshark. I get the same numbers over and over:

#include <boost/asio.hpp>
#include <boost/array.hpp>
#include <boost/bind.hpp>
#include <thread>
#include <iostream>

#define IPADDRESS "0.0.0.0"
#define UDP_PORT 10000

using boost::asio::ip::udp;
using boost::asio::ip::address;

struct Client {
    boost::asio::io_service io_service;
    udp::socket socket{io_service};
    boost::array<char, 1024> recv_buffer;
    udp::endpoint remote_endpoint;

    int count = 20000;

    void handle_receive(const boost::system::error_code& error, size_t bytes_transferred) {
        if (--count > 0) {
            wait();
        }
    }

    void wait() {
        socket.async_receive_from(boost::asio::buffer(recv_buffer),
            remote_endpoint,
            boost::bind(&Client::handle_receive, this, boost::asio::placeholders::error, boost::asio::placeholders::bytes_transferred));
    }

    void Receiver()
    {
        socket.open(udp::v4());
        socket.bind(udp::endpoint(address::from_string(IPADDRESS), UDP_PORT));

        wait();

        std::cout << "Starting UDP counting\n";
        io_service.run();
        std::cout << "Done\n";
    }
};

int main(int argc, char *argv[])
{
    Client client;
    std::thread r([&] { client.Receiver(); });

    r.join();
}

compiled with g++ -pthread boost_udp.cpp.

And here is a Wireshark sniff, running on my MacBook: (screenshot not included)

zsoerenm · June 15, 2020, 11:43am

We use the Software Defined Radio (SDR) platform from Ettus, which transfers data at high bandwidth over the network using UDP. Here are some suggestions to increase the bandwidth: USRP Hardware Driver and USRP Manual: System Configuration for USRP X3x0 Series
It also mentions that not all network hardware is capable of achieving high bandwidth. So you might have to procure a different network card.

tamasgal · June 15, 2020, 12:05pm

Thanks, I also tried increasing the OS limits, as mentioned in the docs you sent me (sudo sysctl -w net.core.rmem_max=33554432 etc.) but it did not help.

I’ll dig further; meanwhile, maybe someone has a clue. I am really wondering why I can send at a much higher rate but am not able to receive…

tamasgal · June 15, 2020, 12:47pm

Alright, I just tried on our DAQ system and it is indeed related to the net.core.rmem_max setting, since now I am able to keep up with the UDP rates from many sources!

I am, however, not able to reproduce these rates locally: I am able to receive more than 2000 UDP packets on our target system from many (external) network sources, but I am unable to reproduce this via the local loopback.

Anyway, thanks for pointing it out, sometimes one just needs to talk about it a bit more.

dlakelan · June 15, 2020, 12:55pm

What kind of network setup are you using? Is there anything more than just a switch between the sender and receiver? QoS or firewall systems could cause the network to actually drop the packets.

tamasgal · June 15, 2020, 1:03pm

We are using a fiber-optic network system with up to 2070 nodes, each with a 1 Gbps connection, and multiple 10 Gbps DFES uplinks. So the throughput is well handled. This UDP data is just a tiny fraction of the overall data. The server where I analyse the UDP packets in realtime is connected with a 10GbE NIC.

I hit the limit of around 1000 UDP packets/s when we attached more nodes to the network (currently we have 114 out of 2070 attached, each sending at a rate of 10Hz), so I started to investigate on my own machines.

However, as written above, increasing the net.core.rmem_max helps.

It is just a bit annoying that you can set higher values through libuv without any errors and those are just ignored. So I thought I was already using a large buffer, whereas I was actually using the same default buffer size all the time.
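One way to notice a silently ignored request is to read the buffer size back after setting it. The following is a minimal sketch, assuming the same `uv_recv_buffer_size` call used earlier in the thread, where libuv treats a value of 0 as "get" and any other value as "set". `recv_buffer_size` is a hypothetical helper name, and the socket is bound first because the underlying OS socket may not exist before that.

```julia
using Sockets

# Query (value == 0) or set (value > 0) the receive buffer size of a UDP socket
# through libuv, and return the size the OS reports afterwards.
function recv_buffer_size(sock::UDPSocket, value::Integer=0)
    arg = Ref{Cint}(value)
    err = ccall(:uv_recv_buffer_size, Cint, (Ptr{Cvoid}, Ptr{Cint}), sock.handle, arg)
    Base.uv_error("uv_recv_buffer_size", err)
    # After a set, ask again to see what was actually applied.
    return value == 0 ? Int(arg[]) : recv_buffer_size(sock, 0)
end

sock = UDPSocket()
bind(sock, ip"127.0.0.1", 0)   # port 0: let the OS pick a free port

requested = 33554432           # same value as the sysctl setting in this thread
default_size = recv_buffer_size(sock)
granted = recv_buffer_size(sock, requested)
println("default: $default_size bytes, requested: $requested, granted: $granted")
close(sock)
```

On Linux the reported value is doubled, as noted in the original snippet, so a granted size far below twice the requested one suggests the request was clamped by net.core.rmem_max. Other platforms may report an error instead of clamping.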

dlakelan · June 15, 2020, 1:16pm

1000pps is ridiculously low. I have a Raspberry Pi 4 that is doing QoS on my home network and easily handles 1Gbps. At 1500-byte MTU, that’s ~83000 packets per second. Of course it’s TCP in those tests, but the point is you are definitely nowhere near network hardware or inherent OS limits.

On the other hand, for example, the switching hardware may see DSCP tags on these packets indicating high priority and then deliver them with high priority… but high priority is sometimes limited to a tiny fraction of available bandwidth. So I was considering the idea that QoS was involved.
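The ~83000 packets per second figure quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch that ignores header overhead; `packets_per_second` is a hypothetical helper name.

```julia
# Packets per second needed to fill a link with fixed-size packets
# (header overhead ignored).
packets_per_second(link_bps, packet_bytes) = link_bps / (8 * packet_bytes)

full_mtu = packets_per_second(1e9, 1500)   # 1 Gbps with 1500-byte packets
small    = packets_per_second(1e9, 244)    # same link with 244-byte packets

println(round(Int, full_mtu))      # 83333
println(round(Int, small))         # 512295
println(1000 * 244 * 8 / 1e6)      # 1.952 (Mbit/s at 1000 packets/s of 244 bytes)
```

In other words, 1000 packets per second of 244 bytes is roughly 2 Mbit/s, far below what a 1 Gbps link can carry.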

pixel27 · June 15, 2020, 1:17pm

On Linux a time slice is usually around 100 ms (not sure what it is on a Mac). So if you are sending 1000 packets per second, you are basically waking up, sending a packet, sleeping for 1 ms, then sending another packet, which is probably playing hell with the scheduler.

Something I would try is spinning in a while loop until it’s time to send the next packet. Yes, you will use 100% of a core, but you shouldn’t have much scheduler overhead. A second thing to try is sending the packets in bursts of 100 or so: send 100, sleep for somewhat less than 100 ms, then send the next 100.
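The busy-wait suggestion could look roughly like the following sketch. `send_paced` is a hypothetical name, port 10000 and the 244-byte payload match the scripts earlier in the thread, and the function returns the achieved rate so that the real send rate can be compared with the requested one.

```julia
using Sockets

# Send `n` packets at a fixed rate, spinning instead of sleeping between packets.
# Returns the achieved rate in packets per second.
function send_paced(sock, target, port, data, packets_per_second, n)
    delta_ns = round(UInt64, 1e9 / packets_per_second)
    start = time_ns()
    next_send = start
    for _ in 1:n
        while time_ns() < next_send
            # busy-wait: burns a core but avoids timer and scheduler latency
        end
        send(sock, target, port, data)
        next_send += delta_ns   # absolute schedule, so delays do not accumulate
    end
    return n / ((time_ns() - start) / 1e9)
end

sock = UDPSocket()
rate = send_paced(sock, ip"127.0.0.1", 10000, rand(UInt8, 244), 20000, 2000)
println("achieved $(round(Int, rate)) packets/s")
close(sock)
```

Printing the achieved rate on the sender side makes it easier to tell whether a low count at the receiver comes from lost packets or from packets that were never sent.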

tamasgal · June 15, 2020, 1:19pm

Yes I see. I was also surprised and it really annoyed me for a few days (that’s why I desperately asked for help here).

I still do not understand why the default settings of my Linux machine yield such “poor performance”, and I also do not see how the buffer size is that closely related. I thought GC pauses might drop some packets, but not at these ridiculously low rates. I can handle TCP/IP data with Julia in realtime at rates orders of magnitude higher, but these low UDP packet rates (with a fixed size of 244 bytes) simply didn’t want to be processed.

tamasgal · June 15, 2020, 1:20pm

Ah, that makes more sense. I can try that to simulate the traffic on the local loopback. Thanks!

dlakelan · June 15, 2020, 1:33pm

Debian kernels were using 1000Hz timers until a few years ago and then switched to tickless, if I remember correctly. But even at old-school 100Hz the timeslice is only 10ms, not 100.

If the process can run every 10ms, then 1000pps × 0.01s = 10 packets, and 10 packets × 244 bytes ≈ 2.4kB of buffer. It seems weird that this would be a real limit.
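The same estimate can be written as a small function to try other rates and stall times. This is a rough sketch that counts payload bytes only and ignores any per-packet bookkeeping the kernel adds; `backlog_bytes` is a hypothetical name.

```julia
# Payload bytes that pile up in the socket buffer while the receiver is not running.
backlog_bytes(rate_pps, stall_s, packet_bytes) = rate_pps * stall_s * packet_bytes

println(round(Int, backlog_bytes(1000, 0.01, 244)))    # 2440
println(round(Int, backlog_bytes(20000, 0.01, 244)))   # 48800
```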

pixel27 · June 15, 2020, 1:45pm

Yes I thought it was around 10 or 20ms…but when I googled to be sure:

https://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice

They were saying 100ms, then digging into:

https://man7.org/linux/man-pages/man2/sched_rr_get_interval.2.html#NOTES

That says the Linux “quantum” is 0.1 seconds. On my machine:

[pixel27@devil ~]$ cat /proc/sys/kernel/sched_rr_timeslice_ms 
90

So maybe 90ms for me?

dlakelan · June 15, 2020, 2:03pm

That is for the real-time round-robin scheduler.

https://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice

gives more discussion. Most processes will be scheduled by the CFS scheduler. The default latency target there is 6ms.

So unless you have saturated all the cores with real-time scheduled tasks, it would be rare to have more than 10 or 20ms of latency for a well-behaved user process (not swapping etc.).

tamasgal · June 15, 2020, 2:25pm

I am still confused.

After setting net.core.rmem_max=33554432, I am able to receive and process all the UDP packets (at a rate of > 1000Hz) on our DAQ system (before, I had significant loss and only got 600Hz), but I still fail to do so on my own machine.

I set the same net.core.rmem_max value but I still can only receive ~900Hz while sending with a rate of 20kHz (using the scripts above, and also setting the socket buffer size to 33554432).

I don’t see the connection to the scheduler yet.
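One possible link to the sleep discussion: Julia's documentation for `sleep` states a minimum sleep time of 1 millisecond, so a sender loop that sleeps after every packet may be capped near 1000 iterations per second no matter what rate is requested. The sketch below is not from the thread and is only a way to test that assumption on a given machine; `mean_sleep` is a hypothetical helper that measures how long a requested 50 µs sleep really takes.

```julia
# Time how long `sleep(requested)` really takes, averaged over `n` calls.
function mean_sleep(requested, n)
    start = time_ns()
    for _ in 1:n
        sleep(requested)
    end
    return (time_ns() - start) / 1e9 / n
end

requested = 1 / 20000                 # 50 µs, the pause for 20000 packets/s
actual = mean_sleep(requested, 200)
println("requested $(round(requested * 1e6)) µs, got about $(round(actual * 1e6)) µs per sleep")
println("upper bound on loop rate: about $(round(Int, 1 / actual)) iterations/s")
```

If the measured bound comes out close to the observed receive rate, the receiver may simply be counting everything the sender actually managed to send.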

dlakelan · June 15, 2020, 3:02pm

The machine you’re testing on, is it on the same network as the DAQ system? What network is it on? It sounds like you’re using a mac laptop for testing? Are you on wifi?

tamasgal · June 15, 2020, 3:09pm

The machine I am testing on is completely separated from the DAQ system. I am using 127.0.0.1 to send and receive, so it’s the local loopback device.

Edit: I am literally just running the two scripts above, on the same machine.

dlakelan · June 15, 2020, 4:04pm

!!!

Hunh. Clearly that rules out all network hardware, and it does seem like it’s probably a kernel limitation. But are you running Linux or MacOS on this machine? (Never mind, I see the screenshot is clearly MacOS.) If it’s MacOS, it’s hard to know what you could do.

I can confirm that with your two original scripts, receiving for 10 seconds, I only get 7000 packets or so, on a Linux x86 machine with plenty of rmem_max (50MB).

tamasgal · June 15, 2020, 7:08pm

I tried on both machines. On macOS the receiving limit is around 600Hz, on my Linux around 900Hz.

You can simply try that on your own machine. I really have no clue
