More on Debian Jessie/Ubuntu Trusty packet capture woes

Back in September I wrote about a problem we’d come across when capturing traffic with pcap_dispatch() or pcap_next_ex() on Ubuntu Trusty or Debian Jessie. When the traffic was slow, we saw packets not being captured.

We’ve since done a bit more digging. The problem, we think, is a bug in the Linux kernel select() system call. With both pcap_dispatch() and
pcap_next_ex() we’re using a central loop that is basically:

 select(pcapfd, timeout);

The length of timeout in the select() call shouldn’t matter. But it does. In our test scenario, set it to 1ms and every packet in a ping to an otherwise idle network connection will be captured. Set it to 2s and most or all will be missed.

Robert Edmonds has suggested that it’s this kernel bug. Thanks, Robert – that looks like the problem to us. This was fixed in kernel 3.19. We’ve filed a Debian bug and a Ubuntu bug.

So, what can you do about it for now?

  • If using Ubuntu Trusty, consider switching to the LTS Enablement Stack. This has the fix applied.
  • If using Debian Jessie, consider switching to a 4.9 series kernel from Jessie backports,
  • Otherwise consider reducing the timeout in your call to select(). As noted above, this certainly improves the situation for our specific test scenario. However, we can’t be confident that it is a definitive fix; make sure you test your particular circumstances.

Packet capture woes with libpcap on Ubuntu Trusty and Debian Jessie

Usually when you’re using libpcap to capture network traffic, your chief worry will be whether or not your application will keep up with the flow of traffic.

Today, though, I’ve stubbed my toe on a problem with traffic that’s too slow. It happens with both Ubuntu Trusty and Debian Jessie. If there’s a gap between packets of more than about 50 milliseconds, the first packet to arrive after the gap will be dropped and you’ll never see it. I was capturing DNS queries and responses, and found that with a query rate of under 20 queries per second you start dropping queries. By the time you’re down to 15 queries per second, nearly every query is dropped.

After spotting that tcpdump doesn’t have this problem, and much experimentation later, it’s not quite as simple as that. Whether or not you drop packets depends on the libpcap API you are using. If you’re using pcap_loop() to capture packets, you can stop worrying. This works properly. I guess that tcpdump is using pcap_loop() to capture packets and that’s why it works.

If, on the other hand, you’re using pcap_dispatch() or pcap_next_ex(), as the documentation urges you to do, than you’re doomed. This is regardless of whether you are using blocking or non-blocking mode.

So, what can you do? Your choices are limited.

  1. Switch your application to using pcap_loop(). If you were using non-blocking mode with either pcap_dispatch() or pcap_next_ex(), this will be non-trivial, as pcap_loop() doesn’t observe non-blocking, but always blocks. It won’t be straightforward either if you’re using pcap_next_ex() in your own loop.
  2. Upgrade. The problem is fixed if you upgrade Ubuntu to Xenial. I also found the problem apparently fixed by updating Jessie to the 4.7.0 kernel in Debian Backports.