Compressing pcap files

Here at Sinodun Towers, we’re often dealing with pcap DNS traffic capture files created on a far distant server. These files need to be compressed, both to save space on the server and to speed transfer to our machines.

Traditionally we’ve used gzip for quick but basic compression, and xz when space mattered more than CPU and we really needed the best compression we could get. At a recent conference, though, an enthusiastic Facebook employee suggested we take a look at zstd. We’d looked at it quite some time ago, but our contact said it had improved considerably since then. So we thought we’d compare the latest zstd (1.2.0) with current gzip (1.8) and xz (5.2.3) and see how they stack up on pcap DNS traffic captures.

Compressing

We took what is (for us) a big file, a 662MB DNS traffic capture, and timed compressing it at every compression level offered by each compressor. We did three timed runs for each and averaged the times. Here are the results; each point on the graph is a compression level.
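For reference, here’s a minimal sketch of the kind of timing harness this implies, written around the standard command-line tools. The helper and the file name capture.pcap are our illustration rather than the exact script we used; the three-run averaging matches what we did.

    import subprocess
    import time

    def time_compress(tool, level, infile, runs=3):
        """Average wall-clock time to compress infile at one level."""
        total = 0.0
        for _ in range(runs):
            start = time.monotonic()
            # -k keeps the input file; -f overwrites any previous output
            subprocess.run([tool, f"-{level}", "-k", "-f", infile], check=True)
            total += time.monotonic() - start
        return total / runs

    # e.g. sweep the standard zstd levels (1..19)
    for level in range(1, 20):
        print("zstd", level, time_compress("zstd", level, "capture.pcap"))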

zstd turns in an impressive performance. At lower compression levels it’s both notably quicker than gzip and far more effective at compressing pcap DNS traffic captures. In the time gzip takes to compress the input pcap to 25% of its original size, zstd gets it down to 10%. Put another way, at similar runtimes the compressed file in our test is 173MB with gzip versus 65MB with zstd.

zstd is also competitive with xz at higher compression levels, though there xz retains a slight lead in both file size and runtime.

Decompressing

Of course, being able to compress is only half the problem. If you’re collecting data from a fleet of servers and bringing that data back to a central system for analysis, you may well find that decompressing your files becomes your main bottleneck. So we also checked decompression times.
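Timing decompression is the same exercise in reverse; a sketch along the lines of the helper above (again, the file names are illustrative):

    import subprocess
    import time

    def time_decompress(tool, infile, runs=3):
        """Average wall-clock time to decompress infile."""
        total = 0.0
        for _ in range(runs):
            start = time.monotonic()
            # -d decompresses; -k keeps the compressed input, -f overwrites output
            subprocess.run([tool, "-d", "-k", "-f", infile], check=True)
            total += time.monotonic() - start
        return total / runs

    print("gzip", time_decompress("gzip", "capture.pcap.gz"))
    print("zstd", time_decompress("zstd", "capture.pcap.zst"))
    print("xz", time_decompress("xz", "capture.pcap.xz"))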

There’s little to choose between zstd and gzip at any compression level, while xz generally lags.

Resource usage

So, if zstd gives better compression in similar times, what does it cost over gzip? The short answer is memory. Our measurements show that while gzip has much the same working set size regardless of compression level, zstd’s working set starts an order of magnitude larger and grows with compression level; by the time zstd is competing with xz, its working set is up to nearly 3x the size of xz’s.
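One way to take such measurements is to wrap each run in GNU time, which reports peak resident set size with -v. A sketch, assuming GNU time is installed at /usr/bin/time; this illustrates the approach rather than being our exact tooling:

    import re
    import subprocess

    def peak_rss_kb(cmd):
        """Peak resident set size, in kB, of a single compression run."""
        # GNU time -v prints "Maximum resident set size (kbytes): N" on stderr
        proc = subprocess.run(["/usr/bin/time", "-v"] + cmd,
                              capture_output=True, text=True, check=True)
        match = re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                          proc.stderr)
        return int(match.group(1))

    print("gzip -9:", peak_rss_kb(["gzip", "-9", "-k", "-f", "capture.pcap"]))
    print("zstd -19:", peak_rss_kb(["zstd", "-19", "-k", "-f", "capture.pcap"]))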

That being said, by modern standards gzip’s working set size is absolutely tiny, comparable to a simple ls command. You can very probably afford to use zstd. As ever with resource usage, you pays your money and you takes your choice.

Conclusion

It seems to us that if you’re currently using gzip to compress pcap DNS traffic captures, you should definitely look at switching to zstd. If you’re going for higher compression and currently using xz, the choice is less clear-cut, and depends on the compression level you’re using.

A note on the comparison

We generated the above numbers using the standard command-line tools for each compressor. Some capture tools like to build compression into their data pipeline, typically by passing raw data through a compression library before writing it out. While attractive for some use cases, we’ve found that at higher compression settings the compression risks becoming the processing bottleneck. If server I/O load is not an issue (and for many dedicated DNS servers it is not), we prefer to write temporary uncompressed files and compress each once it is complete. Given sufficient cores, this lets you parallelise compression and employ much more expensive – but effective – compression than would be possible with inline compression.
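A sketch of that batch approach (the directory layout, compression level and worker count are illustrative; zstd’s --rm option removes each input file only once it has been compressed successfully):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    def compress(path):
        # High-level zstd on a completed capture; --rm deletes the
        # uncompressed original only after compression succeeds.
        subprocess.run(["zstd", "-19", "--rm", str(path)], check=True)

    def compress_completed(capture_dir, workers=4):
        # Hypothetical layout: completed (no longer written-to) captures
        # sit in capture_dir with a .pcap suffix.
        files = sorted(Path(capture_dir).glob("*.pcap"))
        # Threads suffice here: each worker just waits on a subprocess.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(compress, files))

    compress_completed("/var/captures")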
