Sunday, February 13, 2011

Large Scale Packet Dump Analysis with MongoDB

When you're dealing with hundreds of MBs (or even several GBs) of packet captures spread across dozens of files, Wireshark sort of breaks down, even with a fast CPU. In my case I have a laptop with a LUKS-encrypted drive, so it is pretty slow. Yeah, you can split the captures into smaller files, but then you lose visibility into the complete picture when performing your queries. I think you could also write some Lua scripts for Wireshark, but you would still have the bottleneck of tshark. So what to do? Let's back up a bit.

Over the years I've written a variety of Perl, Python, and Ruby scripts for processing .pcaps. Some use C (or pure Python) .pcap parsers; when I first started doing this over a decade ago, I just parsed the output of tcpdump and built hashes or dictionaries. Not only is this slow, but you have to persist your hashes via pickling. And the challenge with the pcap libraries is that they typically don't have any application-layer decoding, and they require a C version of the library, which isn't very cross-platform.

Enter PDML, the XML packet description format that Wireshark can export. I first described this in a Digital Bond blog post back in 2006. I can remember doing this on a lowly PowerBook G4 and it just worked. A year ago I was looking for a project to use MongoDB, so I wrote some code to automate the process of creating the .pdml files using Wireshark, extracting the fields of interest, and inserting them into a MongoDB database. I have a configuration file that specifies which PDML fields I want to extract.

type = decimal

size = 8
type = decimal

size = 4
type = ipaddr

size = 4
type = ipaddr

size = 16
type = decimal

Because MongoDB doesn't support periods in key names (I learned this the hard way last year), I change the name of each field, for example from ip.src to ip_src. Anything that Wireshark knows about I can extract, and it becomes a key for that packet.
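To make the extraction step concrete, here is a minimal sketch of pulling selected fields out of PDML and renaming the keys. It is not my actual import script: iter_packets and the SAMPLE document are hypothetical, the values are kept as the strings found in each field's show attribute, and the type conversion hinted at in the configuration file is left out.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written PDML document, for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<pdml>
  <packet>
    <proto name="ip">
      <field name="ip.ttl" show="64" size="1"/>
      <field name="ip.src" show="10.0.0.1" size="4"/>
      <field name="ip.dst" show="10.0.0.2" size="4"/>
    </proto>
  </packet>
</pdml>"""


def iter_packets(source, wanted):
    """Yield one dict per <packet>, keeping only the PDML fields named in `wanted`."""
    # iterparse streams the file, so a multi-GB .pdml never has to fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "packet":
            continue
        doc = {}
        for field in elem.iter("field"):
            name = field.get("name")
            if name in wanted and "show" in field.attrib:
                # Avoid periods in key names: ip.src becomes ip_src.
                # setdefault keeps the first occurrence if a field repeats.
                doc.setdefault(name.replace(".", "_"), field.get("show"))
        yield doc
        elem.clear()  # drop the finished packet's children to free memory


if __name__ == "__main__":
    for doc in iter_packets(io.BytesIO(SAMPLE), {"ip.src", "ip.ttl"}):
        print(doc)
```

Each yielded dict is shaped like a document that could be inserted into the database, one per packet.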

While I was still importing packets (I had close to 2 million packets in the database), I issued a query to see the unique source IPs. I can do this for any field that is supported in the PDML.

>> db.raw.distinct("ip_src")

Sun Feb 13 11:20:05 [conn17] query art0.$cmd ntoreturn:1 command: { distinct: "raw", key: "ip_src", query: {} } reslen:4457 1607ms

Or let's say I wanted to look at the unique TTLs.

>> db.raw.distinct("ip_ttl")

11:31:17 [conn17] query art0.$cmd ntoreturn:1 command: { distinct: "raw", key: "ip_ttl", query: {} } reslen:452 1779ms

Not bad for 2.7 million packets (and counting).
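The same distinct queries can also be issued from Python. This is only a sketch, assuming the pymongo driver is installed and MongoDB is listening on localhost (both are assumptions); the database name art0 and the collection name raw come from the log lines above, and unique_values and open_raw_collection are hypothetical helper names.

```python
def unique_values(collection, field):
    """Return the sorted distinct values of `field`.

    `collection` is any object with a pymongo-style distinct(key) method.
    Assumes all values of the field share one sortable type.
    """
    return sorted(collection.distinct(field))


def open_raw_collection(uri="mongodb://localhost:27017"):
    """Open art0.raw (names taken from the server log); the URI is an assumption."""
    from pymongo import MongoClient  # third-party driver, imported lazily
    return MongoClient(uri)["art0"]["raw"]


# Usage against a live server:
#   raw = open_raw_collection()
#   print(unique_values(raw, "ip_src"))
#   print(unique_values(raw, "ip_ttl"))
```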

This example is pretty uninteresting because it is just standard TCP/IP headers, and if you only wanted session data you could use NetFlow, but this is far more flexible.

The downside is the speed of import. Creating the .pdml file by running tshark is very slow, and parsing XML in Python is also not the speediest. I'm up to close to 3.1 million packets successfully imported into my database in about an hour, but once they are in, queries are lightning fast and you are free to ask whatever you want.

Where I'm (hopefully) headed today is some scripts that will create Graphviz representations of all the communications of interest, perhaps like those available with AfterGlow. Or I can use this to analyze and reconstruct streams from some of the proprietary protocols that I was most interested in. Or I can use this as an exercise to write a Node.js app to browse this data. The point is getting the data into a useful database that allows flexible and fast queries and offloads a lot of manual tasks I would normally have to do.
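As a rough idea of what the Graphviz step might look like, here is a minimal sketch that turns packet documents into a DOT digraph with one edge per source/destination pair, labeled with the packet count. Nothing here is written yet on my side: edges_to_dot is a hypothetical name, and the ip_dst key is an assumption that follows the same naming as ip_src.

```python
from collections import Counter


def edges_to_dot(docs, src_key="ip_src", dst_key="ip_dst"):
    """Build a Graphviz DOT digraph from an iterable of packet dicts."""
    counts = Counter()
    for doc in docs:
        src, dst = doc.get(src_key), doc.get(dst_key)
        if src is None or dst is None:
            continue  # skip packets without both addresses (non-IP traffic, say)
        counts[(src, dst)] += 1

    lines = ["digraph conversations {"]
    for (src, dst), n in sorted(counts.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{n}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"ip_src": "10.0.0.1", "ip_dst": "10.0.0.2"},
        {"ip_src": "10.0.0.1", "ip_dst": "10.0.0.2"},
        {"ip_src": "10.0.0.2", "ip_dst": "10.0.0.1"},
    ]
    print(edges_to_dot(sample))
```

The resulting text can be saved to a file and rendered with the Graphviz dot tool.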