Sunday, January 10, 2010

Dummies Guide to MongoDB Queries using Squid Logs (JavaScript Shell Edition)

So the MongoDB develop documentation is actually pretty decent, but it doesn't really use examples with real data. For me, it made it more difficult for some of the API and shell commands to sink in.

So to generate some real world queries I created a python script that parsed the access.log file[s] generated by squid. I'll follow this blog with one that covers pymongo but I think this will be helpful, and like most of the posts will provide a good reference because when you are rapidly approaching 40 not only your eyes go, but your memory. So here goes...

First of all this assumes you are running the mongo JavaScript shell and yeah I know running from root is a bad idea and not even necessary (I don't think) but sue me.

root@opti620:~/mongodb# ./bin/mongo
MongoDB shell version: 1.2.1
url: test
connecting to: test
type "help" for help
> show dbs
> use mongosquid
switched to db mongosquid
> show collections

Now let's have some fun. This was actually when I just imported a few lines in from the log file so there are a relatively small number of documents. A collection is essentially like a table but since this is #nosql it really isn't a table. It is just collection of documents. We'll see those next.

> db.raw.find().count()
> db.raw.find()[1029]
> db.raw.find()[1028]
"_id" : ObjectId("4b496cddb15cb004a4000404"),
"squidcode" : "TCP_MISS/200",
"source" : "",
"stamp" : 1263102993.841,
"format" : "-",
"url" : "",
"method" : "CONNECT",
"size" : 17499

The JSON above is the "document." Something you'll notice is there are two different data types basically strings and floating points. The size field and timestamp are obviously floats. That hash looking thing is actually a hash or GUID that is supposedly unique.

So one of the cool built in queries is to return only the unique values for a given field. This is handled by the distinct method.

So we can see here that there were HTTP Posts.

> db.raw.distinct("method")
[ "CONNECT", "GET" ]

And because of my screwed up natting I can't tell which of my kids was going to netflix.

> db.raw.distinct("source")
[ "" ]

> db.raw.distinct("url")

So remember when I discussed types above, if we wanted to retrieve all the transactions that were greater than 1MB we could do the following, but there are obviously more to it than that.

> db.raw.find( {size: { $gt:1000000}} )
{ "_id" : ObjectId("4b496cddb15cb004a4000162"), "squidcode" : "TCP_MISS/200", "source" : "", "stamp" : 1263097489.996, "format" : "-", "url" : "", "method" : "GET", "size" : 1008478 }
{ "_id" : ObjectId("4b496cddb15cb004a40003b0"), "squidcode" : "TCP_MISS/200", "source" : "", "stamp" : 1263099100.207, "format" : "-", "url" : "", "method" : "GET", "size" : 1008478 }

I was pleased to find that you can use regular expressions. The first query tells me there are 3199 documents that have port 443 in them and the 2nd query returns the first document. One of the things I noticed is that retrieving the document based on the "index" is really really slow. But I believe that is because it isn't really an index, but we'll get to them later.

> db.raw.find ( { url: /:443/ }).count()
> db.raw.find ( { url: /:443/ })[0]
"_id" : ObjectId("4b496cddb15cb004a4000093"),
"squidcode" : "TCP_MISS/200",
"source" : "",
"stamp" : 1263096929.091,
"format" : "-",
"url" : "",
"method" : "CONNECT",
"size" : 96222
> db.raw.find ( { url: /:443/ })[0:3]
Sun Jan 10 01:16:11 JS Error: SyntaxError: missing ] in index expression (shell):0

You'll notice that array slices don't work, but they do in Python, obviously which I'll blog on next.

No comments: