Sunday, January 10, 2010

PyMongo for Dummies (using Squid logs, again)

In my last post I showed some examples from the MongoDB shell. Next, we'll go through the PyMongo API, since only crazy people code in JavaScript.

In [3]: c = pymongo.Connection("192.168.169.62")
In [4]: db = c.mongosquid
In [5]: raw = db.raw
In [6]: raw
Out[6]: Collection(Database(Connection('192.168.169.62', 27017), u'mongosquid'), u'raw')

We could also have referred to our collection as db["raw"], or as db[coll] if the collection name is held in a variable.

In [7]: raw.count()
Out[7]: 205339

You can find out which collections belong to the database with the collection_names() method.

In [40]: db.collection_names()
Out[40]: [u'raw', u'system.indexes']
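If you only care about your own collections, you can filter out the internal ones. A minimal sketch, assuming internal collections are the ones whose names start with "system." (as system.indexes does above); user_collections is a hypothetical helper, not part of PyMongo.

```python
def user_collections(names):
    """Return the collection names that do not start with 'system.'."""
    # Assumption: MongoDB's internal collections live under the "system." prefix.
    return [name for name in names if not name.startswith("system.")]


# The list returned by db.collection_names() in the session above.
print(user_collections(["raw", "system.indexes"]))
```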

The find_one() method allows you to quickly inspect your collection and take a peek at a sample document.

In [10]: raw.find_one()

Out[10]:

{u'_id': ObjectId('4b496cddb15cb004a4000000'), u'format': u'-', u'method': u'GET', u'size': 824477.0, u'source': u'192.168.1.254', u'squidcode': u'TCP_MISS/200', u'stamp': 1263096815.7609999, u'url': u'http://netflix086.as.nflximg.com.edgesuite.net/sa0/166/1680180166.wmv/range/660083845-660907844?'}
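The stamp field is a Unix timestamp (seconds since the epoch, with a fractional part), which is how Squid's native log format records time. Here is a small sketch for turning it into something readable. stamp_to_utc is a made-up helper name, and the code is written for Python 3 rather than the Python 2 used in the session.

```python
from datetime import datetime, timezone


def stamp_to_utc(stamp):
    """Convert an epoch timestamp (float seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


# The stamp from the sample document above.
when = stamp_to_utc(1263096815.7609999)
print(when.strftime("%Y-%m-%d %H:%M:%S"))
```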

The distinct() method does have some limitations, as I discovered the hard way and as you can see from this exception.

In [13]: raw.distinct("stamp")
---------------------------------------------------------------------------
OperationFailure                          Traceback (most recent call last)

/root/

/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/collection.pyc in distinct(self, key)

/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/cursor.pyc in distinct(self, key)

/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/database.pyc in _command(self, command, allowable_errors, check, sock)

OperationFailure: command SON([('distinct', u'raw'), ('key', 'stamp')]) failed: assertion: distinct too big, 4mb cap
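One way around the cap is to do the de-duplication on the client by walking the documents yourself. This is only a sketch, assuming the distinct values fit in memory and are hashable. distinct_values is a hypothetical helper that accepts any iterable of documents, so a cursor should work as well as the plain list used here.

```python
def distinct_values(docs, key):
    """Collect the distinct values of `key` from an iterable of dict-like documents."""
    seen = set()
    for doc in docs:
        # Skip documents that do not have the key at all.
        if key in doc:
            seen.add(doc[key])
    return seen


# Stand-in documents shaped like the Squid records above.
sample = [
    {"method": "GET", "squidcode": "TCP_MISS/200"},
    {"method": "GET", "squidcode": "TCP_DENIED/403"},
    {"method": "POST", "squidcode": "TCP_MISS/200"},
]
print(sorted(distinct_values(sample, "squidcode")))
```

Keep in mind that this pulls every document over the wire, so it trades the server-side cap for network traffic and client memory.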

So in my previous post (using JavaScript) I introduced queries, but you really can't do anything useful without a cursor. If you've done any MySQL coding before, you should be familiar with the concept: it lets you iterate through the results of a query.

Here we have the same expressions, but in Python you obviously need to quote the $gt operator.

In [29]: c = raw.find( {'stamp': { "$gt": 1263096815 }})
In [31]: c.count()
Out[31]: 2060


and

In [23]: c = raw.find({'squidcode':'TCP_DENIED/403'})
In [24]: c.count()

Out[24]: 2999
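Since a query is just a Python dict, you can build it up in code. A minimal sketch: build_query is a hypothetical helper that combines the two filters above, and epoch is a hypothetical helper that converts a UTC datetime to the kind of number stored in stamp. A dict with several keys matches only documents that satisfy all of them.

```python
import calendar
from datetime import datetime


def epoch(dt):
    """Seconds since the Unix epoch for a naive datetime interpreted as UTC."""
    return calendar.timegm(dt.timetuple())


def build_query(since=None, squidcode=None):
    """Build a filter dict for find(); arguments left as None are omitted."""
    query = {}
    if since is not None:
        query["stamp"] = {"$gt": since}
    if squidcode is not None:
        query["squidcode"] = squidcode
    return query


# Denied requests after 2010-01-10 04:13:35 UTC.
query = build_query(since=epoch(datetime(2010, 1, 10, 4, 13, 35)),
                    squidcode="TCP_DENIED/403")
print(query)
```

The resulting dict can then be passed to raw.find() just like the hand-written ones.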


For the sake of this exercise, we only want to see 3 results so we call the limit() method.

In [26]: c.limit(3)

Out[26]:


Now we can iterate through the results of our query.

In [27]: for e in c:
   ....:     print e
   ....:
   ....:

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262520969.721, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a6000000'), u'method': u'GET', u'size': 1419.0}

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262521126.928, u'source': u'192.168.1.254', u'url': u'http://www.msn.com/', u'_id': ObjectId('4b496ea4b15cb004a600003e'), u'method': u'GET', u'size': 1395.0}

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262521127.654, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a600003f'), u'method': u'GET', u'size': 1419.0}

So if we try again, what happens?

In [28]: for e in c:
   ....:     print e
   ....:
   ....:

Nada. We have to rewind the cursor object to be able to iterate again.

In [30]: c.rewind()
Out[30]:
In [31]: for e in c:
   ....:     print e
   ....:
   ....:

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262520969.721, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a6000000'), u'method': u'GET', u'size': 1419.0}
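If the result set is small, an alternative to rewinding is to copy the results into a list once and loop over that as often as you like. A sketch, using a plain Python iterator as a stand-in for the cursor, since it is likewise used up after one pass:

```python
# Stand-in for a cursor: any iterator is exhausted after one pass.
cursor = iter([{"url": "http://www.bing.com/favicon.ico"},
               {"url": "http://www.msn.com/"}])

results = list(cursor)   # the one and only pass over the cursor
leftover = list(cursor)  # nothing left, just like the second loop above

# The list, unlike the cursor, can be looped over repeatedly.
for doc in results:
    print(doc["url"])
for doc in results:
    print(doc["url"])
```

This holds every document in memory, so it only makes sense together with limit() or a selective query.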

You can also step through the results manually by calling next() (cr below is a cursor for the same query).

In [51]: cr.next()

Out[51]:

{u'_id': ObjectId('4b496ea4b15cb004a6000000'), u'format': u'-', u'method': u'GET', u'size': 1419.0, u'source': u'192.168.1.254', u'squidcode': u'TCP_DENIED/403', u'stamp': 1262520969.721, u'url': u'http://www.bing.com/favicon.ico'}

In [52]: result = cr.next()


Guess what: your limit still applies. If you want to clear it, call cr.rewind() and then cr.limit(0); after that you can iterate manually with cr.next().
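Stepping past the last result raises StopIteration, so when iterating by hand it can help to use the built-in next() with a default, or itertools.islice to grab a few documents without touching the cursor's limit. A sketch for Python 3, with a plain iterator standing in for cr; take is a hypothetical helper:

```python
from itertools import islice


def take(cursor, n):
    """Return up to n documents, starting from the cursor's current position."""
    return list(islice(cursor, n))


# Stand-in for a cursor over three documents.
cr = iter([{"size": 1419.0}, {"size": 1395.0}, {"size": 1419.0}])

first_two = take(cr, 2)    # consumes two documents
third = next(cr, None)     # the one remaining document
nothing = next(cr, None)   # exhausted: returns None instead of raising
```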
