Wednesday, January 13, 2010

Hello MongoDB (Jython Style)

It has been ages since I've played around with any of the Java scripting languages, so I thought I'd give Jython a spin with MongoDB. I have no idea how the pure Python driver compares with the Java driver on performance, but it would be an interesting benchmark.

This is a very quick code snippet based on the MongoDB Java tutorial.

This was done on Ubuntu 9.10 with the OpenJDK from the standard repositories and assumes the jython shell script is in your path. It also assumes the MongoDB Java driver jar (mongo-1.2.jar) is in the current directory; I was lazy, so I appended it to sys.path instead of bothering with CLASSPATH.

#!/usr/bin/env jython
import sys
sys.path.append("mongo-1.2.jar")
from com.mongodb import *
print "Jython MongoDB Example"
m = Mongo("10.0.0.33")
db = m.getDB("grid_example")

for c in db.getCollectionNames():
    print c

And the output is just what you'd expect.

mfranz@karmic-t61:~/Documents/mongo$ ./jymongo.py
Jython MongoDB Example
fs.chunks
fs.files
system.indexes

Avoiding Bracket Hell in MongoDB Queries (Python Style)

It wasn't immediately obvious to me from the MongoDB Advanced Query documentation that you can string together multiple operators to perform existence, membership, and greater-than tests. JSON can get very messy (and long!), and the Python syntax is slightly different from the JavaScript in the documentation, so instead of passing a literal directly to the find method of your collection, build a dictionary and assign the various conditions to it one at a time.

For example:

myq = {}
myq["batchstamp"] = b # a timestamp
myq["modbus_tcp_reference_num"] = {"$exists": True}
cur = coll.find( myq )

Although it doesn't appear much easier than passing

{'modbus_tcp_reference_num': {'$exists': True}, 'batchstamp': 999999999}

Once you start adding additional conditions (which may themselves contain dictionaries), it is much easier and less error-prone. Trust me!
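To make the point concrete, here is a minimal sketch of building up a larger query one key at a time. The build_query helper and the size and method fields are hypothetical, borrowed purely for illustration; $exists, $gt, and $in are standard MongoDB query operators.

```python
def build_query(batchstamp, min_size=None, methods=None):
    """Assemble a find() filter dictionary one condition at a time."""
    query = {}
    query["batchstamp"] = batchstamp                        # exact match
    query["modbus_tcp_reference_num"] = {"$exists": True}   # existence test
    if min_size is not None:
        query["size"] = {"$gt": min_size}                   # greater-than test (hypothetical field)
    if methods:
        query["method"] = {"$in": list(methods)}            # membership test (hypothetical field)
    return query


myq = build_query(999999999, min_size=1000, methods=["GET", "POST"])
print(myq)
```

The resulting dictionary can be handed to coll.find(myq) exactly like the shorter one above, and optional conditions simply never get added when they are not needed.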


Sunday, January 10, 2010

PyMongo for Dummies (using Squid logs, again)

In my last blog I showed some examples from the MongoDB shell. Next, we'll go through the PyMongo API, since only crazy people code in JavaScript.

In [3]: c = pymongo.Connection("192.168.169.62")
In [4]: db = c.mongosquid
In [5]: raw = db.raw
In [6]: raw
Out[6]: Collection(Database(Connection('192.168.169.62', 27017), u'mongosquid'), u'raw')

We could have also referred to our collection as db["raw"] or db[coll] if you needed to define the collection in a variable.

In [7]: raw.count()
Out[7]: 205339

You can find out which collections belong to the database with the collection_names() method.

In [40]: db.collection_names()
Out[40]: [u'raw', u'system.indexes']

The find_one() method allows you to quickly inspect your collection and take a peek at a sample document.

In [10]: raw.find_one()

Out[10]:

{u'_id': ObjectId('4b496cddb15cb004a4000000'), u'format': u'-', u'method': u'GET', u'size': 824477.0, u'source': u'192.168.1.254', u'squidcode': u'TCP_MISS/200', u'stamp': 1263096815.7609999, u'url': u'http://netflix086.as.nflximg.com.edgesuite.net/sa0/166/1680180166.wmv/range/660083845-660907844?'}

The distinct() method does have some limitations, as I discovered the hard way and as you can see from this exception.

In [13]: raw.distinct("stamp")
---------------------------------------------------------------------------
OperationFailure                          Traceback (most recent call last)
/root/
/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/collection.pyc in distinct(self, key)
/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/cursor.pyc in distinct(self, key)
/usr/lib/python2.4/site-packages/pymongo-1.3-py2.4-linux-i686.egg/pymongo/database.pyc in _command(self, command, allowable_errors, check, sock)

OperationFailure: command SON([('distinct', u'raw'), ('key', 'stamp')]) failed: assertion: distinct too big, 4mb cap
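One way around the cap is to collect the unique values on the client instead of asking the server for them. The sketch below is only an illustration under that assumption: distinct_values is a hypothetical helper, and it works on any iterable of dictionaries, so a small stand-in list is used here in place of a real cursor.

```python
def distinct_values(docs, key):
    """Collect the unique values of `key` from an iterable of documents."""
    seen = set()
    for doc in docs:
        if key in doc:          # skip documents that lack the field
            seen.add(doc[key])
    return sorted(seen)


# Stand-in data; with PyMongo you would pass a cursor such as raw.find() instead.
sample = [
    {"stamp": 1262520969.721, "method": "GET"},
    {"stamp": 1262521126.928, "method": "GET"},
    {"stamp": 1262520969.721, "method": "CONNECT"},
]
print(distinct_values(sample, "stamp"))
```

The trade-off is that every document travels over the wire and the set lives in client memory, so on a collection of a couple hundred thousand documents this will be slow compared with a server-side distinct.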

So in my previous blog (using JavaScript) I introduced queries but you really can't do anything useful without using a cursor. If you've ever done any MySQL coding before you should be familiar with the concept. Basically it allows you to iterate through the results of a query.

Here we have the same expressions, but in Python you obviously need to quote the $gt operator.

In [29]: c = raw.find( {'stamp': { "$gt": 1263096815 }})
In [31]: c.count()
Out[31]: 2060


and

In [23]: c = raw.find({'squidcode':'TCP_DENIED/403'})
In [24]: c.count()

Out[24]: 2999
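The stamp values are seconds since the Unix epoch, which is awkward to type by hand. As a rough sketch, a small helper can turn a date into the $gt filter used above; newer_than is a hypothetical name, and the datetime is assumed to be in UTC.

```python
import calendar
from datetime import datetime


def newer_than(when_utc):
    """Build a filter matching documents whose 'stamp' is later than when_utc.

    when_utc is a naive datetime that is interpreted as UTC.
    """
    epoch_seconds = calendar.timegm(when_utc.timetuple())
    return {"stamp": {"$gt": epoch_seconds}}


query = newer_than(datetime(2010, 1, 10, 4, 13, 35))
print(query)
```

The dictionary it returns has the same shape as the one passed to raw.find() in the stamp example.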


For the sake of this exercise, we only want to see 3 results so we call the limit() method.

In [26]: c.limit(3)

Out[26]:


Now we can iterate through the results of our query.

In [27]: for e in c:
   ....:     print e
   ....:
   ....:

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262520969.721, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a6000000'), u'method': u'GET', u'size': 1419.0}

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262521126.928, u'source': u'192.168.1.254', u'url': u'http://www.msn.com/', u'_id': ObjectId('4b496ea4b15cb004a600003e'), u'method': u'GET', u'size': 1395.0}

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262521127.654, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a600003f'), u'method': u'GET', u'size': 1419.0}

So if we try again, what happens?

In [28]: for e in c:
   ....:     print e
   ....:
   ....:

Nada. The cursor has already been consumed, so we have to rewind it to be able to iterate again.

In [30]: c.rewind()
Out[30]:
In [31]: for e in c:
   ....:     print e
   ....:
   ....:

{u'squidcode': u'TCP_DENIED/403', u'format': u'-', u'stamp': 1262520969.721, u'source': u'192.168.1.254', u'url': u'http://www.bing.com/favicon.ico', u'_id': ObjectId('4b496ea4b15cb004a6000000'), u'method': u'GET', u'size': 1419.0}
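If the empty second loop seems surprising, a plain Python iterator behaves the same way. This is only an analogy using a list of made-up stand-in documents, not a description of how PyMongo implements cursors.

```python
docs = [{"url": "http://www.bing.com/favicon.ico"}, {"url": "http://www.msn.com/"}]

it = iter(docs)                        # plays the role of the cursor
first_pass = [d["url"] for d in it]    # consumes every item
second_pass = [d["url"] for d in it]   # nothing left, so this list is empty

it = iter(docs)                        # starting over, loosely what rewind() does
third_pass = [d["url"] for d in it]

print(len(first_pass), len(second_pass), len(third_pass))
```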

You can also manually iterate through the results by calling next() on a cursor (named cr in this session).

In [51]: cr.next()

Out[51]:

{u'_id': ObjectId('4b496ea4b15cb004a6000000'),
u'format': u'-', u'method': u'GET', u'size': 1419.0, u'source': u'192.168.1.254', u'squidcode': u'TCP_DENIED/403', u'stamp': 1262520969.721, u'url': u'http://www.bing.com/favicon.ico'}

In [52]: result = cr.next()


Guess what: your limit will still apply. If you want to clear it, you can call cr.rewind() and then cr.limit(0), after which you can manually iterate through the full result set with cr.next().

Dummies Guide to MongoDB Queries using Squid Logs (JavaScript Shell Edition)

So the MongoDB developer documentation is actually pretty decent, but it doesn't really use examples with real data. For me, that made it harder for some of the API and shell commands to sink in.

So to generate some real-world queries I created a Python script that parses the access.log file[s] generated by Squid. I'll follow this blog with one that covers PyMongo, but I think this will be helpful and, like most of my posts, will provide a good reference, because when you are rapidly approaching 40 it's not only your eyes that go but your memory too. So here goes...
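Before getting to the shell, here is a rough sketch of what such a parser might look like. It is not the script used for this post: parse_squid_line is a hypothetical helper, it assumes Squid's native log layout (time, elapsed, client, code/status, bytes, method, URL, ident, hierarchy, content type), and mapping the ident column to the format field is a guess based on the documents shown later.

```python
def parse_squid_line(line):
    """Turn one native-format access.log line into a dict, or None if malformed."""
    parts = line.split()
    if len(parts) < 10:
        return None
    return {
        "stamp": float(parts[0]),      # seconds since the epoch
        "source": parts[2],            # client address
        "squidcode": parts[3],         # e.g. TCP_MISS/200
        "size": float(parts[4]),       # bytes sent to the client
        "method": parts[5],
        "url": parts[6],
        "format": parts[7],            # assumption: the ident column, usually "-"
    }


# Made-up sample line in Squid's native log layout.
sample = ("1263096815.761    512 192.168.1.254 TCP_MISS/200 824477 GET "
          "http://example.com/video.wmv - DIRECT/192.0.2.10 video/x-ms-wmv")
doc = parse_squid_line(sample)
print(doc)
```

Each dictionary produced this way is already in the shape MongoDB wants, so it could be inserted into the collection as-is.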

First of all, this assumes you are running the mongo JavaScript shell. And yeah, I know running as root is a bad idea and not even necessary (I don't think), but sue me.

root@opti620:~/mongodb# ./bin/mongo
MongoDB shell version: 1.2.1
url: test
connecting to: test
type "help" for help
> show dbs
admin
local
mongosquid
test
> use mongosquid
switched to db mongosquid
> show collections
raw
system.indexes
>

Now let's have some fun. This was actually when I had imported just a few lines from the log file, so there are a relatively small number of documents. A collection is essentially like a table, but since this is #nosql it really isn't a table. It is just a collection of documents. We'll see those next.

> db.raw.find().count()
1029
> db.raw.find()[1029]
> db.raw.find()[1028]
{
"_id" : ObjectId("4b496cddb15cb004a4000404"),
"squidcode" : "TCP_MISS/200",
"source" : "192.168.1.254",
"stamp" : 1263102993.841,
"format" : "-",
"url" : "agmoviecontrol.netflix.com:443",
"method" : "CONNECT",
"size" : 17499
}

The JSON above is the "document." Something you'll notice is that there are basically two different data types: strings and floating-point numbers. The size field and timestamp are obviously floats. That hash-looking thing in _id is an ObjectId, a GUID-like identifier that is supposedly unique.
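The _id value is not entirely opaque: the first four bytes of an ObjectId hold its creation time as big-endian seconds since the epoch. As a small sketch in modern Python, the timestamp can be pulled out of the hex string without any driver; objectid_timestamp is a hypothetical helper name.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time (epoch seconds) embedded in an ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    return int(oid_hex[:8], 16)   # first 4 bytes: big-endian seconds since the epoch


seconds = objectid_timestamp("4b496cddb15cb004a4000404")
print(seconds, datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat())
```

For the document above this should land on the day the log lines were imported, which is a handy sanity check when poking around a collection.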

So one of the cool built in queries is to return only the unique values for a given field. This is handled by the distinct method.

So we can see here that there were no HTTP POSTs.

> db.raw.distinct("method")
[ "CONNECT", "GET" ]

And because of my screwed-up NATing I can't tell which of my kids was going to Netflix.

> db.raw.distinct("source")
[ "192.168.1.254" ]

> db.raw.distinct("url")
....
"http://netflix086.as.nflximg.com.edgesuite.net/sa0/725/1985205725.wma/range/9247565-9735184?",
"http://netflix086.as.nflximg.com.edgesuite.net/sa0/725/1985205725.wma/range/9735185-10219794?",
"http://netflix086.as.nflximg.com.edgesuite.net/sa0/725/1985205725.wma/range/985115-1469724?"

So remember when I discussed types above? If we wanted to retrieve all the transactions that were greater than 1MB we could do the following, but there is obviously more to it than that.

> db.raw.find( {size: { $gt:1000000}} )
{ "_id" : ObjectId("4b496cddb15cb004a4000162"), "squidcode" : "TCP_MISS/200", "source" : "192.168.1.254", "stamp" : 1263097489.996, "format" : "-", "url" : "http://netflix086.as.nflximg.com.edgesuite.net/sa0/166/1680180166.wmv/range/143155845-144163844?", "method" : "GET", "size" : 1008478 }
{ "_id" : ObjectId("4b496cddb15cb004a40003b0"), "squidcode" : "TCP_MISS/200", "source" : "192.168.1.254", "stamp" : 1263099100.207, "format" : "-", "url" : "http://netflix086.as.nflximg.com.edgesuite.net/sa0/166/1680180166.wmv/range/400771845-401779844?", "method" : "GET", "size" : 1008478 }

I was pleased to find that you can use regular expressions. The first query tells me there are 3199 documents that have port 443 in them, and the second query returns the first document. One of the things I noticed is that retrieving a document by its "index" is really, really slow. I believe that is because it isn't really an index, but we'll get to those later.

> db.raw.find ( { url: /:443/ }).count()
3199
> db.raw.find ( { url: /:443/ })[0]
{
"_id" : ObjectId("4b496cddb15cb004a4000093"),
"squidcode" : "TCP_MISS/200",
"source" : "192.168.1.254",
"stamp" : 1263096929.091,
"format" : "-",
"url" : "agmoviecontrol.netflix.com:443",
"method" : "CONNECT",
"size" : 96222
}
> db.raw.find ( { url: /:443/ })[0:3]
Sun Jan 10 01:16:11 JS Error: SyntaxError: missing ] in index expression (shell):0

You'll notice that array slices don't work in the shell, but they obviously do in Python, which I'll blog on next.
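For comparison, the Python side of the regular expression query might be sketched like this. PyMongo accepts compiled patterns as query values; the list of URLs is just stand-in data so the pattern can be checked without a server.

```python
import re

# Equivalent of the shell's { url: /:443/ } filter, built as a Python dictionary.
port_443 = re.compile(r":443")
query = {"url": port_443}

# Local sanity check of the pattern against URLs like those in the post.
urls = [
    "agmoviecontrol.netflix.com:443",
    "http://www.bing.com/favicon.ico",
]
matching = [u for u in urls if query["url"].search(u)]
print(matching)
```

With a live collection, something like raw.find(query)[0:3] should then give the slice that the shell refused, assuming the cursor supports slicing in your PyMongo version.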

Saturday, January 09, 2010

FreeBSD 8.0 with rum0 and wpa_supplicant on Lenovo S10-2

It looks like the driver for rum has changed slightly between FreeBSD 7.2 and FreeBSD 8.0, because I was not able to use the same command-line syntax as I did previously. Basically the only thing I did differently was the ifconfig wlan create...

I had this card running on an old Dell Optiplex acting as a bridge for my kids' network (and they were watching a lot of streaming media) and I was surprisingly impressed with it. Decent performance.

mfranz-bsd8#
ugen4.3: at usbus4
rum0: on usbus4
rum0: MAC/BBP RT2573 (rev 0x2573a), RF RT2528

mfranz-bsd8# cat /etc/wpa_supplicant.conf
network={
ssid="xxx"
psk="xxxx"
}


mfranz-bsd8# ifconfig wlan create wlandev rum0
wlan0
mfranz-bsd8# ifconfig wlan0
wlan0: flags=8802 metric 0 mtu 1500
ether 00:1c:10:e6:1a:02
media: IEEE 802.11 Wireless Ethernet autoselect (autoselect)
status: no carrier
ssid "" channel 1 (2412 Mhz 11b)
country US authmode OPEN privacy OFF txpower 0 bmiss 7 scanvalid 60
bgscan bgscanintvl 300 bgscanidle 250 roam:rssi 7 roam:rate 1
bintval 0
mfranz-bsd8#

mfranz-bsd8# wpa_supplicant -c /etc/wpa_supplicant.conf -i wlan0
CTRL-EVENT-SCAN-RESULTS
Trying to associate with xxxxxxxxxx (SSID='xxxxxxxx' freq=2437 MHz)
Associated with xxxxxxxxxxx
WPA: Key negotiation completed with xxxxxxxxxxx [PTK=CCMP GTK=TKIP]
CTRL-EVENT-CONNECTED - Connection to xxxxxxxxxx completed (auth) [id=0 id_str=]



And while I'm at it, I hadn't seen anyone who actually installed 8.0 on a Lenovo netbook, but so far so good. I've got X working (I'll blog on that later) and re seems to work well enough. Obviously the Broadcom 4312s aren't going to work, but if you have a USB wifi card or a tether you will be OK.

Next step: see if I can get my Novatel u727 card working. I suspect it should work just fine, because it worked well on OpenBSD, but you never know...


Copyright (c) 1992-2009 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:48:17 UTC 2009
root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Atom(TM) CPU N270 @ 1.60GHz (1602.40-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x106c2 Stepping = 2
Features=0xbfe9fbff
Features2=0x40c39d>
AMD Features2=0x1
TSC: P-state invariant
real memory = 1073741824 (1024 MB)
avail memory = 1026433024 (978 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 1 core(s) x 2 HTT threads
cpu0 (BSP): APIC ID: 0
cpu1 (AP/HT): APIC ID: 1
ioapic0: Changing APIC ID to 4
ioapic0 irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit> port 0x408-0x40b on acpi0
acpi_ec0: port 0x62,0x66 on acpi0
acpi_hpet0: iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
acpi_button0: on acpi0
acpi_lid0: on acpi0
acpi_button1: on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
vgapci0: port 0x60f0-0x60f7 mem 0x58280000-0x582fffff,0x40000000-0x4fffffff,0x58300000-0x5833ffff irq 16 at device 2.0 on pci0
agp0: on vgapci0
agp0: detected 7932k stolen memory
agp0: aperture size is 256M
vgapci1: mem 0x58200000-0x5827ffff at device 2.1 on pci0
pci0: at device 27.0 (no driver attached)
pcib1: at device 28.0 on pci0
pci1: on pcib1
pcib2: at device 28.1 on pci0
pci2: on pcib2
pci2: at device 0.0 (no driver attached)
pcib3: at device 28.2 on pci0
pci3: on pcib3
re0: port 0x2000-0x20ff mem 0x52010000-0x52010fff,0x52000000-0x5200ffff irq 18 at device 0.0 on pci3
re0: Using 1 MSI messages
re0: Chip rev. 0x24800000
re0: MAC rev. 0x00400000
miibus0: on re0
rlphy0: PHY 1 on miibus0
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
re0: Ethernet address: 00:26:22:0b:07:28
re0: [FILTER]
pcib4: at device 28.3 on pci0
pci4: on pcib4
uhci0: port 0x60a0-0x60bf irq 16 at device 29.0 on pci0
uhci0: [ITHREAD]
uhci0: LegSup = 0x0f00
usbus0: on uhci0
uhci1: port 0x6080-0x609f irq 17 at device 29.1 on pci0
uhci1: [ITHREAD]
uhci1: LegSup = 0x0f00
usbus1: on uhci1
uhci2: port 0x6060-0x607f irq 18 at device 29.2 on pci0
uhci2: [ITHREAD]
uhci2: LegSup = 0x0f00
usbus2: on uhci2
uhci3: port 0x6040-0x605f irq 19 at device 29.3 on pci0
uhci3: [ITHREAD]
uhci3: LegSup = 0x0f00
usbus3: on uhci3
ehci0: mem 0x58344400-0x583447ff irq 16 at device 29.7 on pci0
ehci0: [ITHREAD]
usbus4: EHCI version 1.0
usbus4: on ehci0
pcib5: at device 30.0 on pci0
pci5: on pcib5
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0x60c0-0x60cf irq 16 at device 31.1 on pci0
ata0: on atapci0
ata0: [ITHREAD]
atapci1: port 0x60d8-0x60df,0x60fc-0x60ff,0x60d0-0x60d7,0x60f8-0x60fb,0x6020-0x602f mem 0x58344000-0x583443ff irq 17 at device 31.2 on pci0
atapci1: [ITHREAD]
atapci1: AHCI called from vendor specific driver
atapci1: AHCI v1.10 controller with 4 1.5Gbps ports, PM not supported
ata2: on atapci1
ata2: [ITHREAD]
ata3: on atapci1
ata3sm0: irq 12 on atkbdc0
psm0: [GIANT-LOCKED]
psm0: [ITHREAD]
psm0: model Generic PS/2 mouse, device ID 0
cpu0: on acpi0
est0: on cpu0
p4tcc0: on cpu0
cpu1: on acpi0
est1: on cpu1
p4tcc1: on cpu1
pmtimer0 on isa0
orm0: at iomem 0xcf000-0xcffff pnpid ORM0000 on isa0
sc0: at flags 0x100 on isa0
sc0: VGA <16 flags="0x300">
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ppc0: parallel port not found.
Timecounters tick every 1.000 msec
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 12Mbps Full Speed USB v1.0
usbus2: 12Mbps Full Speed USB v1.0
usbus3: 12Mbps Full Speed USB v1.0
usbus4: 480Mbps High Speed USB v2.0
ad4: 152627MB at ata2-master SATA150
ugen0.1: at usbus0
uhub0: on usbus0
ugen1.1: at usbus1
uhub1: on usbus1
ugen2.1: at usbus2
uhub2: on usbus2
ugen3.1: at usbus3
uhub3: on usbus3
ugen4.1: at usbus4
uhub4: on usbus4
: [ITHREAD]
GEOM: ad4: partition 1 does not start on a track boundary.
GEOM: ad4: partition 1 does not end on a track boundary.
uhub0: 2 ports with 2 removable, self powered
uhub1: 2 ports with 2 removable, self powered
uhub2: 2 ports with 2 removable, self powered
uhub3: 2 ports with 2 removable, self powered
Root mount waiting for: usbus4
Root mount waiting for: usbus4
Root mount waiting for: usbus4
uhub4: 8 ports with 8 removable, self powered
Root mount waiting for: usbus4
Root mount waiting for: usbus4
ugen4.2: at usbus4
Trying to mount root from ufs:/dev/ad4s2a
ugen0.2: at usbus0
ums0: on usbus0
ums0: 2 buttons and [XY] coordinates ID=0
drm0: on vgapci0
vgapci0: child drm0 requested pci_enable_busmaster
info: [drm] AGP at 0x40000000 256MB
info: [drm] Initialized i915 1.6.0 20080730

Thursday, January 07, 2010

Some Shallow & Superficial Reasons for Picking MongoDB for your [web]app



So I first got turned on to #nosql databases a little over (or under) a year ago with CouchDB, but lately I've been quite enamored with MongoDB.

So forget about deep architectural reasons for using it. Here are some quite practical reasons for when you are not a full-time developer (or database guru) but you find yourself doing development that involves a data store, and the thought of using MySQL (so like 2000s) in your app doesn't appeal:
  • Abhorrence for schemas, ORMs, and migrations - this is basically the laziness argument. Basically I want/need to store stuff. And the stuff I want to store might change, and I don't want to have to deal with changing the schema (and my app) to adapt to those changes. This is why document-oriented databases like CouchDB and MongoDB rule. If everything is a JSON object, you have a great place to store stuff.
  • Ease of Installation & Compilation -- yep, CouchDB has been in the latest Ubuntu repos for a while, but I use Lenny/Hardy server side, so forget about it. Dealing with Erlang (and finding all the dependencies to build SpiderMonkey) was a big pain in the ass. Beam, what the hell is beam? Mongo has 32/64 bit Linux binaries that just work, and I briefly managed to get it to compile on FreeBSD 7.2. And unlike some of the others out there it doesn't require a JRE.
  • Map/Reduce hurts my head - ease of use, specifically the simplicity of queries, is one of the key differentiators between Mongo and CouchDB. I'm not an expert yet, but having to create Map/Reduce functions to create views to get at your data was a slippery concept for me.
  • Non-HTTP Transport -- unlike CouchDB, Mongo has a binary client/server protocol and doesn't use HTTP.
There are also some really cool features like capped collections that should be useful for the app I'm working on, but these were some of the reasons why I went with Mongo. Back to coding...


Tuesday, January 05, 2010

Pansy or Victim?





So unfortunately someone who I [used to] follow over on @frednecksec cited an article over on prisonplanet.com, which allowed me to check out the cool sponsors such as the one pictured above, but don't forget Silverlungs.

To each his own; I prefer inhaling gaseous gold myself. Much better preparation for the "End Times," the "New World" or whatever the "elites" have in store for us.