Saturday, December 01, 2007

Using Hashes Like it is 1999

This week I picked up [what I thought would be] a quick logfile analysis task. Things started out great. I took the time to look at the logfile format and generalized the 4-5 different message types (with appropriate regexes to get the data I needed) generated by the security device. Next I extended a basic "logrunner" class I wrote last month for analyzing the debug output from the Intel FreeBSD drivers (basically you issue some sysctls and it dumps kernel messages so you can see counters for missed and received packets--much better than netstat).

In my logrunner class, you can "attach" various simple regex matches along with a symbol, and you get back a nice hash with the values you want; it hides all the low-level details of matching, handling timestamps, etc. (HINT: If you are mucking with syslog files in Ruby and you are not using the Time API, you are a fool, but I digress).
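The original class isn't shown, but the "attach a regex plus a symbol, get a hash back" idea could look something like this minimal sketch (all class, method, and field names here are my invention, not the actual logrunner code, and it uses named captures rather than whatever the original did):

```ruby
require 'time'

# Hypothetical sketch of the "attach a regex plus a symbol" idea.
class LogRunner
  def initialize
    @matchers = {}
  end

  # Associate a symbol with a regex whose named captures become hash keys.
  def attach(symbol, regex)
    @matchers[symbol] = regex
  end

  # Yield a { type:, time:, fields... } hash for each line that matches.
  def run(lines)
    lines.each do |line|
      @matchers.each do |symbol, regex|
        next unless (m = regex.match(line))
        fields = m.names.map { |n| [n.to_sym, m[n]] }.to_h
        # Syslog timestamps lack a year; Time.parse assumes the current one.
        fields[:time] = Time.parse(fields.delete(:stamp)) if fields[:stamp]
        yield({ type: symbol }.merge(fields))
        break
      end
    end
  end
end

runner = LogRunner.new
runner.attach(:login,
  /(?<stamp>\w{3} [ \d]\d \d\d:\d\d:\d\d).*login from (?<ip>[\d.]+)/)

results = []
runner.run(["Dec  1 09:15:02 fw1 login from 10.0.0.5"]) { |h| results << h }
```

The caller only ever sees symbols and hashes; the regex and timestamp plumbing stay inside the class.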

A few hours in, things seemed to be going fine, but then some distractions kept me from working on it again until the next afternoon (I had overconfidently estimated this would take about 4 hours from start to finish), so I was in a rush. The initial desire to develop a more general-purpose, properly designed tool was replaced with the brute-force, quick-hack, get-r-done approach.

I ended up iterating through the hashes output by the logrunner tool to create more hashes, some with the IP address as a key, others with a username as the key. And all of these pointed to at least one more hash (or two), so I ended up with something like:

blah[blah][blah] = { 1 => { a => b, c => d }, 3 => { a => q, d => z } }
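In runnable form, that kind of structure tends to get built with nested auto-vivifying hashes (the keys and values below are made up for illustration; the post doesn't show the real ones):

```ruby
# Nested Hash.new blocks auto-create each level on first access.
by_ip = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = {} } }

by_ip["10.0.0.5"]["alice"][1] = { "a" => "b", "c" => "d" }
by_ip["10.0.0.5"]["alice"][3] = { "a" => "q", "d" => "z" }

# Reading it back means chaining lookups and hoping every level exists.
value = by_ip["10.0.0.5"]["alice"][3]["a"]   # => "q"
```

Each write is easy; it's the reads, and remembering what each level means, that get painful.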

This would have been a trivial task except there was no single session identifier (or even username or IP address) on each line that I could use to tie the various pieces of data together. I kept getting confused (alternating between |k| and |k,v| in my Ruby blocks), and it took me longer than I had hoped, but I was done in about 7 hours. I had the output I wanted: a few hundred megs of logs turned into a nice Excel-friendly CSV file. And I thought I was done.
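For anyone who hasn't hit the |k| vs |k,v| confusion: when you iterate a hash with a single block parameter, Ruby hands you the [key, value] pair as one array, while two parameters get destructured automatically. A small illustration (the data is invented):

```ruby
sessions = { "10.0.0.5" => { user: "alice" }, "10.0.0.9" => { user: "bob" } }

# One parameter: each element is a two-element [key, value] Array.
pairs = sessions.map { |pair| pair.class }

# Two parameters: Ruby splits the pair into key and value for you.
users = sessions.map { |ip, data| data[:user] }
```

Deeply nested hashes multiply the chances of getting the arity wrong at some level.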

Until Friday afternoon, when I found out some additional data was needed. Extracting the data wasn't a problem (that was done in 5 minutes), but correlating it and getting the report format right was. Should I add another hash? Redefine the hashes I'd already written? Five o'clock on Friday (with restless, hungry kids) is not a time for clarity of thought, but this morning I realized the Ruby I had written was as unreadable as the Perl I used to write back in the day.

Spending the afternoon driving out in snow that turned to sleet and then to rain finally beat some sense into me. I mapped out the data on paper (this time) and did it right. I came up with 6 simple classes (2 base and 4 sub) to abstract away the hashes and ended up with less than 1/10th the lines of code in the main loop and a 1/3 of the iterations. Nothing fancy, no Ruby foo, nothing that couldn't be done in Python. And the code is actually readable. The moral of the story? If you are using hashes 4-6 levels deep, you have a problem. Stop, step away from the keyboard, and come up with a cleaner design. Do it right the first time; you won't regret it. Because quick hacks have a funny way of running on systems for a long, long time.
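The post doesn't show the 6 classes, but the shape of the refactor--small classes with named attributes instead of anonymous hash levels--might look roughly like this sketch (all names are invented; this is the general technique, not the actual code):

```ruby
# One small class per concept instead of one hash level per concept.
class Session
  attr_reader :ip, :user, :events

  def initialize(ip:, user:)
    @ip, @user = ip, user
    @events = []
  end

  def record(event)
    @events << event
  end

  def to_csv_row
    [ip, user, events.size].join(",")
  end
end

class Report
  def initialize
    # One shallow hash, keyed by whatever correlates the data.
    @sessions = {}
  end

  def session_for(ip, user)
    @sessions[[ip, user]] ||= Session.new(ip: ip, user: user)
  end

  def rows
    @sessions.values.map(&:to_csv_row)
  end
end

report = Report.new
report.session_for("10.0.0.5", "alice").record("login")
report.session_for("10.0.0.5", "alice").record("logout")
```

The main loop now reads as method calls with names, and the nesting lives in exactly one place (the Report's keying) instead of everywhere.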
