<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>neilkodner.com &#187; python</title>
	<atom:link href="http://www.neilkodner.com/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.neilkodner.com</link>
	<description>Data Driven.  Since 1971.</description>
	<lastBuildDate>Sun, 23 Oct 2011 16:40:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>An analysis of Steve Jobs tribute messages displayed by Apple</title>
		<link>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/</link>
		<comments>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 21:08:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[stevejobs]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=569</guid>
		<description><![CDATA[Two weeks have passed since Apple&#8217;s Co-Founder/CEO Steve Jobs passed away.  Upon his passing, Apple encouraged people to share their memories, thoughts, and feelings by emailing rememberingsteve@apple.com. Earlier this week, Apple posted a site (http://www.apple.com/stevejobs) in tribute to Steve Jobs. According to the site, over a million people have submitted messages. The site cycles through the submitted [...]]]></description>
			<content:encoded><![CDATA[<p>Two weeks have passed since Apple&#8217;s Co-Founder/CEO Steve Jobs passed away.  Upon his passing, Apple encouraged people to share their memories, thoughts, and feelings by emailing <a href="https://mail.google.com/mail/?view=cm&amp;fs=1&amp;tf=1&amp;to=rememberingsteve@apple.com" target="_blank">rememberingsteve@apple.com</a>. Earlier this week, Apple posted a <a href="http://www.apple.com/stevejobs/" target="_blank">site</a> (<a href="http://www.apple.com/stevejobs/" target="_blank">http://www.apple.com/stevejobs</a>) in tribute to Steve Jobs. According to the site, over a million people have submitted messages. The site cycles through the submitted messages.</p>
<p>I decided to take a closer look at what people are saying about Steve Jobs, as a whole. Looking at how the site updates, it appears to use Ajax to retrieve and display new messages. Using Chrome&#8217;s developer tools, I monitored the requests it was making to get the new messages.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/10/Apple-Remembering-Steve-Jobs2.png"><img class="alignnone size-large wp-image-574" title="Apple - Remembering Steve Jobs" src="http://www.neilkodner.com/wp-content/uploads/2011/10/Apple-Remembering-Steve-Jobs2-1024x892.png" alt="" width="819" height="714" /></a><br />
Once I found the location of the individual messages, it was trivial to download all of them. The message endpoint URLs are in the format</p>
<pre class="brush: xml; title: ; notranslate">

http://www.apple.com/stevejobs/messages/3679.json?28106802
</pre>
<p>and a sample message looks like</p>
<pre class="brush: jscript; title: ; notranslate">
{
mainText: &quot;This is equivalent to my mom's generation of Elvis dying for me. I am very
sadden and emotionally moved at the moment. He was more influential on my
life than my parents and friends. While my parents loved me and friends
shared fun times. Steve influenced me, motivated me to become the innovated,
creative technologist I have become. I got into computer technology in 1980
and moved to Silicon Valley because of him. I have been one of his biggest
admirers and looked to him as a mentor to push the boundaries of my own
creative abilities to develop technology solutions which I hope made a
difference and impact to the industries I worked in. We've lost a
significant influence and icon in technology. We won't see another person of
his innovation and foresight within my life time. He was the Edison of
technology. He was and is one of my biggest inspirations.

I feel I have lost a close family member&quot;
header: &quot;What Steve Jobs meant to me&quot;
author: &quot;Skip&quot;
location: &quot;&quot;
}
</pre>
<p>The site makes a request to <a href="http://www.apple.com/stevejobs/messages/main.json" target="_blank">http://www.apple.com/stevejobs/messages/main.json</a> which returns</p>
<pre class="brush: jscript; title: ; notranslate">
 {
 totalMessages: &quot;10975&quot;
 timestamp: &quot;28106802&quot;
 }
</pre>
<p>So it appears that the site cycles through 10975 messages. I didn&#8217;t pick apart the JavaScript powering the site to confirm this; I just made an assumption. Querying values greater than 10975 returned a 404, which supports it. I wrote a quick Python program to download the messages:</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/python
import urllib2
import simplejson as json
import time
import codecs

# a page on apple's site shows the # of messages available
# start with 0 and retrieve up to message_range messages
metadata = json.loads(urllib2.urlopen('http://www.apple.com/stevejobs/messages/main.json').read())
# totalMessages arrives as a string, so convert it before handing it to range()
message_range = int(metadata['totalMessages'])

# the url for each message. i learned of this url by inspecting
# the network calls to http://www.apple.com/stevejobs
# using chrome's developer tools
url=&quot;http://www.apple.com/stevejobs/messages/%d.json&quot;

# create our destination file
# i'm using codecs because it does a better job at handling international characters
output_file = 'stevejobs_tribute.txt'
file_handle = codecs.open(output_file,'w','utf-8')

# helper function to remove tabs and linefeeds
def clean(txt):
  return txt.replace('\n','').replace('\t','')

# iterate from 0 to the max # of messages and download the message text
# for these purposes, I'm ignoring the other fields as they weren't always present
for i in range(0, message_range):
  req = url % i
  data = urllib2.urlopen(req).read()
  data = json.loads(data)
  file_handle.write(clean(data['mainText']) + '\n')
file_handle.close()
</pre>
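<p>For anyone repeating this today, here is a minimal Python 3 sketch of the same download loop. Only the URL pattern comes from the listing above; the function names, the 404 handling, and the injectable <em>fetch</em> argument are assumptions added for illustration, and unlike the original helper this version collapses line breaks into spaces instead of deleting them.</p>

```python
import json
import time
import urllib.error
import urllib.request

# URL pattern taken from the post; everything else here is an assumption.
MESSAGE_URL = "http://www.apple.com/stevejobs/messages/%d.json"


def http_get(url):
    """Fetch a URL and return the decoded body, or None on a 404."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def clean(text):
    """Collapse runs of whitespace (tabs, line breaks) into single spaces."""
    return " ".join(text.split())


def fetch_messages(total, fetch=http_get, delay=0.0):
    """Yield the cleaned mainText of messages 0..total-1, skipping gaps."""
    for i in range(int(total)):  # int() because totalMessages is a string
        body = fetch(MESSAGE_URL % i)
        if body is None:
            continue  # missing message id, move on to the next one
        message = json.loads(body)
        yield clean(message.get("mainText", ""))
        time.sleep(delay)  # optional pause between requests
```

<p>Passing a different <em>fetch</em> callable (for example a dictionary&#8217;s <em>get</em> method) makes it possible to try the loop without touching the network.</p>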
<p>So now we have over ten thousand tribute messages saved to the file <a href="https://github.com/neilkod/steve_jobs_tribute_messages/tree/master/data">stevejobs_tribute.txt</a>. What I was most interested in was seeing how many of these messages reference a particular Apple product.<br />
I came up with a few search terms based on some legendary Apple product names, including</p>
<ul>
<li>Newton</li>
<li>Macintosh</li>
<li>MacBook</li>
<li>iBook</li>
<li>Mac</li>
<li>iPhone</li>
<li>iPod</li>
<li>iMac</li>
<li>iPad</li>
<li>Apple II family</li>
<li>OSX</li>
<li>iMovie</li>
<li>Apple TV</li>
<li>iTunes</li>
<li>LaserWriter (yes, <a href="http://en.wikipedia.org/wiki/LaserWriter" target="_blank">Laserwriter</a>)</li>
</ul>
<div>Each product received an entry in a Python dictionary. The value is another dictionary containing a regex for the product name and a count for the running total. Some of the regular expressions are as simple as testing for an optional s at the end of the product name; some are a little more complex &#8211; check out the Apple II regular expression, which tries to match the entire Apple II product line. As I&#8217;m OK but not great with regular expressions, I welcome your corrections.</div>
<pre class="brush: python; title: ; notranslate">
products = {'iPhone':{'regex':'iphones?','count':0},
	'iMac':{'regex':'imacs?','count':0},
	'iPad':{'regex':'ipads?','count':0},
	'iTunes':{'regex':'itunes','count':0},
	'iPod':{'regex':'ipods?','count':0},
	'cube':{'regex':'cubes?','count':0},
	'MacBook':{'regex':'macbooks?','count':0},
	'iBook':{'regex':'ibooks?','count':0},
	'Apple TV':{'regex':'apple ?tvs?','count':0},
	'Apple II Family':{'regex':r'(apple )?(2|ii|\]\[|\/\/)([ce\+|]|gs|s)?[^0-9]', 'count':0},
	'LaserWriter':{'regex':'laserwriters?','count':0},
	'PowerBook':{'regex':'powerbooks?','count':0},
	'Newton':{'regex':'newtons?','count':0},
	'OSX':{'regex':'osx','count':0},
	'iMovie':{'regex':'imovie','count':0},
	'Macintosh':{'regex':'macintosh','count':0},
	'Lisa':{'regex':'lisa','count':0},
	'Mac':{'regex':'mac','count':0},
}
</pre>
<p>Here&#8217;s a screenshot of me testing the Apple II regular expression, using the excellent <a href="http://gskinner.com/RegExr/" target="_blank">Regexr</a>.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/10/apple-2-regex-testing.png"><img class="alignnone size-full wp-image-623" title="apple 2 regex testing" src="http://www.neilkodner.com/wp-content/uploads/2011/10/apple-2-regex-testing.png" alt="" width="424" height="388" /></a></p>
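<p>One caveat with the Apple II pattern: because the leading (apple ) group is optional, a bare &#8220;2&#8221; or &#8220;ii&#8221; followed by any non-digit also matches. The Python 3 sketch below shows this on a few made-up sentences and tries a stricter variant. The stricter pattern and the sample sentences are assumptions for illustration only; they are not what produced the counts in this post.</p>

```python
import re

# The pattern from the post, unchanged.
ORIGINAL = r'(apple )?(2|ii|\]\[|\/\/)([ce\+|]|gs|s)?[^0-9]'

# A stricter sketch (an assumption, not the pattern behind the counts):
# "2" and "ii" only count after the word "apple", while the "][" and "//"
# spellings may stand alone when a model suffix follows them.
STRICTER = re.compile(
    r'\bapple\s*(?:2|ii|\]\[|//)\s*(?:gs|[ce+])?(?![0-9a-z])'
    r'|(?:\]\[|//)\s*(?:gs|[ce+])(?![0-9a-z])'
)

# Hypothetical, already lower-cased messages.
samples = [
    "my first computer was an apple iie",
    "i learned to program on a //c",
    "typing this on my ipad 2 with love",
    "greetings from hawaii to the jobs family",
]

for text in samples:
    loose = bool(re.search(ORIGINAL, text))
    strict = bool(STRICTER.search(text))
    print("%-45s original=%s stricter=%s" % (text, loose, strict))
```

<p>The original pattern matches all four samples, including the iPad 2 and Hawaii sentences; the stricter sketch matches only the first two. It still has blind spots (a phrase like &#8220;apple 2 years ago&#8221; would match), so treat it as a starting point.</p>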
<p>Overall, out of 10,975 messages downloaded (as of now), 2,186, or just under 20%, mentioned an Apple product by name. Here&#8217;s the breakdown of the products mentioned:</p>
<pre class="brush: plain; title: ; notranslate">
LaserWriter        1
iMovie             3
OSX                9
iBook             22
PowerBook         22
Lisa              24
Apple TV          31
Newton            33
iTunes            52
Macintosh        163
iMac             235
MacBook          366
Apple II Family  481
iPad             574
iPod             575
iPhone           875
Mac             1315
</pre>
<p>More than one out of every ten messages included a reference to a Mac! Nearly one in ten mentioned an iPhone &#8211; not bad for a device that&#8217;s been out a fraction of the time the Mac has been available. I&#8217;m pleased to see so many references to the Apple II, including several mentions of the //c, which was my first Apple product.</p>
<p>It&#8217;s also interesting to note that out of 33 mentions of Newton, only a handful of those were about the actual Apple product &#8211; most were comparing Steve Jobs to Newton himself. Check out my <a href="http://www.neilkodner.com/2010/10/fun-with-nltk-and-zoolander-part-1-concordance/" target="_blank">earlier post on NLTK concordance</a> for details on how I did this:</p>
<pre class="brush: python; title: ; notranslate">
import nltk
import string
f = open('stevejobs_tribute.txt').read()
f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
foo=nltk.Text(f.split())
foo.concordance('newton')  # concordance() prints its own output and returns None
</pre>
<p>result:</p>
<pre class="brush: plain; title: ; notranslate">
op If history misses men like Isaac Newton Graham Bell Galileu Thomas Edison a
mbered though his legacy Now he met Newton Einstein and other geniuses like hi
oday I was one of the few who had a Newton Today I have an iPhone 4 an iPad2 a
oduct that came thereafter from the Newton to the Cube to the iPhone 4S God Bl
with the likes of Edison Garcia and Newton for his impact and vision I wish hi
ntioned in the same breath as Isaac Newton Thomas Edison and Bill Gates The le
 off a tree we are thinking of Adam Newton and Steve Jobs He open new dimensio
Jobs will be missed Da Vinci Mozart Newton Franklin Jobs Nobody is out of plac
ged my life starting with the Apple Newton followed by the iPod and then the i
 sorely missed nbsp Da Vinci Mozart Newton Franklin Jobs Nobody is out of plac
ve dared to Einstein Freud Da Vinci Newton Galileo Darwin among others is prou
embered beside Einstein Pasteur and Newton The world is moving toward his crea
irst Apple Mac I remember the first Newton I willnbspremembernbspSteves creati
e to contact us againnbsp How Isaac Newton and Albert Einstein contributed gre
 world One seduced Eve One awakened Newton and One was in the hands of Steve J
the way you have influenced mine If Newton discovered something as remarkable
rld One seduced Eve second awakened Newton the third one was in the hands of S
lent to Leonardo Da Vinci Sir Issac Newton Albert Einstein and the like He was
t of the caliber of that of DaVinci Newton Pythagorous etc The list can go on
hen people say names like ie Edison Newton and Einstein I guarantee that the n
 Computers” The Apple II Lisa Mac Newton iPod iTunes store iPod Touch iPhone
ember Steve Jobs the way I remember Newton or Einstein I lived with Apple prod
set consultant who bought his first Newton MacBook 170 and all the dozens of o
 br 3 Apples change the world Adán Newton Steve Jobs 19552011 Rest in Peace t
back to the Apple IIGS I also had a Newton Steve Jobs death hurts me personall
ed the world apple to adam apple to newton and apple to steve jobs Steve was a
dam and Eva Second one that wake up newton third one that Steve Jobs create St
</pre>
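<p>If NLTK isn&#8217;t available, the keyword-in-context idea behind a concordance can be approximated in a few lines of plain Python 3. This is a rough sketch, not NLTK&#8217;s implementation: the function name and the <em>width</em> parameter are made up here, and unlike the listing above it leaves punctuation in place.</p>

```python
import re


def concordance(text, term, width=35):
    """Return one keyword-in-context line per occurrence of term."""
    text = " ".join(text.split())  # normalize line breaks and tabs to spaces
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # right-align the left context so the keyword lines up in a column
        lines.append("%s%s%s" % (left.rjust(width), match.group(0), right))
    return lines


# Hypothetical sample text, not taken from the tribute data.
for line in concordance("Isaac Newton and the Apple Newton", "newton", width=12):
    print(line)
```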
<p>Also interesting was the number of mentions of other historical figures in the Steve Jobs remembrance messages. According to the submitters, Steve Jobs is clearly in some elite company. I don&#8217;t know if I&#8217;d go so far as to group him with the men who brought automobiles and light bulbs to the masses but hey, we all have our priorities. All counts were determined through a simple grep command piped to wc -l, so each figure is the number of matching lines (messages). Here are a few examples:</p>
<ul>
<li>Einstein &#8211; 70</li>
<li>Ford &#8211; 189</li>
<li>Edison &#8211; 110</li>
<li>DaVinci &#8211; 15</li>
<li>Bill Gates &#8211; 8</li>
</ul>
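<p>A rough Python 3 equivalent of the grep-and-count step is sketched below, assuming a case-insensitive match (the exact grep flags aren&#8217;t given above). It also illustrates why plain substring counts deserve some caution: a short name such as Ford can match inside longer words. The function name and the sample lines are hypothetical.</p>

```python
import re


def count_matching_lines(lines, term, whole_word=False):
    """Count lines containing term, ignoring case (like grep -i | wc -l)."""
    pattern = re.escape(term)
    if whole_word:
        pattern = r"\b%s\b" % pattern
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for line in lines if regex.search(line))


# Hypothetical sample lines, not taken from the tribute data.
sample = [
    "He ranks with Ford and Edison.",
    "I could never afford a Mac, but I admired him.",
    "His Stanford speech changed my life.",
]
print(count_matching_lines(sample, "ford"))                   # substring match
print(count_matching_lines(sample, "ford", whole_word=True))  # whole word only
```

<p>On these three lines the substring count is 3 while the whole-word count is 1.</p>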
<p>Finally, I wanted to see how people were speaking about Steve Jobs and especially which terms were being used to describe him. There was no point in performing sentiment analysis on this text, as the messages were not only obviously positive but had also been vetted by Apple for content. Using NLTK, I performed part-of-speech tagging on every word in each tribute message and then wrote some code to total the adjectives and adverbs used.</p>
<p>The most commonly-used adjectives are</p>
<pre class="brush: plain; title: ; notranslate">
('great', 1961)
('steve', 1808)
('many', 1459)
('first', 917)
('sad', 862)
('better', 857)
('such', 727)
('best', 721)
('visionary', 645)
('new', 579)
('more', 556)
('true', 538)
('most', 476)
('creative', 471)
('apple', 435)
('other', 427)
('same', 415)
('good', 412)
('greatest', 376)
('wonderful', 373)
('sorry', 362)
('old', 325)
('brilliant', 283)
('able', 281)
('incredible', 267)
('big', 260)
</pre>
<p>Humorously, NLTK frequently considered &#8220;Steve&#8221; to be an adjective. This is likely because it is so often followed by the proper noun &#8220;Jobs.&#8221; A <a href="http://twitter.com/#!/japerk/status/127054008060878848">tweet</a> from <a href="http://www.streamhacker.com">NLTK expert Jacob Perkins</a> reminded me that machines are dumb and proper nouns should be capitalized. To aggregate the counts, I had normalized the text by converting it to lowercase. I wasn&#8217;t interested in nouns, only adjectives, so proper nouns didn&#8217;t matter to me.<br />
The top adverbs, according to NLTK, were not as interesting, at least to me.</p>
<pre class="brush: plain; title: ; notranslate">
('so', 2220)
('never', 2111)
('not', 1897)
('always', 1798)
('just', 1402)
('now', 1028)
('truly', 989)
('only', 945)
('very', 919)
('much', 908)
('ever', 751)
('even', 743)
('really', 567)
('forever', 508)
('more', 486)
('still', 447)
('well', 398)
('most', 375)
('personally', 352)
</pre>
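<p>The tallying behind the two lists above boils down to grouping tagged tokens by tag prefix. Here is a minimal Python 3 sketch of that step. The tagged list is a hand-written stand-in for what nltk.pos_tag() would return, so the tags are illustrative rather than real tagger output, and the helper name is made up. Note that it lower-cases only when counting, which would let the tagger see the original capitalization first; that is a variation on the approach described above, not a description of it.</p>

```python
from collections import Counter


def tally_by_tag(tagged_tokens, prefix):
    """Count lowercased words whose Penn Treebank tag starts with prefix."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag.startswith(prefix):
            counts[word.lower()] += 1
    return counts


# Hand-tagged stand-in for the output of nltk.pos_tag(tokens);
# the tags below are illustrative, not real tagger output.
tagged = [
    ("a", "DT"), ("truly", "RB"), ("great", "JJ"), ("man", "NN"),
    ("and", "CC"), ("the", "DT"), ("greatest", "JJS"), ("visionary", "NN"),
    ("Great", "JJ"), ("work", "NN"),
]

adjectives = tally_by_tag(tagged, "JJ")  # JJ, JJR and JJS all start with JJ
adverbs = tally_by_tag(tagged, "RB")     # RB, RBR and RBS all start with RB
print(adjectives.most_common(2))
print(adverbs.most_common(2))
```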
<p>And finally, I ran a trigram analysis, again using NLTK.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict
import nltk

# text is the list of tokens for a single message
trigrams = defaultdict(int)
nltk_trigrams = nltk.trigrams(text)
for itm in nltk_trigrams:
  trigrams[itm] += 1
</pre>
<p>As one would expect, the leading trigram was &#8216;<strong>rest in peace</strong>&#8217; with 1838 mentions, a number equal to 16.7% of all messages. &#8216;<strong>thank you for</strong>&#8217; appeared 1446 times and &#8216;<strong>will be missed</strong>&#8217; 827 times. Other interesting trigrams are &#8216;<strong>thank you steve</strong>&#8217; with 791 mentions and &#8216;<strong>changed the world</strong>&#8217; with 551 mentions.</p>
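<p>The same kind of count can be sketched in Python 3 without NLTK by zipping three staggered views of each message&#8217;s token list. This is an illustration under assumptions: the simple regex tokenizer below drops punctuation, whereas the WordPunctTokenizer used in this post keeps punctuation as separate tokens, so the numbers would not match exactly. The sample messages are made up.</p>

```python
from collections import Counter
import re


def trigram_counts(messages):
    """Count word trigrams per message so none span two messages."""
    counts = Counter()
    for message in messages:
        tokens = re.findall(r"[a-z0-9']+", message.lower())
        # zip three staggered views of the token list to form trigrams
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts


# Hypothetical sample messages, not taken from the tribute data.
messages = [
    "Rest in peace, Steve.",
    "Thank you for everything. Rest in peace.",
]
print(trigram_counts(messages).most_common(1))
```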
<p>The full Python code and resulting data can be found on <a href="https://github.com/neilkod/steve_jobs_tribute_messages" target="_blank">GitHub</a>.</p>
<pre class="brush: python; title: ; notranslate">

#!/usr/bin/python
#nltk.help.upenn_tagset('RB')
from collections import defaultdict
from operator import itemgetter
import re
import urllib2
import string
import simplejson as json

import codecs
import nltk

OUTPUT_FILE = 'data/stevejobs_tribute.txt'

adverbs = defaultdict(int)
adjectives = defaultdict(int)
trigrams = defaultdict(int)

message_has_adjective = False
message_has_adverb = False
message_contains_product_mention = False
messages_with_adjective = 0
messages_with_adverb = 0
messages = 0
messages_with_product_mention = 0

exclude = set(string.punctuation)

products = {'iPhone':{'regex':'iphones?','count':0},
	'iMac':{'regex':'imacs?','count':0},
	'iPad':{'regex':'ipads?','count':0},
	'iTunes':{'regex':'itunes','count':0},
	'iPod':{'regex':'ipods?','count':0},
	'cube':{'regex':'cubes?','count':0},
	'MacBook':{'regex':'macbooks?','count':0},
	'iBook':{'regex':'ibooks?','count':0},
	'Apple TV':{'regex':'apple ?tvs?','count':0},
	'Apple II Family':{'regex':r'(apple )?(2|ii|\]\[|\/\/)([ce\+|]|gs|s)?[^0-9]', 'count':0},
	'LaserWriter':{'regex':'laserwriters?','count':0},
	'PowerBook':{'regex':'powerbooks?','count':0},
	'Newton':{'regex':'newtons?','count':0},
	'OSX':{'regex':'osx','count':0},
	'iMovie':{'regex':'imovie','count':0},
	'Macintosh':{'regex':'macintosh','count':0},
	'Lisa':{'regex':'lisa','count':0},
	'Mac':{'regex':'mac','count':0},
}

def top_n(dct,n = 10):
	srtd=sorted(dct.iteritems(), key=itemgetter(1), reverse=True)
	for x in srtd[:n]:
		print x

def nltk_concordance(term,text_file):
	f = open(text_file).read()
	# remove punctuation
	f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
	split_text=nltk.Text(f.split())
	split_text.concordance(term,lines=100)

	# &gt;&gt;&gt; f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
	# &gt;&gt;&gt; foo=nltk.Text(f.split())
	# &gt;&gt;&gt; print foo.concordance('newton')

def unescape(s):
	&quot;&quot;&quot;unescapes html codes&quot;&quot;&quot;
	s = s.replace(&quot;&amp;lt;&quot;, &quot;&lt;&quot;)
	s = s.replace(&quot;&amp;gt;&quot;, &quot;&gt;&quot;)
	s = s.replace(&quot;&amp;nbsp;&quot;, &quot; &quot;)
	# this has to be last:
	s = s.replace(&quot;&amp;amp;&quot;, &quot;&amp;&quot;)
	return s

for line in open(OUTPUT_FILE):
	message_has_adjective = False
	message_has_adverb = False
	message_contains_product_mention = False

	# remove the trailing linefeed and convert to lower-case
	# and remove html control characters
	messages += 1
	data = line.strip()
	data = data.lower()
	data = unescape(data)

	# check for product mentions
	for k,v in products.iteritems():
		if re.search(v['regex'],data):
			products[k]['count'] += 1
			message_contains_product_mention = True

	# if the message contains a product mention
	# increment the product mention counter
	if message_contains_product_mention:
		messages_with_product_mention += 1

	# tokenize the sentences using nltk's wordpuncttokenizer
	text = nltk.WordPunctTokenizer().tokenize(data)

	# compute trigrams
	nltk_trigrams = nltk.trigrams(text)
	for itm in nltk_trigrams:
		trigrams[itm] += 1

	# pos-tag each token. we're interested in adjectives and adverbs
	parts_of_speech = nltk.pos_tag(text)
	# test for adjectives and adverbs, increment the counters
	# when we find one.

	for (word,pos) in parts_of_speech:
		if pos.startswith('JJ'):
			message_has_adjective = True
			adjectives[word] += 1

		if pos.startswith('RB'):
			message_has_adverb = True
			adverbs[word] += 1

	# if the message contains an adverb or an adjective, increment a counter
	if message_has_adjective:
		messages_with_adjective += 1
	if message_has_adverb:
		messages_with_adverb += 1

# output the 25 most frequently-used adjectives and adverbs
n = 25
print &quot;top %s adverbs&quot; % n
top_n(adverbs, n)
print
print &quot;top %s adjectives&quot; % n
top_n(adjectives, n)

print &quot;messages with adjectives: %s&quot; % messages_with_adjective
print &quot;messages with adverbs: %s&quot; % messages_with_adverb
print &quot;total messages with product mentions: %s&quot; % messages_with_product_mention
print &quot;total messages: %s&quot; % messages

# output the top 50 most-common trigrams
n = 50
print &quot;top %s trigrams&quot; % n
top_n(trigrams, n)
# sort the products by their mention count, lowest first
srtd=sorted(products.iteritems(),key=lambda kv: kv[1]['count'])
for x,y in srtd:
	print &quot;%s\t\t%s&quot; % (x,y['count'])

print
print
# concordance for newton
print &quot;concordance for newton:&quot;
nltk_concordance('newton',OUTPUT_FILE)
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/feed/</wfw:commentRss>
		<slash:comments>52</slash:comments>
		</item>
		<item>
		<title>Chuck Norris doesn&#8217;t screen-scrape, the data runs scared to his hard drive.</title>
		<link>http://www.neilkodner.com/2011/03/chuck-norris-doesnt-screen-scrape-the-data-runs-scared-to-his-hard-drive/</link>
		<comments>http://www.neilkodner.com/2011/03/chuck-norris-doesnt-screen-scrape-the-data-runs-scared-to-his-hard-drive/#comments</comments>
		<pubDate>Tue, 01 Mar 2011 20:51:10 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[beautifulsoup]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[screen-scraping]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=534</guid>
		<description><![CDATA[Inspired by a tweet from Roger Ehrenberg and my 11-year-old son who&#8217;s crazy about Chuck Norris facts, I screen-scraped the contents of http://www.chucknorrisfacts.com. Code and data can be found here. Using Python and BeautifulSoup, it simply loops through all of the pages on http://www.chucknorrisfacts.com and reads the items displayed on the page. Output looks like Visit [...]]]></description>
			<content:encoded><![CDATA[<p>Inspired by a <a href="http://twitter.com/#!/infoarbitrage/status/42611272902115328">tweet</a> from <a href="http://twitter.com/#!/infoarbitrage">Roger Ehrenberg</a> and my 11-year-old son who&#8217;s crazy about Chuck Norris facts, I screen-scraped the contents of <a href="http://www.chucknorrisfacts.com/" target="_blank">http://www.chucknorrisfacts.com</a>. Code and data can be found <a href="https://github.com/neilkod/chucknorrisfacts">here</a>.</p>
<p>Written in Python with <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, the script simply loops through all of the pages on <a title="www.chucknorrisfacts.com" href="http://www.chucknorrisfacts.com" target="_blank">http://www.chucknorrisfacts.com</a> and reads the items displayed on each page.</p>
<pre class="brush: python; title: ; notranslate">

#!/usr/bin/python
import urllib2, time
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

# 674 pages last time I checked. Oddly enough, their pages seem zero-based. Additionally, if you
# substitute an arbitrary number, outside of the range of pages, you'll get data back instead
# of 404. I'm not sure why they're doing this.
for page_num in range(0,674):
	url = 'http://www.chucknorrisfacts.com/all-chuck-norris-facts?page=%d' % page_num
	html = urllib2.urlopen(url)
	soup = BeautifulSoup(html)

	entries = soup.findAll(&quot;li&quot;,&quot;views-row&quot;)
	for entry in entries:

		# use BeautifulStoneSoup to remove any HTML-escaped text that BS returns.
		the_quote = BeautifulStoneSoup(entry.div.text,
		                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

		# print it to stdout. I just redirect the program's output to a file.
		print the_quote.encode('utf-8')
	# be a good citizen and wait a few seconds before visiting the next page
	time.sleep(6)
</pre>
<p>Output looks like</p>
<pre class="brush: plain; title: ; notranslate">
if the mountain won't come to Muhammad, Chuck Norris will bring it.
if you watch the ring you die in 7 days,if you look at Ckuck Norris you die instantly
in a real zombie apocalypse, Chuck Norris can roundhouse-kick 53,596 zombies dead.
in space no-one can hear you scream.....except chuck norris!
iphone 4? chuck norris has iphone 8
most kids pee their name into snow... Chuck Norris pisses his in concreate...
never say you can'thurt a fly to chuck norris because he will hurt you
new never-before-seen behind-the-scenes shots from Walker Texas Range shows Chuck Norris carrying his truck home after it broke down
no one has ever found where the smurfs live thats  becuase they live  in chuck norrises beard
...
</pre>
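<p>The script above depends on BeautifulSoup 3 and Python 2. As a dependency-free illustration of the same extraction step, here is a Python 3 sketch built on the standard library&#8217;s html.parser. The class and function names are made up, and it collects all of the text inside each views-row list item rather than only the first div, so treat it as an approximation of the approach, not a port.</p>

```python
from html.parser import HTMLParser


class ViewsRowParser(HTMLParser):
    """Collect the text inside each li element with class views-row."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities for us
        self.depth = 0      # nesting depth of li tags inside the current row
        self.parts = []     # text fragments of the row being read
        self.rows = []      # finished rows

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested list item inside a row
        elif "views-row" in classes:
            self.depth = 1           # a new row starts here
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "li" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.rows.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_rows(html_text):
    """Return the text of every views-row item found in html_text."""
    parser = ViewsRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

<p>Calling extract_rows() on the HTML of one listing page would return one string per fact, with entities such as &amp;amp; already decoded.</p>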
<p>Visit my <a href="https://github.com/neilkod/chucknorrisfacts" target="_blank">GitHub</a> for the full dataset (5500 entries).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/03/chuck-norris-doesnt-screen-scrape-the-data-runs-scared-to-his-hard-drive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizations of Canabalt scores scraped from twitter</title>
		<link>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/</link>
		<comments>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 22:56:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[canabalt]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=479</guid>
		<description><![CDATA[Canabalt, a ridiculously addicting web/IOS-device game allows one to show off their high scores, and their not-so-high scores to Twitter. Each of these tweets contains a bit of information &#8211; The score represented in meters, the method of death (hitting a wall and tumbling to my death) and the device (iPhone). Other useful information can [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.canabalt.com/">Canabalt</a>, a ridiculously addictive web/iOS game, lets players post their high scores, and their <a href="http://twitter.com/#!/neilkod/status/37964035903324160">not-so-high scores</a>, to Twitter.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore.png"><img class="alignnone size-medium wp-image-485" title="canabaltscore" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore-300x105.png" alt="" width="300" height="105" /></a></p>
<p>Each of these tweets contains a bit of information: the score in meters, the method of death (hitting a wall and tumbling to my death), and the device (iPhone). Other useful information can easily be extracted, such as the date and time played and information about the user (name, location, friend count, follower count, etc.). Over the next few weeks I aim to see which features, if any, have any influence on Canabalt scores.</p>
<p>The first thing I needed to do was capture the tweeted Canabalt scores. I have a process running on an EC2 micro instance that downloads tweets from the Twitter Streaming API based on certain key words, one of them being canabalt. The process loads each matching tweet into a MongoDB instance hosted on <a href="http://www.mongohq.com">MongoHQ.com</a>.</p>
<pre class="brush: bash; title: ; notranslate">

curl -s -u $TWITTER_USERNAME:$TWITTER_PASSWORD -d @/home/ec2-user/trackingkeywords http://stream.twitter.com/1/statuses/filter.json |/home/ec2-user/mongodb/bin/mongoimport &amp;
</pre>
<p>Here, trackingkeywords is a file containing a comma-separated list of keywords that I track on Twitter. I left the connection details out of the mongoimport command; you&#8217;ll need to pass it a host, port, database, and collection.</p>
<p>I then run some Python code to query the MongoDB instance and retrieve tweets mentioning Canabalt, based on a simple regular expression. I&#8217;m expecting the tweet to begin with &#8216;I ran&#8217; and contain the word canabalt. Pretty naive, but it worked fine. If a tweet isn&#8217;t a true Canabalt score, I&#8217;ll be able to tell in no time. From there, I use regular expressions to extract (for now) the score, the method of death, and the device name.</p>
<pre class="brush: python; title: ; notranslate">
import re

# create_connection() and strip_text() are helpers defined elsewhere
def canabalt_tweets():

	# connect to MongoDB
	tweets = create_connection(False)

	# regular expression to extract components of a canabalt score
	canabalt_regexp = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')

	# regular expression to match tweets that begin with I ran and mention canabalt
	regexp = re.compile('^I ran .*canabalt')

	# create a MongoDB cursor(query)
	cur = tweets.conftweets.find({'text': regexp}, {'text': 1})

	# iterate through the cursor. If a tweet fits the pattern, print it.
	for item in cur:
		match = canabalt_regexp.search(item['text'])
		# skip tweets that matched the query but aren't real score tweets
		if match is None:
			continue
		(score,death,device) = match.groups()
		print ','.join([strip_text(score),strip_text(death),strip_text(device)])
</pre>
<p>The strip_text() function is part of my data-tools Bat-Utility Belt; it cleans text by removing leading/trailing spaces, CRLFs, tabs, and some other junk.</p>
<p>We now have some comma-separated data in this shape</p>
<pre class="brush: plain; title: ; notranslate">
score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
</pre>
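<p>The extraction step that produces rows like these can be tried in isolation with the Python 3 sketch below. The regular expression is the one from the function above; the strip_text() here is only a guessed stand-in for the real helper (which isn&#8217;t shown in this post), and the sample tweet is hypothetical text shaped to fit the pattern.</p>

```python
import re

# Same pattern as in the function above.
CANABALT_RE = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')


def strip_text(text):
    """Stand-in for the real helper: trims and collapses whitespace."""
    return " ".join(text.split())


def parse_score(tweet):
    """Return (score, death, device) or None if the tweet doesn't fit."""
    match = CANABALT_RE.search(tweet)
    if match is None:
        return None
    score, death, device = match.groups()
    return int(score), strip_text(death), strip_text(device)


# Hypothetical tweet text, shaped to fit the pattern.
tweet = ("I ran 2860m before hitting a wall and tumbling to my death "
         "on my iPhone. http://example.com #canabalt")
print(parse_score(tweet))
print(parse_score("I ran out of coffee while playing canabalt"))
```

<p>The first call returns a tuple matching the first data row above; the second returns None because the text doesn&#8217;t contain a score.</p>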
<p>Now for some more fun &#8211; visualization and analysis. This is performed in R because, well, R is awesome. That, and I really need some more practice with R.</p>
<p>To date, I&#8217;ve collected just over 1200 Canabalt &#8216;events&#8217;. I will likely turn this into a web app if there&#8217;s enough interest.</p>
<p>A couple of summaries:</p>
<p>scores by device type:</p>
<pre class="brush: plain; title: ; notranslate">
      device count mean stddev median   max min range
      iPhone   735 4491   3882 3419.0 36332 102 36230
        iPad   284 4723   3884 4041.5 40630 104 40526
  iPod touch   189 3734   3644 2713.0 28024 102 27922
</pre>
<p>scores by type of death:</p>
<pre class="brush: plain; title: ; notranslate">
                                            death count mean stddev median   max  min range
          hitting a wall and tumbling to my death   684 4155   3481 3319.5 36332  102 36230
                           missing another window   243 5898   4981 4486.0 40630  409 40221
                         turning into a fine mist    86 3592   2698 2662.5 16441  614 15827
            colliding with some enormous obstacle    40 4768   4247 3256.5 16933  433 16500
                              falling to my death    37 4176   3160 3619.0 13573  567 13006
                       missing a crane completely    22 2950   1774 2923.5  7883  381  7502
                         knocking a building down    21 3399   2267 2849.0  8374  336  8038
                   not quite reaching a billboard    19 3098   1244 2980.0  5772  444  5328
              landing where a building used to be    17 4804   4970 3631.0 22685 1170 21515
          somehow hitting the edge of a billboard    14 5991   3827 5518.5 13547  566 12981
   just barely stumbling out of the first hallway    13  104      1  104.0   104  102     2
              somehow hitting the edge of a crane     7 5497   4835 4942.0 13275  510 12765
       riding a falling building all the way down     4 4278   2162 4195.5  6993 1727  5266
           completely  missing the entire hallway     1 1046     NA 1046.0  1046 1046     0
</pre>
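<p>The summaries above were produced with plyr in R. For readers who would rather stay in Python, here is a minimal sketch of the same group-by idea. It runs only on the nine sample rows shown earlier in this post, not the full data set, and the function name <code>summarize</code> is made up for illustration.</p>

```python
import csv
import io
import statistics
from collections import defaultdict

# The nine sample rows shown earlier in this post (score, death, device).
SAMPLE = """score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
"""


def summarize(rows, key):
    """Group scores by the given column and compute summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(int(row['score']))
    summary = {}
    for name, scores in groups.items():
        summary[name] = {
            'count': len(scores),
            'mean': statistics.mean(scores),
            'median': statistics.median(scores),
            'max': max(scores),
            'min': min(scores),
            'range': max(scores) - min(scores),
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_device = summarize(rows, 'device')

# Largest group first, like the tables above.
for device, stats in sorted(by_device.items(), key=lambda kv: -kv[1]['count']):
    print(device, stats)
```

<p>Passing <code>'death'</code> instead of <code>'device'</code> as the key gives the second style of summary.</p>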
<p>And now, in the spirit of killing the almighty data-ink ratio, here are some pictures:<br />
<img class="alignnone size-full wp-image-503" title="overall plot of scores" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscores.png" alt="plot of scores" width="619" height="630" /></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11.png"><img class="alignnone size-large wp-image-505" title="by death faceted by device type" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11-1024x779.png" alt="by death faceted by device type" width="717" height="545" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png"><img class="alignnone size-full wp-image-507" title="scores by device" src="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png" alt="scores by device" width="534" height="539" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype.png"><img class="alignnone size-large wp-image-509" title="bydeathtype" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype-1024x641.png" alt="by death type" width="614" height="385" /></a></p>
<p>What have we learned? My data set isn&#8217;t all that large (1200 events), but it may be enough to make some basic observations and assumptions (corrections, please!). Going into this experiment, I thought that iPad players would have generally higher scores, for two reasons: the larger screen, and the fact that iPad players wouldn&#8217;t necessarily be playing &#8216;on-the-go&#8217; as they would be (I know I am) on an iPhone or iPod touch. So far the data agrees: the iPad has higher median and mean scores than the other devices. I&#8217;d like to revisit this as I collect more data.</p>
<p>The leading cause of Canabalt death, by far, is hitting a wall and tumbling to one&#8217;s death (684 of the 1,208 events). This surprised me, as I thought it would be falling to my death &#8211; that&#8217;s how my Canabalt games seem to end.</p>
<p>I&#8217;d like to hear your comments, suggestions for new analysis, and most of all, your corrections.  You know who you are, and this is how I learn. The data and Python/R source can be found on <a href="https://github.com/neilkod/canabalt">github</a>.</p>
<p>The stack: Twitter Streaming API, EC2, MongoDB, Python, Regular Expressions, R</p>
<p>Things I learned working on this: <a href="http://had.co.nz/plyr/">plyr</a> (group-by and aggregation in R), sorting data frames in R, and a couple of new <a href="http://had.co.nz/ggplot2/">ggplot2</a> tricks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Word Cloud from 6,500 tweets mentioning Kanye West.  From this morning</title>
		<link>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/</link>
		<comments>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/#comments</comments>
		<pubDate>Tue, 14 Dec 2010 22:25:32 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[kanye]]></category>
		<category><![CDATA[kanyewest]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=403</guid>
		<description><![CDATA[After removing a few stopwords and then clearing out a few other words (nowplaying, lastfm, and the like), here&#8217;s what&#8217;s left.  The data represents a half-day&#8217;s worth of tweets.   I&#8217;m sitting on about 90,000 tweets about Kanye and am looking forward to taking the time for some more in-depth analysis.  Huge thanks to @jrlevine and [...]]]></description>
			<content:encoded><![CDATA[<p>After removing a few <a href="http://www.neilkodner.com/stopwords.txt">stopwords</a> and then clearing out a few other words (nowplaying, lastfm, and the like), here&#8217;s what&#8217;s left.  The <a href="http://www.neilkodner.com/kanyetoday.txt">data</a> represents a half-day&#8217;s worth of tweets.   I&#8217;m sitting on about 90,000 tweets about Kanye and am looking forward to taking the time for some more in-depth analysis.  Huge thanks to <a href="http://www.twitter.com/#!/jrlevine">@jrlevine</a> and <a href="http://www.twitter.com/#!/alexmr">@alexmr</a> from <a href="http://www.twordsie.com">twordsie.com</a> for curating the awesome stopwords list, which I found in their <a href="https://github.com/jakelevine/twordsie">github project</a>.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2010/12/kanye-word-cloud.png"><img class="alignnone size-large wp-image-404" title="kanye word cloud" src="http://www.neilkodner.com/wp-content/uploads/2010/12/kanye-word-cloud-1024x447.png" alt="kanye west tweets word cloud" width="1024" height="447" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Words mentioned in 23-Jun-2010 Canadian Earthquake tweets</title>
		<link>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/</link>
		<comments>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/#comments</comments>
		<pubDate>Thu, 24 Jun 2010 16:15:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[earthquake]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=256</guid>
		<description><![CDATA[Using twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping, some reducing, and voila! The most frequently-occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words. [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption alignnone" style="width: 1011px"><img class=" " title="words mentioned in earthquake tweets 23-jun-2010" src="http://www.neilkodner.com/images/littlesnapper/words%20mentioned%20in%20yesterdays%20earthquake%20tweets.png" alt="" width="1001" height="599" /><p class="wp-caption-text">words mentioned in earthquake tweets 23-jun-2010 </p></div>
<p>Using twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping, some reducing, and voila! The most frequently-occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words.  I find it amazing that the most frequently occurring &#8216;word&#8217; is RT.</p>
<p>Also, wordle seemed to strip out numeric &#8216;words&#8217;, which is a shame because people tweeted the magnitude left and right.  See the data below for the top 100 words.</p>
<p><span id="more-256"></span></p>
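<p>For the curious, here is a minimal sketch of the stopword-removal, mapping, and reducing steps described above. It is not the code I ran: the stopword set and the three tweets are made up for illustration, and the function names <code>map_tweet</code> and <code>reduce_counts</code> are hypothetical. Note that this sketch strips every punctuation character, so a magnitude like 5.5 would come out as 55, unlike in the data below.</p>

```python
import string
from collections import Counter
from functools import reduce

# Hypothetical stopword list and tweets, for illustration only.
STOPWORDS = {'a', 'the', 'i', 'in', 'was', 'that', 'just', 'an', 'did', 'you'}
TWEETS = [
    'RT @someone: Earthquake in Toronto!',
    'Did you feel that? I just felt an earthquake in Ottawa',
    'wow, felt the earthquake in Toronto',
]

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans('', '', string.punctuation)


def map_tweet(tweet):
    """Map step: lowercase, strip punctuation, drop stopwords, count words."""
    cleaned = tweet.lower().translate(STRIP_PUNCTUATION)
    return Counter(w for w in cleaned.split() if w not in STOPWORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


totals = reduce(reduce_counts, map(map_tweet, TWEETS), Counter())

# Print in the same ascending word:count format as the list below.
for word, count in sorted(totals.items(), key=lambda kv: (kv[1], kv[0])):
    print('%s:%d' % (word, count))
```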
<pre>
<div id="_mcePaste">
<pre>
survey:79
hey:79
seriously:80
preliminary:81
info:81
strikes:81
hell:82
4.5:82
gta:82
geological:83
magnitude50:83
check:83
call:84
service:86
globeandmail:86
triggered:87
experience:88
video:89
earthquakes:90
guess:91
caused:92
pm:92
fuck:93
ground:94
bad:94
5.7:94
move:96
minutes:98
2.3:98
damn:99
mini:99
eastern:100
scared:100
will:100
philippec:101
cool:103
northern:104
live:106
struck:107
city:112
work:112
floor:113
epicenter:113
pretty:113
nyc:113
seconds:113
todays:113
afternoon:114
feeling:116
safe:116
haha:117
central:117
tremor:117
whoa:118
downtown:120
rattles:121
warning:121
damage:122
separating:125
guys:126
god:126
tweets:129
heard:129
tremors:130
north:131
rochester:131
earth:134
small:135
california:136
missed:137
cp24:138
finally:139
minor:139
fake:141
scary:141
detroit:143
big:146
breaking:147
coming:151
good:152
weird:156
area:159
experienced:161
lake:161
buffalo:163
happened:163
2010:164
survived:168
hope:172
thing:172
evacuated:173
canadian:176
cleveland:176
tsunami:177
region:178
office:182
shake:184
reported:185
ontarioquebec:187
buildings:193
hits:194
ago:200
house:200
york:200
ohio:205
going:207
border:213
shit:214
usgs:215
quake:216
michigan:217
5.0:222
omg:238
time:240
reports:241
holy:241
southern:242
shook:249
twitter:252
day:253
crazy:255
shakes:270
people:276
wtf:308
building:309
ny:355
thought:362
tornado:375
montreal:375
hit:388
lol:406
quebec:414
wow:418
shaking:423
news:433
g20:490
today:577
magnitude:612
ontario:781
5.5:988
ottawa:1046
canada:1373
feel:1439
toronto:2086
felt:2146
rt:4046
earthquake:14918</pre>
</div>
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>An analysis of Oracle errors in the leaked 9/11 Pager Data</title>
		<link>http://www.neilkodner.com/2009/11/an-analysis-of-oracle-errors-in-the-leaked-911-pager-data/</link>
		<comments>http://www.neilkodner.com/2009/11/an-analysis-of-oracle-errors-in-the-leaked-911-pager-data/#comments</comments>
		<pubDate>Tue, 01 Dec 2009 01:59:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[regexp]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=84</guid>
		<description><![CDATA[Yes, you read that correctly. Here&#8217;s how it started: I&#8217;m working on some text analysis in Python and was looking for some test data. Someone recommended I use the 9/11 Pager Data from Wikileaks. I downloaded the data, ran my program against it (which is the subject of another post) and all was well. Got [...]]]></description>
			<content:encoded><![CDATA[<p>Yes, you read that correctly.  Here&#8217;s how it started:</p>
<p>I&#8217;m working on some text analysis in Python and was looking for some test data.  Someone recommended I use the <a href="http://911.wikileaks.org/">9/11 Pager Data from Wikileaks</a>.  I downloaded the data, ran my program against it (which is the subject of another post) and all was well.  Got some great insight and I&#8217;ll share that later.</p>
<p>I then started browsing the raw data in vi.  After paging down a few times, what did I see?</p>
<p><img src="http://www.neilkodner.com/wp-content/uploads/2009/11/oracle-error.jpg" alt="oracle error" title="oracle error" width="808" height="48" class="alignnone size-full wp-image-85" /></p>
<p>Paging down some more yielded this gem:</p>
<p><img src="http://www.neilkodner.com/wp-content/uploads/2009/11/another-error.jpg" alt="another error" title="another error" width="849" height="143" class="alignnone size-full wp-image-88" /></p>
<p>The gears are now spinning&#8230;<br />
<a href="http://twitter.com/neilkod/status/6219088850"><img src="http://www.neilkodner.com/wp-content/uploads/2009/11/dork.jpg" alt="dork" title="dork" width="584" height="185" class="alignnone size-full wp-image-101" /></a></p>
<p>I wondered how many of these Oracle errors polluted the NYC messaging system.  Let&#8217;s find out &#8211; Python to the rescue!</p>
<pre class="brush: plain; title: ; notranslate">
Error		Frequency		Description
ORA-00255	1	error archiving log %s of thread %s, sequence # %s
ORA-00333	1	redo log read error block %s count %s
ORA-00334	1	archived log: '%s'
ORA-01035	1	ORACLE only available to users with RESTRICTED SESSION privilege
ORA-01089	1	immediate shutdown in progress - no operations are permitted
ORA-01401	1	inserted value too large for column
ORA-01410	1	invalid ROWID
ORA-01652	1	unable to extend temp segment by %s in tablespace %s
ORA-01722	1	invalid number
ORA-02050	1	transaction %s rolled back, some remote DBs may be in-doubt
ORA-02068	1	following severe error from %s%s
ORA-03114	1	not connected to ORACLE
ORA-1146	1	cannot start online backup - file %s is already in backup
ORA-12154	1	TNS:could not resolve the connect identifier specified
ORA-1534	1	rollback segment '%s' doesn't exist
ORA-1537	1	cannot add file '%s' - file already part of database
ORA-1553	1	MAXEXTENTS must be no smaller than the %s extents currently allocated
ORA-1593	1	command no longer valid, see ALTER USER
ORA-19502	1	write error on file &quot;%s&quot;, blockno %s (blocksize=%s)
ORA-20012	1	User-defined
ORA-24324	1	service handle not initialized
ORA-27063	1	number of bytes read/written is incorrect
ORA-7445	1	exception encountered: core dump [%s] [%s] [%s]
ORA-00312	2	online log %s thread %s: '%s'
ORA-10		2	no data found
ORA-11		2	invalid value %s for attribute %s, must be between %s and %s
ORA-16038	2	log %s sequence# %s cannot be archived
ORA-20000	2	The stored procedure 'raise_application_error'
ORA-301		2	error in adding log file '%s' - file cannot be created
ORA-959		2	tablespace '%s' does not exist
ORA-00060	3	deadlock detected while waiting for resource
ORA-07445	3	exception encountered: core dump [%s] [%s] [%s]
ORA-12012	3	error on auto execute of job %s
ORA-00600	4	internal error code, arguments: [%s], [%s], [%s]
ORA-1652	4	unable to extend temp segment by %s in tablespace %s
ORA-00917	10	missing comma
ORA-01013	12	user requested cancel of current operation
ORA-1650	12	unable to extend rollback segment %s by %s in tablespace %s
ORA-20011	21	User-defined error: Execute_system: Err
ORA-1142	33	cannot end online backup - none of the files are in backup
</pre>
<p>Final analysis?  Where can I send my resume? </p>
<p>The Python code is simple &#8211; it loops through each line of the 49MB file (448k lines) and checks for an Oracle error using the regexp ORA-[0-9]{1,5}, which I intended to mean the letters ORA, followed by a dash, followed by between one and five digits.  Please feel free to correct/improve my regex-fu.  If a match is found, the first error code on the line is added to a dictionary as the key, with a count of one as the value.  If the key is already present in the dictionary, its value is incremented.  Finally, the contents of the dictionary are displayed, sorted by value (frequency).</p>
<pre class="brush: python; title: ; notranslate">
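#!/usr/bin/env python3
import re

# the letters ORA, a dash, then between one and five digits
pattern = re.compile(r'ORA-[0-9]{1,5}')
errors = {}

# loop through each line and count the first Oracle error found on it.
# errors='replace' keeps stray non-UTF-8 bytes from stopping the run.
with open('messages_all.txt', errors='replace') as f:
    for line in f:
        err = pattern.findall(line)
        if err:
            errors[err[0]] = errors.get(err[0], 0) + 1

# display the errors sorted by frequency, then by error code
for k, v in sorted(errors.items(), key=lambda kv: (kv[1], kv[0])):
    print('%s\t%d' % (k, v))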
</pre>
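<p>One thing the table shows: ORA-1652 and ORA-01652 are counted separately, as are ORA-7445 and ORA-07445, because the pattern captures the digits exactly as they were written. A possible refinement, sketched here as a suggestion rather than what the script above does, is to zero-pad every code to five digits and to count every match on a line instead of only the first. The function name <code>count_errors</code> and the sample lines are made up for illustration.</p>

```python
import re
from collections import Counter

# Same pattern as above, with a group around the digits.
PATTERN = re.compile(r'ORA-([0-9]{1,5})')


def count_errors(lines):
    """Count Oracle error codes, treating ORA-1652 and ORA-01652 as the same."""
    counts = Counter()
    for line in lines:
        # findall returns just the captured digits for every match on the line
        for digits in PATTERN.findall(line):
            counts['ORA-%05d' % int(digits)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    'ORA-1652: unable to extend temp segment',
    'ORA-01652: unable to extend temp segment',
    'ORA-00600: internal error code; ORA-07445: exception encountered',
]

for code, n in sorted(count_errors(sample).items(), key=lambda kv: (kv[1], kv[0])):
    print('%s\t%d' % (code, n))
```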
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2009/11/an-analysis-of-oracle-errors-in-the-leaked-911-pager-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Generating multiple Oracle TKPROF reports using Python</title>
		<link>http://www.neilkodner.com/2009/11/generating-multiple-oracle-tkprof-reports-using-python/</link>
		<comments>http://www.neilkodner.com/2009/11/generating-multiple-oracle-tkprof-reports-using-python/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 17:44:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[dba]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=76</guid>
		<description><![CDATA[Recently, a customer told me that they felt a batch job was taking too long each night, so I gave them a few commands to add to their nightly run. These commands named the tracefile and enabled 10046 logging. Since I&#8217;m lazy (the good kind), I figured I&#8217;d use Python to build the commands to run TKPROF [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, a customer told me that they felt a batch job was taking too long each night, so I gave them a few commands to add to their nightly run.</p>
<pre class="brush: sql; title: ; notranslate">
alter session set tracefile_identifier='charging_batch';
exec dbms_monitor.session_trace_enable;
</pre>
<p>These commands named the tracefile and enabled 10046 logging.</p>
<p>Since I&#8217;m lazy (the good kind), I figured I&#8217;d use Python to build the commands to run TKPROF for each process.  The program expects to be run from the udump directory.  As I get more time I&#8217;ll enhance it to automatically grab the location of udump from the database.</p>
<p>The script takes an optional parameter for a tracefile identifier.  If the parameter is passed, only filenames containing the identifier text will be processed.  Otherwise, all tracefiles are processed.  A to-do item is to make sure the tracefile is an actual 10046 trace before running TKPROF against it.</p>
<p>Output files are named identifier_processid.out, where the tracefile identifier prefix appears only if one was supplied.  The file suffix can be overridden with the variable tkprof_suffix.  I use .out as an homage to Michael Levy, wherever he may be, who showed me how to use the tool way back in 1998.</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/env python3
# tkprof.py
# kodner 2009
# finds the .trc tracefiles in the current directory
# and runs a simple tkprof on them
import sys
import os
import re

sort = &quot;fchqry&quot;  # parameterize this
tkprof_suffix = 'out'  # this too

# find a string to be used as a tracefile identifier
# to limit the tracefiles processed
try:
    tracefile_identifier = sys.argv[1]
    print(&quot;\n\ntracefile identifier supplied is: %s\n\n&quot; % tracefile_identifier)
except IndexError:
    # no identifier was passed on the command line
    tracefile_identifier = None

# list the files in the current directory that have the suffix .trc
traces = [x for x in os.listdir('.') if x.endswith('.trc')]

for file in traces:
    tracefile = None

    # extract the process id from the filename.
    # we will assume that the tracefile name is in the format
    # $ORACLE_SID_ora_$PROCESSNUM.trc
    # and that the tracefile name may contain a tracefile identifier
    # set by using alter session set tracefile_identifier = 'foo';
    # the process id will be used to name the tkprof output file.
    match = re.search(r'ora_([0-9]+)', file)
    if not match:
        continue  # the filename has no ora_ process id; skip it
    processNum = match.group(1)

    # if a tracefile_identifier is supplied then make sure our current file
    # contains the string.  we'll also make sure the output filename contains
    # the tracefile identifier.
    if tracefile_identifier:
        if tracefile_identifier in file:
            tracefile = file
            outputfile = tracefile_identifier + '_' + processNum + '.' + tkprof_suffix
    else:
        tracefile = file
        outputfile = processNum + '.' + tkprof_suffix

    if tracefile:
        print(&quot;processing tracefile %s ...&quot; % tracefile)

        # generate the tkprof command using flags sys=no and waits=yes
        command = &quot;tkprof %s %s sys=no waits=yes sort=%s&quot; % (tracefile, outputfile, sort)

        # execute the command
        os.system(command)
</pre>
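<p>If I were to revisit this, one option would be to pull the filename logic into a small function that returns the TKPROF arguments as a list, so it can be checked without actually running anything. This is only a sketch of that idea: the function name <code>build_tkprof_command</code> and the example tracefile names are made up for illustration, and the flags are the same ones the script above uses.</p>

```python
import re


def build_tkprof_command(tracefile, identifier=None, sort='fchqry', suffix='out'):
    """Return the tkprof argument list for one tracefile, or None to skip it."""
    # the tracefile name is assumed to look like $ORACLE_SID_ora_$PROCESSNUM.trc
    match = re.search(r'ora_([0-9]+)', tracefile)
    if match is None:
        return None
    # when an identifier is supplied, only matching tracefiles are processed
    if identifier and identifier not in tracefile:
        return None
    process_num = match.group(1)
    prefix = identifier + '_' if identifier else ''
    outputfile = '%s%s.%s' % (prefix, process_num, suffix)
    return ['tkprof', tracefile, outputfile, 'sys=no', 'waits=yes', 'sort=' + sort]


# Hypothetical tracefile names, for illustration only.
print(build_tkprof_command('orcl_ora_12345_charging_batch.trc', 'charging_batch'))
print(build_tkprof_command('orcl_ora_678.trc'))
print(build_tkprof_command('orcl_ora_678.trc', 'charging_batch'))
```

<p>A list like this can be handed to <code>subprocess.call</code>, which avoids building a shell command string by hand.</p>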
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2009/11/generating-multiple-oracle-tkprof-reports-using-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

