<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>neilkodner.com &#187; analysis</title>
	<atom:link href="http://www.neilkodner.com/tag/analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.neilkodner.com</link>
	<description>Data Driven.  Since 1971.</description>
	<lastBuildDate>Sun, 23 Oct 2011 16:40:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>An analysis of Steve Jobs tribute messages displayed by Apple</title>
		<link>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/</link>
		<comments>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 21:08:26 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[datamining]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[nltk]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[stevejobs]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=569</guid>
		<description><![CDATA[Two weeks have passed since Apple&#8217;s Co-Founder/CEO Steve Jobs passed away.  Upon his passing, Apple encouraged people to share their memories, thoughts, and feelings by emailing rememberingsteve@apple.com. Earlier this week, Apple posted a site (http://www.apple.com/stevejobs) in tribute to Steve Jobs. According to the site, over a million people have submitted messages. The site cycles through the submitted [...]]]></description>
			<content:encoded><![CDATA[<p>Two weeks have passed since Apple&#8217;s Co-Founder/CEO Steve Jobs passed away.  Upon his passing, Apple encouraged people to share their memories, thoughts, and feelings by emailing <a href="https://mail.google.com/mail/?view=cm&amp;fs=1&amp;tf=1&amp;to=rememberingsteve@apple.com" target="_blank">rememberingsteve@apple.com</a>. Earlier this week, Apple posted a <a href="http://www.apple.com/stevejobs/" target="_blank">site</a> (<a href="http://www.apple.com/stevejobs/" target="_blank">http://www.apple.com/stevejobs</a>) in tribute to Steve Jobs. According to the site, over a million people have submitted messages. The site cycles through the submitted messages.</p>
<p>I decided to take a closer look at what people are saying about Steve Jobs, as a whole. Looking at how the site updates, it appears to use Ajax to retrieve and display new messages. Using Chrome&#8217;s developer tools, I monitored the requests it was making to get the new messages.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/10/Apple-Remembering-Steve-Jobs2.png"><img class="alignnone size-large wp-image-574" title="Apple - Remembering Steve Jobs" src="http://www.neilkodner.com/wp-content/uploads/2011/10/Apple-Remembering-Steve-Jobs2-1024x892.png" alt="" width="819" height="714" /></a><br />
Once I found the location of the individual messages, it was trivial to download all of them. The message endpoint URLs are in the format</p>
<pre class="brush: xml; title: ; notranslate">

http://www.apple.com/stevejobs/messages/3679.json?28106802
</pre>
<p>and a sample message looks like</p>
<pre class="brush: jscript; title: ; notranslate">
{
mainText: &quot;This is equivalent to my mom's generation of Elvis dying for me. I am very
sadden and emotionally moved at the moment. He was more influential on my
life than my parents and friends. While my parents loved me and friends
shared fun times. Steve influenced me, motivated me to become the innovated,
creative technologist I have become. I got into computer technology in 1980
and moved to Silicon Valley because of him. I have been one of his biggest
admirers and looked to him as a mentor to push the boundaries of my own
creative abilities to develop technology solutions which I hope made a
difference and impact to the industries I worked in. We've lost a
significant influence and icon in technology. We won't see another person of
his innovation and foresight within my life time. He was the Edison of
technology. He was and is one of my biggest inspirations.

I feel I have lost a close family member&quot;
header: &quot;What Steve Jobs meant to me&quot;
author: &quot;Skip&quot;
location: &quot;&quot;
}
</pre>
<p>The site makes a request to <a href="http://www.apple.com/stevejobs/messages/main.json" target="_blank">http://www.apple.com/stevejobs/messages/main.json</a> which returns</p>
<pre class="brush: jscript; title: ; notranslate">
 {
 totalMessages: &quot;10975&quot;
 timestamp: &quot;28106802&quot;
 }
</pre>
<p>So it appears that the site cycles through 10975 messages. I didn&#8217;t dig through the JavaScript powering the site to confirm this; I just made an assumption. I tried querying values greater than 10975 and they returned 404. I wrote a quick Python program to download the messages:</p>
<pre class="brush: python; title: ; notranslate">
#!/usr/bin/python
import urllib2
import simplejson as json
import time
import codecs

# a page on apple's site shows the # of messages available
# start with 0 and retrieve up to message_range messages
metadata = json.loads(urllib2.urlopen('http://www.apple.com/stevejobs/messages/main.json').read())
# totalMessages arrives as a string (e.g. '10975'), so convert it before handing it to range()
message_range = int(metadata['totalMessages'])

# the url for each message. i learned of this url by inspecting
# the network calls to http://www.apple.com/stevejobs
# using chrome's developer tools
url=&quot;http://www.apple.com/stevejobs/messages/%d.json&quot;

# create our destination file
# i'm using codecs because it does a better job at handling international characters
output_file = 'stevejobs_tribute.txt'
file_handle = codecs.open(output_file,'w','utf-8')

# helper function to remove tabs and linefeeds
def clean(txt):
  return txt.replace('\n','').replace('\t','')

# iterate from 0 to the max # of messages and download the message text
# for these purposes, I'm ignoring the other fields as they weren't always present
for i in range(0, message_range):
  req = url % i
  data = urllib2.urlopen(req).read()
  data = json.loads(data)
  file_handle.write(clean(data['mainText']) + '\n')
file_handle.close()
</pre>
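<p>The script above targets Python 2 (urllib2 and simplejson). For anyone following along on Python 3, here is a minimal sketch of the same loop. It assumes the endpoints still answer the way they did in October 2011, which may no longer be true, and the names fetch_json and download_messages are illustrative placeholders rather than anything from the original script. The fetch function is passed in as a parameter so the loop can be tried against canned data without touching the network.</p>
<pre class="brush: python; title: ; notranslate">
import json
import urllib.request

BASE = 'http://www.apple.com/stevejobs/messages/'

def fetch_json(url):
  # download one URL and decode the JSON body
  with urllib.request.urlopen(url) as resp:
    return json.loads(resp.read().decode('utf-8'))

def clean(txt):
  # same idea as the helper above: drop tabs and linefeeds
  return txt.replace('\n', '').replace('\t', '')

def download_messages(fetch=fetch_json):
  # main.json reports the count as a string, so convert it first
  total = int(fetch(BASE + 'main.json')['totalMessages'])
  for i in range(total):
    # assumption: a message without mainText is treated as an empty line
    yield clean(fetch(BASE + '%d.json' % i).get('mainText', ''))
</pre>
<p>Writing each yielded string, followed by a linefeed, to a file opened with encoding='utf-8' gives the same one-message-per-line layout as stevejobs_tribute.txt.</p>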
<p><span style="direction: ltr;">So now, we have over ten thousand tribute messages saved to the file <a href="https://github.com/neilkod/steve_jobs_tribute_messages/tree/master/data">stevejobs_tribute.txt</a>. What I was most interested in was seeing how many of these messages contain a reference to a particular Apple product.</span><br />
I came up with a few search terms based on some legendary Apple product names including</p>
<ul>
<li>Newton</li>
<li>Macintosh</li>
<li>MacBook</li>
<li>iBook</li>
<li>Mac</li>
<li>iPhone</li>
<li>iPod</li>
<li>iMac</li>
<li>iPad</li>
<li>Apple II family</li>
<li>OSX</li>
<li>iMovie</li>
<li>Apple TV</li>
<li>iTunes</li>
<li>LaserWriter (yes, <a href="http://en.wikipedia.org/wiki/LaserWriter" target="_blank">Laserwriter</a>)</li>
</ul>
<div>Each product received an entry in a Python dictionary. The value is another dictionary containing a regex for the product name and a count for the running totals. Some of the regular expressions are as simple as testing for an optional s at the end of the product name; some are a little more complex &#8211; see the Apple II regular expression, which tries to match the entire Apple II product line. As I&#8217;m OK but not great with regular expressions, I welcome your corrections.</div>
<pre class="brush: python; title: ; notranslate">
products = {'iPhone':{'regex':'iphones?','count':0},
	'iMac':{'regex':'imacs?','count':0},
	'iPad':{'regex':'ipads?','count':0},
	'iTunes':{'regex':'itunes','count':0},
	'iPod':{'regex':'ipods?','count':0},
	'cube':{'regex':'cubes?','count':0},
	'MacBook':{'regex':'macbooks?','count':0},
	'iBook':{'regex':'ibooks?','count':0},
	'Apple TV':{'regex':'apple ?tvs?','count':0},
	'Apple II Family':{'regex':r'(apple )?(2|ii|\]\[|\/\/)([ce\+|]|gs|s)?[^0-9]', 'count':0},
	'LaserWriter':{'regex':'laserwriters?','count':0},
	'PowerBook':{'regex':'powerbooks?','count':0},
	'Newton':{'regex':'newtons?','count':0},
	'OSX':{'regex':'osx','count':0},
	'iMovie':{'regex':'imovie','count':0},
	'Macintosh':{'regex':'macintosh','count':0},
	'Lisa':{'regex':'lisa','count':0},
	'Mac':{'regex':'mac','count':0},
}
</pre>
<p>Here&#8217;s a screenshot of me testing the Apple II regular expression, using the excellent <a href="http://gskinner.com/RegExr/" target="_blank">Regexr</a>.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/10/apple-2-regex-testing.png"><img class="alignnone size-full wp-image-623" title="apple 2 regex testing" src="http://www.neilkodner.com/wp-content/uploads/2011/10/apple-2-regex-testing.png" alt="" width="424" height="388" /></a></p>
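<p>A caveat worth keeping in mind when reading the counts: re.search matches anywhere in the message, so a pattern like mac also fires on &#8220;imac&#8221;, &#8220;macbook&#8221; and &#8220;machine&#8221;, and the Apple II pattern can fire on a bare &#8220;2&#8221; because its apple prefix is optional. Here is a minimal Python 3 sketch of a stricter variant that uses word boundaries. The patterns and the count_mentions helper are illustrative assumptions, not the ones that produced the numbers in this post, and they have not been run against the real data.</p>
<pre class="brush: python; title: ; notranslate">
import re

# stricter, hypothetical variants of three of the patterns above
STRICT = {
  'Mac': re.compile(r'\bmacs?\b'),
  'Newton': re.compile(r'\bnewtons?\b'),
  # require the word apple, then ii / 2 / ][ / //, then an optional model suffix
  'Apple II Family': re.compile(r'\bapple ?(ii|2|\]\[|//)(gs|[ce+])?(?![0-9a-z])'),
}

def count_mentions(messages, patterns=STRICT):
  # number of messages (not occurrences) that mention each product
  counts = dict.fromkeys(patterns, 0)
  for msg in messages:
    text = msg.lower()
    for name, rx in patterns.items():
      if rx.search(text):
        counts[name] += 1
  return counts

sample = ['My first Mac was an iMac.', 'I loved my Apple //c', 'Bought 2 iPods in 2011']
print(count_mentions(sample))
</pre>
<p>With word boundaries in place, a message that only says iMac no longer counts toward Mac, which makes the per-product totals easier to compare. The trade-off is that spellings nobody anticipated will be missed.</p>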
<p>Overall, out of 10975 messages downloaded (as of now), 2,186, or just under 20%, mentioned an Apple product by name. Here&#8217;s the breakdown of the products mentioned:</p>
<pre class="brush: plain; title: ; notranslate">
LaserWriter        1
iMovie             3
OSX                9
iBook             22
PowerBook         22
Lisa              24
Apple TV          31
Newton            33
iTunes            52
Macintosh        163
iMac             235
MacBook          366
Apple II Family  481
iPad             574
iPod             575
iPhone           875
Mac             1315
</pre>
<p>More than one out of every ten messages included a reference to a Mac! Nearly one in ten mentioned an iPhone &#8211; not bad for a device that&#8217;s been out a fraction of the time the Mac has been available. I&#8217;m pleased to see so many references to the Apple II, including several mentions of the //c, which was my first Apple product.</p>
<p>It&#8217;s also interesting to note that out of 33 mentions of Newton, only a handful of those were about the actual Apple product &#8211; most were comparing Steve Jobs to Newton himself. Check out my <a href="http://www.neilkodner.com/2010/10/fun-with-nltk-and-zoolander-part-1-concordance/" target="_blank">earlier post on NLTK concordance</a> for details on how I did this:</p>
<pre class="brush: python; title: ; notranslate">
import nltk
import string
f = open('stevejobs_tribute.txt').read()
f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
foo=nltk.Text(f.split())
# concordance() prints its own output, so no print statement is needed
foo.concordance('newton')
</pre>
<p>result:</p>
<pre class="brush: plain; title: ; notranslate">
op If history misses men like Isaac Newton Graham Bell Galileu Thomas Edison a
mbered though his legacy Now he met Newton Einstein and other geniuses like hi
oday I was one of the few who had a Newton Today I have an iPhone 4 an iPad2 a
oduct that came thereafter from the Newton to the Cube to the iPhone 4S God Bl
with the likes of Edison Garcia and Newton for his impact and vision I wish hi
ntioned in the same breath as Isaac Newton Thomas Edison and Bill Gates The le
 off a tree we are thinking of Adam Newton and Steve Jobs He open new dimensio
Jobs will be missed Da Vinci Mozart Newton Franklin Jobs Nobody is out of plac
ged my life starting with the Apple Newton followed by the iPod and then the i
 sorely missed nbsp Da Vinci Mozart Newton Franklin Jobs Nobody is out of plac
ve dared to Einstein Freud Da Vinci Newton Galileo Darwin among others is prou
embered beside Einstein Pasteur and Newton The world is moving toward his crea
irst Apple Mac I remember the first Newton I willnbspremembernbspSteves creati
e to contact us againnbsp How Isaac Newton and Albert Einstein contributed gre
 world One seduced Eve One awakened Newton and One was in the hands of Steve J
the way you have influenced mine If Newton discovered something as remarkable
rld One seduced Eve second awakened Newton the third one was in the hands of S
lent to Leonardo Da Vinci Sir Issac Newton Albert Einstein and the like He was
t of the caliber of that of DaVinci Newton Pythagorous etc The list can go on
hen people say names like ie Edison Newton and Einstein I guarantee that the n
 Computers” The Apple II Lisa Mac Newton iPod iTunes store iPod Touch iPhone
ember Steve Jobs the way I remember Newton or Einstein I lived with Apple prod
set consultant who bought his first Newton MacBook 170 and all the dozens of o
 br 3 Apples change the world Adán Newton Steve Jobs 19552011 Rest in Peace t
back to the Apple IIGS I also had a Newton Steve Jobs death hurts me personall
ed the world apple to adam apple to newton and apple to steve jobs Steve was a
dam and Eva Second one that wake up newton third one that Steve Jobs create St
</pre>
<p>Also interesting was the number of mentions of other historical figures in the Steve Jobs remembrance messages. According to the submitters, Steve Jobs is clearly in some elite company. I don&#8217;t know if I&#8217;d go so far as to group him with the men who brought automobiles and light bulbs to the masses but hey, we all have our priorities. All counts were determined through a simple grep command piped to wc -l. Here are a few examples:</p>
<ul>
<li>Einstein &#8211; 70</li>
<li>Ford &#8211; 189</li>
<li>Edison &#8211; 110</li>
<li>DaVinci &#8211; 15</li>
<li>Bill Gates &#8211; 8</li>
</ul>
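<p>The exact grep invocation is not shown here, so the following is only a rough Python 3 equivalent of grep piped to wc -l. It assumes one message per line and a plain, case-sensitive substring match; count_lines_mentioning is a made-up helper name and the three sample lines are invented.</p>
<pre class="brush: python; title: ; notranslate">
def count_lines_mentioning(lines, term):
  # rough equivalent of: grep term file | wc -l  (case-sensitive substring match)
  return sum(1 for line in lines if term in line)

sample = ['the Edison of technology', 'Einstein, Edison and Jobs', 'rest in peace']
for name in ['Einstein', 'Edison', 'Ford']:
  print(name, count_lines_mentioning(sample, name))
</pre>
<p>Short names are prone to substring hits (a case-insensitive search for ford also matches stanford and afford), so it is worth spot-checking a few matching lines before trusting a total.</p>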
<p>Finally, I wanted to see how people were speaking about Steve Jobs and especially what terms were being used to describe him. There was no point in performing sentiment analysis on this text, as all of the messages were not only obviously positive but were also vetted by Apple for content. Using NLTK, I performed part-of-speech tagging on every word in each tribute message and then wrote some code to total the adjectives and adverbs used in the tribute messages.</p>
<p>The most commonly-used adjectives are</p>
<pre class="brush: plain; title: ; notranslate">
('great', 1961)
('steve', 1808)
('many', 1459)
('first', 917)
('sad', 862)
('better', 857)
('such', 727)
('best', 721)
('visionary', 645)
('new', 579)
('more', 556)
('true', 538)
('most', 476)
('creative', 471)
('apple', 435)
('other', 427)
('same', 415)
('good', 412)
('greatest', 376)
('wonderful', 373)
('sorry', 362)
('old', 325)
('brilliant', 283)
('able', 281)
('incredible', 267)
('big', 260)
</pre>
<p>Humorously, NLTK frequently considered &#8220;Steve&#8221; to be an adjective. This is likely because it is always followed by the proper noun &#8220;Jobs.&#8221; A <a href="http://twitter.com/#!/japerk/status/127054008060878848">tweet</a> from <a href="http://www.streamhacker.com">NLTK expert Jacob Perkins</a> reminded me that machines are dumb and proper nouns should be capitalized. In order to aggregate the counts, I normalized the text by converting to lowercase &#8211; I wasn&#8217;t interested in nouns, only adjectives so proper nouns didn&#8217;t matter to me.<br />
The top adverbs, according to NLTK, were not as interesting, at least to me.</p>
<pre class="brush: plain; title: ; notranslate">
('so', 2220)
('never', 2111)
('not', 1897)
('always', 1798)
('just', 1402)
('now', 1028)
('truly', 989)
('only', 945)
('very', 919)
('much', 908)
('ever', 751)
('even', 743)
('really', 567)
('forever', 508)
('more', 486)
('still', 447)
('well', 398)
('most', 375)
('personally', 352)
</pre>
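<p>The tallying step behind both lists is small enough to show on its own. This Python 3 sketch assumes the tokens have already been tagged with Penn Treebank tags, which is what nltk.pos_tag returns; the tagged sample is written by hand so the snippet runs without NLTK or its models, and tally_by_tag is an illustrative name.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

def tally_by_tag(tagged_tokens, prefix):
  # count lowercased words whose Penn Treebank tag starts with prefix
  # (JJ, JJR, JJS are adjectives; RB, RBR, RBS are adverbs)
  return Counter(word.lower() for word, tag in tagged_tokens if tag.startswith(prefix))

# hand-tagged sample; real input would come from nltk.pos_tag(tokens)
tagged = [('Steve', 'NNP'), ('was', 'VBD'), ('a', 'DT'), ('truly', 'RB'),
          ('great', 'JJ'), ('visionary', 'NN'), ('and', 'CC'), ('the', 'DT'),
          ('greatest', 'JJS'), ('innovator', 'NN'), ('Great', 'JJ')]

print(tally_by_tag(tagged, 'JJ').most_common(2))
print(tally_by_tag(tagged, 'RB').most_common(2))
</pre>
<p>Lowercasing inside the tally, after tagging, keeps the capitalization that the tagger relies on to spot proper nouns while still merging Great and great into one count.</p>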
<p>And finally, I ran trigram analysis, again using NLTK.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict
import nltk

# text is the list of tokens for one message
trigrams = defaultdict(int)
nltk_trigrams = nltk.trigrams(text)
for itm in nltk_trigrams:
  trigrams[itm] += 1
</pre>
<p>As one would expect, the leading trigram was &#8216;<strong>rest in peace</strong>&#8217; with 1838 occurrences, a count equal to 16.7% of the 10975 messages. &#8216;<strong>thank you for</strong>&#8217; appeared 1446 times and &#8216;<strong>will be missed</strong>&#8217; 827 times. Other interesting trigrams are &#8216;<strong>thank you steve</strong>&#8217; with 791 occurrences and &#8216;<strong>changed the world</strong>&#8217; with 551.</p>
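<p>For anyone who wants to try the trigram count without installing NLTK, here is a small Python 3 sketch of the same idea. It splits on whitespace instead of using a proper tokenizer, which is a simplification, and trigram_counts plus the three sample messages are made up for illustration.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

def trigram_counts(messages):
  # count every run of three consecutive tokens, message by message,
  # so a trigram never spans two messages
  counts = Counter()
  for msg in messages:
    tokens = msg.lower().split()
    counts.update(zip(tokens, tokens[1:], tokens[2:]))
  return counts

sample = ['Rest in peace Steve', 'Thank you Steve rest in peace', 'RIP']
for trigram, n in trigram_counts(sample).most_common(1):
  print(' '.join(trigram), n)
</pre>
<p>Note that this counts occurrences, not messages: a message that repeats a phrase adds to the total each time.</p>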
<p>The full Python code and resulting data can be found on <a href="https://github.com/neilkod/steve_jobs_tribute_messages" target="_blank">GitHub</a>.</p>
<pre class="brush: python; title: ; notranslate">

#!/usr/bin/python
#nltk.help.upenn_tagset('RB')
from collections import defaultdict
from operator import itemgetter
import re
import urllib2
import string
import simplejson as json

import codecs
import nltk

OUTPUT_FILE = 'data/stevejobs_tribute.txt'

adverbs = defaultdict(int)
adjectives = defaultdict(int)
trigrams = defaultdict(int)

message_has_adjective = False
message_has_adverb = False
message_contains_product_mention = False
messages_with_adjective = 0
messages_with_adverb = 0
messages = 0
messages_with_product_mention = 0

exclude = set(string.punctuation)

products = {'iPhone':{'regex':'iphones?','count':0},
	'iMac':{'regex':'imacs?','count':0},
	'iPad':{'regex':'ipads?','count':0},
	'iTunes':{'regex':'itunes','count':0},
	'iPod':{'regex':'ipods?','count':0},
	'cube':{'regex':'cubes?','count':0},
	'MacBook':{'regex':'macbooks?','count':0},
	'iBook':{'regex':'ibooks?','count':0},
	'Apple TV':{'regex':'apple ?tvs?','count':0},
	'Apple II Family':{'regex':r'(apple )?(2|ii|\]\[|\/\/)([ce\+|]|gs|s)?[^0-9]', 'count':0},
	'LaserWriter':{'regex':'laserwriters?','count':0},
	'PowerBook':{'regex':'powerbooks?','count':0},
	'Newton':{'regex':'newtons?','count':0},
	'OSX':{'regex':'osx','count':0},
	'iMovie':{'regex':'imovie','count':0},
	'Macintosh':{'regex':'macintosh','count':0},
	'Lisa':{'regex':'lisa','count':0},
	'Mac':{'regex':'mac','count':0},
}

def top_n(dct,n = 10):
	srtd=sorted(dct.iteritems(), key=itemgetter(1), reverse=True)
	for x in srtd[:n]:
		print x

def nltk_concordance(term,text_file):
	f = open(text_file).read()
	# remove punctuation
	f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
	split_text=nltk.Text(f.split())
	split_text.concordance(term,lines=100)

	# &gt;&gt;&gt; f = f.translate(string.maketrans(&quot;&quot;,&quot;&quot;), string.punctuation)
	# &gt;&gt;&gt; foo=nltk.Text(f.split())
	# &gt;&gt;&gt; print foo.concordance('newton')

def unescape(s):
	&quot;&quot;&quot;unescapes html codes&quot;&quot;&quot;
	s = s.replace(&quot;&amp;lt;&quot;, &quot;&lt;&quot;)
	s = s.replace(&quot;&amp;gt;&quot;, &quot;&gt;&quot;)
	s = s.replace(&quot;&amp;nbsp;&quot;, &quot; &quot;)
	# this has to be last:
	s = s.replace(&quot;&amp;amp;&quot;, &quot;&amp;&quot;)
	return s

for line in open(OUTPUT_FILE):
	message_has_adjective = False
	message_has_adverb = False
	message_contains_product_mention = False

	# remove the trailing linefeed and convert to lower-case
	# and remove html control characters
	messages += 1
	data = line.strip()
	data = data.lower()
	data = unescape(data)

	# check for product mentions
	for k,v in products.iteritems():
		if re.search(v['regex'],data):
			products[k]['count'] += 1
			message_contains_product_mention = True

	# if the message contains a product mention
	# increment the product mention counter
	if message_contains_product_mention:
		messages_with_product_mention += 1

# tokenize the sentences using nltk's wordpuncttokenizer
	text = nltk.WordPunctTokenizer().tokenize(data)

# compute trigrams
	nltk_trigrams = nltk.trigrams(text)
	for itm in nltk_trigrams:
		trigrams[itm] += 1

# pos-tag each token. we're interested in adjectives and adverbs
	parts_of_speech = nltk.pos_tag(text)
	# test for adjectives and adverbs, increment the counters
	# when we find one.

	for (word,pos) in parts_of_speech:
		if pos.startswith('JJ'):
			message_has_adjective = True
			adjectives[word] += 1

		if pos.startswith('RB'):
			message_has_adverb = True
			adverbs[word] += 1

	# if the message contains an adverb or an adjective, increment a counter
	if message_has_adjective:
		messages_with_adjective += 1
	if message_has_adverb:
		messages_with_adverb += 1

# output the 25 most frequently-used adjectives and adverbs
n = 25
print &quot;top %s adverbs&quot; % n
top_n(adverbs, n)
print
print &quot;top %s adjectives&quot; % n
top_n(adjectives, n)

print &quot;messages with adjectives: %s&quot; % messages_with_adjective
print &quot;messages with adverbs: %s&quot; % messages_with_adverb
print &quot;total messages with product mentions: %s&quot; % messages_with_product_mention
print &quot;total messages: %s&quot; % messages

# output the top 50 most-common trigrams
n = 50
print &quot;top %s trigrams&quot; % n
top_n(trigrams, n)
# sort the products by their mention count, lowest first
srtd=sorted(products.iteritems(),key=lambda kv: kv[1]['count'])
for x,y in srtd:
	print &quot;%s\t\t%s&quot; % (x,y['count'])

print
print
# concordance for newton
print &quot;concordance for newton:&quot;
nltk_concordance('newton',OUTPUT_FILE)
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/10/an-analysis-of-steve-jobs-tribute-messages-displayed-by-apple/feed/</wfw:commentRss>
		<slash:comments>52</slash:comments>
		</item>
		<item>
		<title>Fun with awk and dead people</title>
		<link>http://www.neilkodner.com/2011/02/fun-with-awk-and-dead-people/</link>
		<comments>http://www.neilkodner.com/2011/02/fun-with-awk-and-dead-people/#comments</comments>
		<pubDate>Thu, 24 Feb 2011 19:36:15 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[awk]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[freebase]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=521</guid>
		<description><![CDATA[Just playing around with some Freebase data in preparation for a &#8216;who died today&#8217; twitter bot. Get the data and determine on which date did the most people die? Surprised to see 1965-11-08 listed ahead of 2001-09-11. Why? Lets look at where people died on 1965-11-08: Upon further investigation, it looks as if Freebasers have [...]]]></description>
			<content:encoded><![CDATA[<p>Just playing around with some <a href="http://www.freebase.com">Freebase</a> data in preparation for a &#8216;who died today&#8217; twitter bot.</p>
<p><strong>Get the data and determine on which date did the most people die?</strong></p>
<pre class="brush: bash; title: ; notranslate">

hadoop3:Downloads nkodner$ curl -O &quot;http://download.freebase.com/datadumps/latest/browse/people/deceased_person.tsv&quot;
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.3M  100 16.3M    0     0   209k      0  0:01:19  0:01:19 --:--:--  248k
hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep &quot;-&quot;|sort|uniq -c|sort -n|tail -11|head
  22 2008-01-03
  22 2008-02-21
  22 2008-05-20
  23 1989-06-07
  23 2009-01-13
  24 2009-01-11
  26 2009-04-03
  27 1912-04-15
  63 2001-09-11
  65 1965-11-08
</pre>
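<p>The awk, sort and uniq -c pipeline above boils down to counting the values in one tab-separated column. A rough Python 3 sketch of that idea follows. The column order in the made-up sample (name, id, cause, date, place) is an assumption pieced together from the awk field numbers used in this post, and column_counts is an illustrative helper name.</p>
<pre class="brush: python; title: ; notranslate">
import csv
import io
from collections import Counter

def column_counts(handle, column):
  # tally the values found in one tab-separated column (0-based index)
  counts = Counter()
  for row in csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE):
    # the slice yields zero or one value, so short rows are skipped quietly
    for value in row[column:column + 1]:
      # like grep '-': keep only values that contain a dash (full dates)
      if '-' in value:
        counts[value] += 1
  return counts

# made-up rows in the assumed column order: name, id, cause, date, place
sample = io.StringIO(
  'A\t1\tStroke\t1965-11-08\tToronto\n'
  'B\t2\t\t1965-11-08\t\n'
  'C\t3\tCancer\t1912\tLeicester\n'
  'D\t4\tCancer\t2001-09-11\tNew York City\n')
for date, n in column_counts(sample, 3).most_common(2):
  print(n, date)
</pre>
<p>For the real file, pass open('deceased_person.tsv', newline='', encoding='utf-8') as the handle; awk field $4 corresponds to column index 3.</p>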
<p>I was surprised to see 1965-11-08 listed ahead of 2001-09-11. Why? <strong>Let&#8217;s look at where people died on 1965-11-08</strong>:</p>
<pre class="brush: bash; title: ; notranslate">
hadoop3:Downloads nkodner$ grep &quot;1965-11-08&quot; deceased_person.tsv |awk -F'\t' '{print $5}' |sort|uniq -c|sort -n
   1 Kenton County
   1 Latium
   1 Leicester
   1 New York City
   1 Toronto
   3
  57 American Airlines Flight 383 Crash Site
</pre>
<p>Upon further investigation, it looks as if Freebasers have set up a <a href="http://www.freebase.com/view/base/americanairlinesflight383/views/victims_of_aa_flight_383">Victims of AA Flight 383 page</a>, containing info on the deceased. Works for me.</p>
<p><strong>How about the month and day on which the most people died?</strong></p>
<pre class="brush: bash; title: ; notranslate">
hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep &quot;-&quot;|awk -F'-' '{print $2&quot;-&quot;$3}'|sort|uniq -c|sort -n|tail -11|head
 668 02-08
 668 03-06
 672 01-06
 673 02-11
 676 01-28
 677 01-10
 683 01-04
 692 12-31
 702 01-22
 752 02-02
</pre>
<p><strong>Method of death?</strong></p>
<pre class="brush: bash; title: ; notranslate">
hadoop3:Downloads nkodner$ awk -F'\t' '{print $3}' deceased_person.tsv|sort|uniq -c|sort -n|tail -11|head
 505 Cardiovascular disease
 603 Tuberculosis
 742 Assassination
 745 Stroke
 799 Pneumonia
 832 Lung cancer
 913 Murder
1618 Suicide
1978 Cancer
2503 Myocardial infarction
</pre>
<p><strong>And finally, the most common names of the deceased people listed on Freebase</strong></p>
<pre class="brush: bash; title: ; notranslate">
hadoop3:Downloads nkodner$ awk -F '\t' '{print $1}' deceased_person.tsv |sort|uniq -c|sort -n|tail -11|head
  21 William Anderson
  23 John White
  25 John Campbell
  25 John Wilson
  29 George Smith
  30 John Anderson
  32 William Smith
  34 John Williams
  35 John Taylor
  36 John Smith
</pre>
<p>Nothing too deep today, but this data might be worth a closer look in R someday.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/02/fun-with-awk-and-dead-people/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizations of Canabalt scores scraped from twitter</title>
		<link>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/</link>
		<comments>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 22:56:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[canabalt]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=479</guid>
		<description><![CDATA[Canabalt, a ridiculously addicting web/IOS-device game allows one to show off their high scores, and their not-so-high scores to Twitter. Each of these tweets contains a bit of information &#8211; The score represented in meters, the method of death (hitting a wall and tumbling to my death) and the device (iPhone). Other useful information can [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.canabalt.com/">Canabalt</a>, a ridiculously addicting web/IOS-device game allows one to show off their high scores, and their <a href="http://twitter.com/#!/neilkod/status/37964035903324160">not-so-high scores</a> to Twitter.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore.png"><img class="alignnone size-medium wp-image-485" title="canabaltscore" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore-300x105.png" alt="" width="300" height="105" /></a></p>
<p>Each of these tweets contains a bit of information: the score represented in meters, the method of death (hitting a wall and tumbling to my death) and the device (iPhone). Other useful information can easily be extracted, such as the date/time played and information about the user (name, location, friend count, follower count, etc). Over the next few weeks I aim to see which features, if any, have an influence on Canabalt scores.</p>
<p>The first thing I needed to do was capture the tweeted Canabalt scores. I have a process running on an EC2 micro instance that downloads tweets from the Twitter Streaming API based on certain key words, one of them being canabalt. The process loads each matching tweet into a MongoDB instance hosted on <a href="http://www.mongohq.com">MongoHQ.com</a>.</p>
<pre class="brush: bash; title: ; notranslate">

curl -s -u $TWITTER_USERNAME:$TWITTER_PASSWORD -d @/home/ec2-user/trackingkeywords http://stream.twitter.com/1/statuses/filter.json |/home/ec2-user/mongodb/bin/mongoimport &amp;
</pre>
<p>Here, trackingkeywords is a file containing a comma-separated list of keywords that I track on Twitter. I left the connection details out of the mongoimport command; you&#8217;ll need to provide a host, port, database, and collection to it.</p>
<p>I then run some Python code to query the MongoDB instance and retrieve tweets mentioning Canabalt, based on a simple regular expression. I&#8217;m expecting the tweet to begin with &#8216;I ran&#8217; and contain the word canabalt. Pretty naive, but it worked fine. If it&#8217;s not a true Canabalt score, I&#8217;ll be able to tell in no time. From there, I use regular expressions to extract (for now) the score, the method of death, and the device name.</p>
<pre class="brush: python; title: ; notranslate">
import re

def canabalt_tweets():

	# connect to MongoDB
	tweets = create_connection(False)

	# regular expression to extract components of a canabalt score
	canabalt_regexp = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')

	# regular expression to match tweets that begin with I ran and mention canabalt
	regexp = re.compile('^I ran .*canabalt')

	# create a MongoDB cursor(query)
	cur = tweets.conftweets.find({'text': regexp}, {'text': 1})

	# iterate through the cursor. If a tweet fits the pattern, print it.
	for item in cur:
		match = canabalt_regexp.search(item['text'])
		# skip tweets that passed the loose filter but aren't real scores
		if match is None:
			continue
		(score,death,device) = match.groups()
		print ','.join([strip_text(score),strip_text(death),strip_text(device)])
</pre>
<p>Function strip_text() is part of my data tools Bat-Utility Belt and cleans text by removing leading/trailing spaces, crlf, tabs and some other junk.</p>
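<p>To see the extraction step in isolation, here is a small Python 3 sketch that applies the same pattern to a single string. The tweet text is invented, assembled from the pattern and the sample rows in this post, and the built-in strip() stands in for strip_text(), whose source is not shown here. parse_score is an illustrative name.</p>
<pre class="brush: python; title: ; notranslate">
import re

CANABALT = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')

def parse_score(text):
  # return (score, death, device) or None when the tweet does not fit
  match = CANABALT.search(text)
  if match is None:
    return None
  score, death, device = (part.strip() for part in match.groups())
  return int(score), death, device

# invented tweet text, assembled from the pattern and the sample rows in this post
tweet = 'I ran 2860m before hitting a wall and tumbling to my death on my iPhone. http://example.com'
print(parse_score(tweet))
print(parse_score('I ran out of coffee'))
</pre>
<p>Returning None for a non-match keeps the skip decision with the caller, which is easier to debug than swallowing every exception.</p>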
<p>We now have some comma-separated data in this shape</p>
<pre class="brush: plain; title: ; notranslate">
score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
</pre>
<p>Now for some more fun &#8211; visualization and analysis. This is performed in R because, well, R is awesome. That, and I really need some more practice with R.</p>
<p>To date, I&#8217;ve collected just over 1200 Canabalt &#8216;events&#8217;. I will likely turn this into a web app if there&#8217;s enough interest.</p>
<p>A couple of summaries:</p>
<p>scores by device type:</p>
<pre class="brush: plain; title: ; notranslate">
      device count mean stddev median   max min range
      iPhone   735 4491   3882 3419.0 36332 102 36230
        iPad   284 4723   3884 4041.5 40630 104 40526
  iPod touch   189 3734   3644 2713.0 28024 102 27922
</pre>
<p>scores by type of death:</p>
<pre class="brush: plain; title: ; notranslate">
                                            death count mean stddev median   max  min range
          hitting a wall and tumbling to my death   684 4155   3481 3319.5 36332  102 36230
                           missing another window   243 5898   4981 4486.0 40630  409 40221
                         turning into a fine mist    86 3592   2698 2662.5 16441  614 15827
            colliding with some enormous obstacle    40 4768   4247 3256.5 16933  433 16500
                              falling to my death    37 4176   3160 3619.0 13573  567 13006
                       missing a crane completely    22 2950   1774 2923.5  7883  381  7502
                         knocking a building down    21 3399   2267 2849.0  8374  336  8038
                   not quite reaching a billboard    19 3098   1244 2980.0  5772  444  5328
              landing where a building used to be    17 4804   4970 3631.0 22685 1170 21515
          somehow hitting the edge of a billboard    14 5991   3827 5518.5 13547  566 12981
   just barely stumbling out of the first hallway    13  104      1  104.0   104  102     2
              somehow hitting the edge of a crane     7 5497   4835 4942.0 13275  510 12765
       riding a falling building all the way down     4 4278   2162 4195.5  6993 1727  5266
           completely  missing the entire hallway     1 1046     NA 1046.0  1046 1046     0
</pre>
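<p>The summaries above came out of R and plyr. As a rough illustration only, here is a Python sketch of the same kind of group-by summary, run against the nine sample rows from the CSV excerpt earlier in this post. The function name <code>summarize</code> and the output layout are assumptions made for this example, not the code in the github repository.</p>

```python
import csv
import io
import statistics
from collections import defaultdict

# Sample rows copied from the CSV excerpt earlier in the post.
SAMPLE_CSV = """score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
"""


def summarize(csv_text, key):
    """Group scores by the column `key` and compute basic statistics."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[key]].append(int(row["score"]))

    summary = {}
    for name, scores in groups.items():
        summary[name] = {
            "count": len(scores),
            "mean": statistics.mean(scores),
            # A sample standard deviation needs at least two values;
            # None plays the role of R's NA for single-row groups.
            "stddev": statistics.stdev(scores) if len(scores) > 1 else None,
            "median": statistics.median(scores),
            "max": max(scores),
            "min": min(scores),
            "range": max(scores) - min(scores),
        }
    return summary


if __name__ == "__main__":
    for device, stats in sorted(summarize(SAMPLE_CSV, "device").items()):
        print(device, stats)
```

<p>Passing <code>"death"</code> instead of <code>"device"</code> as the key groups by type of death. A group with a single row gets no standard deviation, which matches the NA in the table above.</p>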
<p>And now, in the spirit of killing the almighty data-ink ratio, here are some pictures:<br />
<img class="alignnone size-full wp-image-503" title="overall plot of scores" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscores.png" alt="plot of scores" width="619" height="630" /></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11.png"><img class="alignnone size-large wp-image-505" title="by death faceted by device type" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11-1024x779.png" alt="by death faceted by device type" width="717" height="545" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png"><img class="alignnone size-full wp-image-507" title="scores by device" src="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png" alt="scores by device" width="534" height="539" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype.png"><img class="alignnone size-large wp-image-509" title="bydeathtype" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype-1024x641.png" alt="by death type" width="614" height="385" /></a></p>
<p>What have we learned? So far, while my data set isn&#8217;t all that large (1200 events), we might have enough to make some basic observations and assumptions (corrections, please!). Going into this experiment, I thought that iPad players would generally have higher scores, for two reasons: the larger screen, and the fact that iPad players aren&#8217;t necessarily playing &#8216;on-the-go&#8217; the way they would be (I know I am) on an iPhone or iPod touch. So far the data agrees: the iPad has higher median and mean scores than the other devices. I&#8217;d like to revisit this as I collect more data.</p>
<p>The leading cause of Canabalt death, by far, is hitting a wall and tumbling to one&#8217;s death. This surprised me as I thought it would be falling to death &#8211; that&#8217;s how my Canabalt games seem to end.</p>
<p>I&#8217;d like to hear your comments, suggestions for new analysis, and most of all, your corrections.  You know who you are and this is how I learn. The data and Python/R source can be found on <a href="https://github.com/neilkod/canabalt">github</a>.</p>
<p>The stack: Twitter Streaming API, EC2, MongoDB, Python, Regular Expressions, R</p>
<p>Things I learned working on this: <a href="http://had.co.nz/plyr/">plyr</a> (group-by and aggregation in R), sorting dataframes in R, and a couple of new <a href="http://had.co.nz/ggplot2/">ggplot2</a> tricks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>My Twitter bots:  Tens of thousands of followers can&#8217;t be wrong</title>
		<link>http://www.neilkodner.com/2010/12/my-twitter-bots-tens-of-thousands-of-followers-cant-be-wrong/</link>
		<comments>http://www.neilkodner.com/2010/12/my-twitter-bots-tens-of-thousands-of-followers-cant-be-wrong/#comments</comments>
		<pubDate>Tue, 21 Dec 2010 12:20:24 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[seinfeld]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=412</guid>
		<description><![CDATA[edit: March 17, 2011 I need your help! If you have additional Seinfeld quotes to contribute, or for a list of all of the current Seinfeld quotes, please visit this post. My current army of twitter bots and the keyword that each one responds to: @HelloooooNewman (seinfeld) Klout score 74 @TheBotLebowski (lebowski) Klout score 70 [...]]]></description>
			<content:encoded><![CDATA[<p><strong>edit: March 17, 2011 I need your help! If you have additional Seinfeld quotes to contribute, or for a list of all of the current Seinfeld quotes, <a href="http://www.neilkodner.com/2011/03/looking-for-some-new-quotes-for-hellooooonewman/">please visit this post.</a></strong></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2010/12/bot-followers.png"><img class="alignnone size-full wp-image-414" title="bot followers" src="http://www.neilkodner.com/wp-content/uploads/2010/12/bot-followers.png" alt="" width="839" height="182" /></a></p>
<p>My current army of twitter bots and the keyword that each one responds to:</p>
<ul>
<li><a href="http://www.twitter.com/#!/hellooooonewman">@HelloooooNewman</a> (seinfeld) <a href="http://klout.com/hellooooonewman">Klout score 74</a></li>
<li><a href="http://www.twitter.com/#!/thebotlebowski">@TheBotLebowski</a> (lebowski) <a href="http://klout.com/thebotlebowski">Klout score 70</a></li>
<li><a href="http://www.twitter.com/#!/acenterforants">@ACenterForAnts</a> (zoolander) <a href="http://klout.com/acenterforants">Klout score 70</a></li>
<li><a href="http://www.twitter.com/#!/iamjacksbot">@IAmJacksBot</a> (fight club) <a href="http://klout.com/iamjacksbot">Klout Score 74</a></li>
<li><a href="http://www.twitter.com/#!/amaninamask">@AManInAMask</a> (V for Vendetta)</li>
<li><a href="http://www.twitter.com/#!/worldofshit">@WorldOfShit</a> (full metal jacket) <a href="http://klout.com/worldofshit">Klout score 65</a></li>
<li><a href="http://www.twitter.com/#!/somegrenades">@SomeGrenades</a> (serenity + firefly) <a href="http://klout.com/somegrenades">Klout score 56</a></li>
<li><a href="http://www.twitter.com/#!/gunshowtickets">@GunShowTickets</a> (Ron Burgundy) No Klout score yet</li>
<li><a href="http://www.twitter.com/#!/pleasebe18">@PleaseBe18</a> (Ricky Bobby)</li>
<li><a href="http://www.twitter.com/#!/abakingpowder">@ABakingPowder</a> (schwing) <a href="http://klout.com/abakingpowder">Klout score 54</a></li>
<li><a href="http://www.twitter.com/#!/which_is_nice">@Which_is_nice</a> (caddyshack) <a href="http://klout.com/which_is_nice">Klout score 56</a></li>
<li><a href="http://www.twitter.com/#!/mitchhedbot">@MitchHedbot</a> (mitch hedberg) No Klout Score yet</li>
<li><a href="http://www.twitter.com/#!/dubbbya">@dubbbya</a> (gwb) &#8212; banned from twitter</li>
<li><a href="http://www.twitter.com/#!/dreidly">@dreidly</a> (dreidel) &#8212; retired</li>
</ul>
<p>I&#8217;ve also built a few programs that scrape air quality data from the State of Utah and tweet the results.</p>
<ul>
<li><a href="http://www.twitter.com/#!/utahairquality">@UtahAirQuality</a> serving Salt Lake and Davis Counties</li>
<li><a href="http://www.twitter.com/#!/webercountyair">@WeberCountyAir</a></li>
<li><a href="http://www.twitter.com/#!/cachecountyair">@CacheCountyAir</a></li>
<li><a href="http://www.twitter.com/#!/utahcountyair">@UtahCountyAir</a></li>
</ul>
<p><a href="http://twitter.com/#!/usadebtlevel">@usadebtlevel</a> which tweets the US National Debt and each US Citizen&#8217;s share.</p>
<p>And here&#8217;s a sneak preview: @SarahEffinPalin was conceived after a friend, Willie Morris (<a href="http://www.twitter.com/#!/morewillie">@morewillie</a>), suggested a bot that, let&#8217;s say, repurposes <a href="http://www.twitter.com/#!/sarahpalinusa">Sarah Palin&#8217;s</a> tweets.  I think <a href="http://www.twitter.com/#!/saraheffinpalin">@SarahEffinPalin</a> is going to be a hit.</p>
<p>We all know The Big Lebowski is a cult classic and one of the most quotable movies of all time.  I don&#8217;t exactly remember how this started but a long time ago, I thought people who mentioned &#8220;Lebowski&#8221; in a tweet would appreciate receiving a quote from the movie.  So with nothing but the Twitter API docs and a little bit of Python, I built <a href="http://www.twitter.com/#!/thebotlebowski">@thebotlebowski</a>, my first auto-responder.  The idea was simple &#8211; using urllib2, perform a search for &#8220;lebowski&#8221;, and iterate through the results.  For each result, retrieve a random entry out of a quotes database and tweet it as a reply to the original tweet.</p>
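<p>The core of that loop can be sketched in a few lines of Python. This is only an illustration of the idea, not the bots&#8217; actual source: the function name <code>build_replies</code>, the dictionary keys <code>id</code> and <code>from_user</code>, and the optional ignore list parameter are all assumptions, and the calls that fetch search results and post the replies are left out because their details aren&#8217;t covered here.</p>

```python
import random


def build_replies(search_results, quotes, ignored_users=(), rng=random):
    """Pair each matching tweet with a randomly chosen quote.

    search_results: iterable of dicts with the hypothetical keys
                    'id' and 'from_user' (one dict per matching tweet).
    quotes:         list of quote strings (the "quotes database").
    ignored_users:  screen names that asked not to receive replies.
    Returns a list of (in_reply_to_id, reply_text) tuples.
    """
    ignored = {name.lower() for name in ignored_users}
    replies = []
    for tweet in search_results:
        user = tweet["from_user"]
        if user.lower() in ignored:
            continue  # skip anyone on the ignore list
        quote = rng.choice(quotes)
        # Tweets were limited to 140 characters at the time; plain
        # truncation is an assumption made to keep the sketch short.
        text = "@{} {}".format(user, quote)[:140]
        replies.append((tweet["id"], text))
    return replies


if __name__ == "__main__":
    found = [{"id": 1, "from_user": "alice", "text": "rewatching lebowski"}]
    print(build_replies(found, ["The Dude abides."]))
```

<p>Keeping the pairing logic separate from the network calls makes it easy to try out against a hand-written list of search results before anything is actually tweeted.</p>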
<p><span style="font-size: 11.6667px;">After a ton of retweets, #ff mentions, replies, and followers, it became pretty obvious that people liked it.  I needed a followup &#8211; another infinitely quotable movie.  Zoolander!  Thus, <a href="http://www.twitter.com/#!/acenterforants">@ACenterForAnts</a> was born.  Again, my research showed that all mentions of Zoolander on twitter were either references to the movie or Derek Zoolander himself.</span></p>
<p>Another follow-up was in order.  A <a href="http://www.twitter.com/#!/abstractdata">friend</a> suggested a Seinfeld one.  Done. Welcome <a href="http://www.twitter.com/#!/hellooooonewman">@HelloooooNewman</a>. And then <a href="http://www.twitter.com/#!/iamjacksbot">@IAmJacksBot</a> and then the others.  The key was to create bots that use search terms that are not vague &#8211; if someone tweets &#8220;Full Metal Jacket&#8221;, then they&#8217;re obviously talking about the movie.  Same with &#8220;Fight Club.&#8221;</p>
<p>One of the lessons learned was that not everyone who tweets about &#8220;GWB&#8221; was necessarily referencing the president.  A <a href="http://www.twitter.com/#!/dtseiler">friend</a> suggested a bot that replies to mentions of GWB with one of George Bush&#8217;s self-butchered quotes.  People loved it except for people in New York &#8211; hey, I didn&#8217;t realize that so many people tweeted about the George Washington Bridge in abbreviated format!  This includes several NYC twitter accounts that automatically post traffic conditions.  The complaints came in quicker than I could add people to the ignore list.  Eventually, the well-loved but polarizing <a href="http://www.twitter.com/#!/dubbbya">@dubbbya</a> was banned from twitter.  May his <a href="http://www.neilkodner.com/georgewbushquotes.txt">quotes</a> live on in infamy.</p>
<p>While the bots have been very well-received, not everyone likes them.  When there were only a handful of bots, I used to monitor their responses.  Now that there are so many, I&#8217;ve added an ignore list for just this reason.  To add yourself to the ignore list, either contact me, <a href="http://www.twitter.com/#!/@neilkod">tweet me</a>, or visit <a href="http://neilkodsbots.appspot.com">http://neilkodsbots.appspot.com</a>.</p>
<p><strong>Frequently Asked Questions:</strong></p>
<p><strong>Have any celebrities found your bots?</strong></p>
<p>The bots tweet out to celebrities all of the time.  @ACenterForAnts has tweeted <a href="http://www.twitter.com/#!/redhourben">Ben Stiller</a> many, many times but Ben has never replied.  Sometimes, the celebrities tweet back.  I don&#8217;t actively monitor the mentions and replies to the bots &#8211; there are just too many.  My favorite anecdote, so far, is when <a href="http://www.twitter.com/#!/adamsbaldwin">Adam Baldwin</a> discovered <a href="http://www.twitter.com/#!/worldofshit">@worldofshit</a>, my Full Metal Jacket bot, and immediately triggered it over and over to receive new quotes.  He then started tweeting about the bot to his followers and it quickly picked up steam.  I was so thrilled to see one of the stars of Full Metal Jacket tweeting so favorably about a program I wrote that I created <a href="http://www.twitter.com/#!/somegrenades">@somegrenades</a> in his honor.</p>
<p>If you notice a celebrity or otherwise notable person referencing one of my bots, please let me know.  <a href="http://www.delicious.com/neilkod/celebrity">The mentions that I know about</a> include Q-Tip, Talib Kweli, and Fred Durst.</p>
<p><strong>But you&#8217;re a data geek, not a twitter programmer!  Are you doing anything cool with the data?</strong></p>
<p>Yes!  Every tweet that I find, I log.  For example, since <a href="http://www.twitter.com/#!/hellooooonewman">@HelloooooNewman</a> has already sent out over 170,000 replies, I have at least that many incoming tweets mentioning Seinfeld in my logs.  I am able to tell who&#8217;s tweeting about Seinfeld, when people are talking about Seinfeld, what they&#8217;re saying, and so on and so forth.  I can even tell if certain events, such as the release of a box set or a news event, have resulted in an increase of Seinfeld tweets.  For examples of some of the things I&#8217;ve done with the twitter data, check out this <a href="http://www.neilkodner.com/2010/04/hacking-seinfeld-tweets-with-apache-pig-a-work-in-progress/">analysis of Seinfeld Tweets</a> or this <a href="http://www.neilkodner.com/2010/11/what-do-23000-charlie-sheen-tweets-look-like/">word cloud generated from 23,000 tweets about Charlie Sheen</a>.  Please contact me if you&#8217;d like to hear more.</p>
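<p>As a hedged example of the kind of question the logs can answer, the Python sketch below counts logged tweets per day and flags days that are unusually busy. The ISO-8601 timestamp format, the function names, and the &#8220;more than twice the median&#8221; threshold are assumptions chosen for illustration; the real logs and analysis may look quite different.</p>

```python
import statistics
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps):
    """Count tweets per calendar day from ISO-8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).date() for ts in timestamps)


def spike_days(daily_counts, factor=2.0):
    """Return the days whose count exceeds `factor` times the median day."""
    if not daily_counts:
        return []
    baseline = statistics.median(daily_counts.values())
    return sorted(day for day, n in daily_counts.items()
                  if n > factor * baseline)


if __name__ == "__main__":
    logged = (["2010-12-01T09:00:00", "2010-12-02T10:30:00"]
              + ["2010-12-03T12:00:00"] * 5)
    counts = tweets_per_day(logged)
    print(spike_days(counts))
```

<p>The median is used as the baseline here because a single very busy day barely moves it, whereas it would pull a mean upward and hide smaller spikes.</p>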
<p><span style="font-size: 11.6667px;"><strong>Would you create a bot for me/my company/my promotion?</strong></span></p>
<p>I get asked this all of the time.  The answer is:  It depends.  Let&#8217;s talk.  Before we go about doing this, we&#8217;d need to establish a few ground rules.  I&#8217;ve worked very hard to keep the bots entertaining and not spammy.</p>
<p><strong>The tweets don&#8217;t include urls or advertisements &#8211; Are you making any money off of them?</strong></p>
<p>While the bots don&#8217;t generate income directly, they have led to other opportunities and benefits.  For starters, I&#8217;ve picked up a ton of quality followers and contacts that I would have never met.  Additionally, through this experience, I&#8217;ve learned a great deal about twitter, the twitter API, and numerous features of Python that I wouldn&#8217;t have normally dived into.  To answer the question: a few companies and web sites have licensed the technology and I&#8217;ve created custom bots and twitter searches for them.  I&#8217;ve elected to not mention them directly in this post.</p>
<p><strong>I don&#8217;t want the bots to reply to my tweets.  Can they ignore me?</strong></p>
<p>Sure, the easiest way to be ignored is to visit <a href="http://neilkodsbots.appspot.com">http://neilkodsbots.appspot.com</a> and add yourself to the ignore list.  Honor system please!  I didn&#8217;t feel it was necessary to ask users to authenticate via twitter just so my application could ignore them.</p>
<p><strong>I&#8217;m selling Seinfeld/Zoolander/Lebowski products &#8211; will you tweet this link to all of your bots&#8217; followers?</strong></p>
<p>I also get asked this all of the time.  <a href="http://www.twitter.com/#!/hellooooonewman">@HelloooooNewman</a> has over ten thousand followers.  <a href="http://www.twitter.com/#!/ACenterForAnts">@ACenterForAnts</a> and <a href="http://www.twitter.com/#!/TheBotLebowski">@TheBotLebowski</a> also combine for another ten thousand followers.  While I won&#8217;t send a mention of your product/URL/promotion to their followers, I do have other methods of driving traffic and building awareness to a targeted group of followers.  Let&#8217;s talk.</p>
<p><strong>May I have the source code?</strong></p>
<p>Since I&#8217;ve been using this program for a few not-mentioned commercial purposes, I&#8217;m not interested in sharing the secret sauce.  I will, however, let you know it was pretty straightforward to do.  Anyone with a minimum of programming skill should be able to do this.</p>
<p><strong>Will there be more bots?</strong></p>
<p>Always.  I&#8217;m always on the lookout for new ideas.  Let me know if you have any.  The next one on my plate will be one for Eastbound and Down.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/12/my-twitter-bots-tens-of-thousands-of-followers-cant-be-wrong/feed/</wfw:commentRss>
		<slash:comments>36</slash:comments>
		</item>
		<item>
		<title>Word Cloud from 6,500 tweets mentioning Kanye West.  From this morning</title>
		<link>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/</link>
		<comments>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/#comments</comments>
		<pubDate>Tue, 14 Dec 2010 22:25:32 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[kanye]]></category>
		<category><![CDATA[kanyewest]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=403</guid>
		<description><![CDATA[After removing a few stopwords and then clearing out a few other words (nowplaying, lastfm, and the like), here&#8217;s what&#8217;s left.  The data represents a half-day&#8217;s worth of tweets.   I&#8217;m sitting on about 90,000 tweets about Kanye and am looking forward to taking the time for some more in-depth analysis.  Huge thanks to @jrlevine and [...]]]></description>
			<content:encoded><![CDATA[<p>After removing a few <a href="http://www.neilkodner.com/stopwords.txt">stopwords</a> and then clearing out a few other words (nowplaying, lastfm, and the like), here&#8217;s what&#8217;s left.  The <a href="http://www.neilkodner.com/kanyetoday.txt">data</a> represents a half-day&#8217;s worth of tweets.   I&#8217;m sitting on about 90,000 tweets about Kanye and am looking forward to taking the time for some more in-depth analysis.  Huge thanks to <a href="http://www.twitter.com/#!/jrlevine">@jrlevine</a> and <a href="http://www.twitter.com/#!/alexmr">@alexmr</a> from <a href="http://www.twordsie.com">twordsie.com</a> for curating the awesome stopwords list, which I found in their <a href="https://github.com/jakelevine/twordsie">github project</a>.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2010/12/kanye-word-cloud.png"><img class="alignnone size-large wp-image-404" title="kanye word cloud" src="http://www.neilkodner.com/wp-content/uploads/2010/12/kanye-word-cloud-1024x447.png" alt="kanye west tweets word cloud" width="1024" height="447" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/12/word-cloud-from-6500-tweets-mentioning-kayne-west-from-this-morning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/</link>
		<comments>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 16:58:55 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hashtag]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=287</guid>
		<description><![CDATA[Through the magic of hadoop, pig, over 300 million (and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I&#8217;d take a look at all of the hashtags containing the beloved f-word. Let&#8217;s get the technical details out of the way.  Since the middle of June, I&#8217;ve been saving as many tweets [...]]]></description>
			<content:encoded><![CDATA[<p>Through the magic of <a href="http://hadoop.apache.org/">hadoop</a>, <a href="http://hadoop.apache.org/pig">pig</a>, over 300 million (and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I&#8217;d take a look at all of the hashtags containing the beloved f-word.</p>
<p>Let&#8217;s get the technical details out of the way.  Since the middle of June, I&#8217;ve been saving as many tweets as I can to local storage, using Twitter&#8217;s streaming API and my gardenhose access.  Sorry, <a href="http://www.cloudera.com">Cloudera</a> guys, I&#8217;m not yet using <a href="http://github.com/cloudera/flume">flume</a>, but it&#8217;s high on the to-do list.  Using a 3-node cluster, I&#8217;m able to search through these tweets and extract valuable(?) data in a matter of minutes.</p>
<p>The pig script (sorry, looks like gist.github.com doesn&#8217;t auto-format pig):<br />
<script src="http://gist.github.com/515641.js"></script></p>
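<p>For anyone without a Hadoop cluster handy, the same kind of count can be sketched on a single machine in Python. To be clear, this is not the Pig script from the gist: the function name, the regular expression used to find hashtags, and the decision to lowercase every tag are assumptions made for illustration.</p>

```python
import re
from collections import Counter

# A hashtag is treated here as '#' followed by word characters; Twitter's
# real tokenization rules are more involved, so this is an approximation.
HASHTAG_RE = re.compile(r"#\w+")


def count_hashtags_containing(tweets, needle):
    """Count lowercased hashtags that contain the substring `needle`.

    tweets: iterable of tweet texts.
    Returns a Counter mapping hashtag -> number of occurrences.
    """
    needle = needle.lower()
    counts = Counter()
    for text in tweets:
        for tag in HASHTAG_RE.findall(text):
            tag = tag.lower()
            if needle in tag:
                counts[tag] += 1
    return counts


if __name__ == "__main__":
    sample = ["Game night #TeamLakers", "ugh #teamlakers again", "#mondays"]
    for tag, n in count_hashtags_containing(sample, "lakers").most_common(10):
        print("{}\t{}".format(tag, n))
```

<p>The <code>most_common(10)</code> call produces a top-ten list in the same tag-then-count shape as the results in this post. At 300 million tweets this would be slow on one machine, which is exactly why the cluster is there.</p>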
<p>And now the fun stuff.  I found over 31,000 different hashtags containing the f-word.  Bonus to the first person who can tell me what GFW is.<br />
The top-ten results and the frequency of their mentions are:</p>
<pre class="brush: plain; title: ; notranslate">
#fuck	10406
#fuckouttahere	3172
#fuckinfollow	3062
#fuckit	2970
#fuckyou	2303
#fuckgfw	1573
#fuckyeah	1551
#fuckery	1436
#fucking	1273
#fuckoff	988
</pre>
<p>Let&#8217;s move on to what&#8217;s really important &#8211; celebrities, sports figures, and other important American topics:</p>
<p>Lady Gaga</p>
<pre class="brush: plain; title: ; notranslate">
#dontfuckwiththegaga	11
#fuckthegagahaters	4
#fuckgaga	3
#fuckladygaga	3
#dontfuckwithgaga	2
#fuckmegaga	2
#fucktrannygaga	1
</pre>
<p><span id="more-287"></span></p>
<p>Obama</p>
<pre class="brush: plain; title: ; notranslate">
#fuckobama	7
#dontfuckwithobama	2
#fuckyouobama	1
#fuckbarackobama	1
#fucking_twat_obama	1
#obamabrieflymadefuckedmeover	1
</pre>
<p>taxes</p>
<pre class="brush: plain; title: ; notranslate">
#fucktaxes	5
#blackpeopleneverpaybillsfuckinuptaxescredit	1
#fuckingtaxes	1
</pre>
<p>The NY Yankees</p>
<pre class="brush: plain; title: ; notranslate">
#fucktheyankees	31
#fuckyankees	3
#fuckdayankees	1
#fuckyouyankees	1
</pre>
<p>The Red Sox</p>
<pre class="brush: plain; title: ; notranslate">
#fucktheredsox	2
#fuckredsox	1
#fuckredsoxfans	1
#fucktheredsoxs	1
#ifuckinghaaaateredsoxanddavidortizandhisslowfatassshouldveputarodinmaybewouldvewon	1
#fucktheredsoxandanybodywhoplaysforthem	1
</pre>
<p>Lakers</p>
<pre class="brush: plain; title: ; notranslate">
#fuckthelakers	279
#fucklakers	65
#teamfuckthelakers	48
#fuckdalakers	39
#fuckteamlakers	14
#teamfuckdalakers	9
#teamfuckinglakers	8
#teamfucklakers	7
#fuckyoulakers	6
...
#itsstillfucklakersalldayeverydaytillkoberetires	1
#teamidontgiveafuckaboutlakersorcelticskickrockd	1
#fuckdaflakers	1
#fuckyealakers	1
#fuckalakersfan	1
#fucklakersssss	1
#fucktheflakers	1
#fuckyourlakers	1
#fuckeverylakersfanonthegotdamnplanetcuztheyaintshitforeal	1
</pre>
<p>Celtics</p>
<pre class="brush: plain; title: ; notranslate">
#fucktheceltics	43
#fuckceltics	39
#teamfuckceltics	16
#fuckteamceltics	8
#teamfucktheceltics	5
#teammmmfuckingceltics	4
#fuckdaceltics	3
#teamfuckthecelticsandlakers	3
#fuckthemceltics	3
#fuckyouceltics	2
...
#fuckthelakersandceltics	1
#fuckyourfeelingsceltics	1
#teamcelticsallfuckingday	1
#fuckthecelticsandthehaters	1
#teamifuckinghatetheceltics	1
#fuckthelakersfucktheceltics	1
#teamfuckthecelticswith10dicks	1
</pre>
<p>Lebron</p>
<pre class="brush: plain; title: ; notranslate">
#fucklebron	366
#teamfucklebron	45
#fucklebronjames	32
#fuckyoulebron	18
#fucklebronandhisdecision	16
#fuckalebron	4
#teamfucklebronjames	4
#newyorksaysfucklebron	4
#fuckyoutolebron	3
#fucklebronforlife	2
#fucklebronbitchass	2

(many more)
...
#teamgetthefuckofflebrondickhalfofyalgotsummerschooldoyourhomeworkyoudickriders	1
#fuckouttaherelebrons	1
#fucklebronheabitchassniggaheisnotarealmanfaggotassbitchassdickridinassthatswhyhesecondtodwade	1
</pre>
<p>Haters</p>
<pre class="brush: plain; title: ; notranslate">
#fuckthehaters	67
#fuckhaters	25
#fuckyouhaters	15
#fuckinhaters	7
#fuckjlshaters	6
#fuckdahaters	6
#demihatersfuck	4
#fuckdemihaters	4
</pre>
<p>Snitches</p>
<pre class="brush: plain; title: ; notranslate">
#fucksnitches	3
#fuckinsnitches	1
</pre>
<p>and finally, who could forget J-Bieb</p>
<pre class="brush: plain; title: ; notranslate">
#fuckyoubieberisafag	126
#dontfuckwithjustinbieber	105
#fuckjustinbieber	10
#fuckbieber	10
#bieberisafagshouldshutthefuckup	6
#fuckyoubieber	6
#teamfuckbieber	5
#fuckoffbieberarmy	5
#dontfuckwithbieber	5
#biebersnewhaircutisfuckinsexysostfuitsjusthairitwillgrowbackgetafuckinlifebitches	4
#whothefuckisjustinbieber	3
#fuckingunfollowbieberarmy	3
#dontfuckwithjustinbieberslegalbeliebers	3
#ifuckbieber	2
(many many more....)
#fuckyeahjustinbiebermix	1
#fuckinunfollowbieberarmy	1
#ohmyfuckinjustinbiebergasm	1
#fuckthattinylittlebieberfag	1
#fuckyoubiebertyzasranyklamco	1
#justinbieberisafuckingpussyshit	1
#biebersafagshouldgetafuckinglife	1
#fuckyouallthehatersofjustinbieber	1
#ilovejustindrewbieberfuckwatucare	1
#fuckjustinbieberbringfalloutboyback	1
#whothefuckisstilltrendingjustinbieber	1
#youstupidbieberhatersneedafuckinglife	1
#fuckjustinbieberinhisstupidlookingbangs	1
</pre>
<p>And how about those who can&#8217;t spell Bieber?</p>
<pre class="brush: plain; title: ; notranslate">
#fuckbeiber	1
#fuckyoubeiber	1
#teamfuckbeiber	1
#fuckjustinbeiber	1
#dontfuckwithjustinbeiber	1
#fuckinwiththatjustbeiber	1
#justinbeibershouldgofuckhimself	1
</pre>
<p>We&#8217;ve got geographic locations covered as well<br />
Jersey/New York/Philly</p>
<pre class="brush: plain; title: ; notranslate">
#fuckjerseyshore	22
#fuckphilly	7
#fucknewjersey	4
#newyorksaysfucklebron	4
#fuckthegirlswhomadejustincriedandrolledoffbackstageinnewjersey	3
#teamfuckjerseyshore	3
#jerseyfuckingshore	2
#ifuckinglovenewyork	2
#jerseyshoreisfulloffucks	2
#fuckmarryorkill	2
#fuckjersey	1
#fatitalianmufuckawitthebeardfromsouthphilly	1
#fucknewyork	1
#fuckyouphilly	1
#fuckmaryorkill	1
#fuckyounewjersey	1
#fuckjerseytransit	1
#fuckthejerseyshore	1
#ifuckinglovejersey	1
#jerseyshorefuckery	1
#fuckyounewyorkstate	1
#fuckyouphillypeople	1
</pre>
<p>France</p>
<pre class="brush: plain; title: ; notranslate">
#fuckfrance	13
#fuckyoufrance	1
</pre>
<p>Spain</p>
<pre class="brush: plain; title: ; notranslate">
#fuckspain	31
#fuckyouspain	4
#gofuckyourselfspain	3
#doublefuckspain	3
#teamfuckcaresboutspain	2
#teamfuckinspain	2
#fuckuspain	1
#fuckingspain	1
</pre>
<p>Work</p>
<pre class="brush: plain; title: ; notranslate">
#fuckwork	73
#whenthefuckamiworkingnextcauseireallyneedsomemoneyasap	6
#fuckyouwork	6
#teamfuckwork	5
#fuckfireworks	4
#fuckcoworkers	3
#fuckworking	3
#fuckbuyinfireworksbuybullets	3
#fuckingwork	3
#fuckworkofart	3
#fuckworktomorrow	3
#fuckyocoworker	2
#putthemfuckinheelsonandworkitgirl	2
#fuckhomework	2
#teamfucksleepgotoworktiredandstillgetthemoneyswag	2
#teamboredasfucktonightcauseireallydontcareaboutfireworks	2
#fuckgoingtowork	2
#fuckinwork	1
#fuckanetwork	1
#fuckfirworks	1
</pre>
<p>School</p>
<pre class="brush: plain; title: ; notranslate">
#fuckschool	80
#fucksummerschool	11
#teamfucksummerschool	10
#teamfuckschool	5
#schoolisfuckery	4
#fuckhighschool	2
#schoolfucksmylife	2
#fuckhighschoolconfessions	2
#fuckyouschool	2
#fuckkkschool	1
</pre>
<p>A great one suggested by my friend Ken</p>
<p>Police</p>
<pre class="brush: plain; title: ; notranslate">
#fuckthepolice	388
#fuckdapolice	102
#fuckthapolice	26
#teamfuckthepolice	8
#fuckpolice	4
#teamfuckdapolice	3
#fuckdapolicetweet	2
#fuckthepolice2010	2
#policesayfuckofftomedia	2
#fuck_the_police	1
#fuckthepolicex3	1
#fuckgrammarpolice	1
#fuckthapoliceyeah	1
</pre>
<p>Other observations:<br />
Many more mentions of math than science OR homework<br />
A few mentions of Lance but none of Contador</p>
<p>The full dataset can be downloaded <a href="http://www.neilkodner.com/fwordhashtags.txt">here</a>.  The top ten thousand most frequently occurring hashtags can be found <a href="http://www.neilkodner.com/toptenkfwordhashtags.txt">here</a>.</p>
<p>To-do: modify the pig script to catch variations in spelling the f-word, multiple u&#8217;s, etc.  Maybe even a visualization.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>A few quick observations on StackOverflow questions tagged R</title>
		<link>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/</link>
		<comments>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 15:10:42 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[rstats]]></category>
		<category><![CDATA[stackoverflow]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=59</guid>
		<description><![CDATA[While browsing through Pete Skomoroch&#8217;s delicious bookmarks (which is a full-time job in and of itself), I learned that StackOverflow.com makes their underlying q&#38;a data available. Just for fun, I wrote a few quick queries against this dataset, centered around the R tag. Here are a handful of findings &#8211; data is through 31-Oct-2009. Some of [...]]]></description>
			<content:encoded><![CDATA[<p>While browsing through <a href="http://www.datawrangling.com">Pete Skomoroch&#8217;s </a> <a href="http://delicious.com/pskomoroch/">delicious bookmarks</a> (which is a full-time job in and of itself), I learned that <a href="http://www.stackoverflow.com">StackOverflow.com</a> makes their underlying q&amp;a data <a href="http://blog.stackoverflow.com/2009/11/creative-commons-data-dump-nov-09/">available</a>.</p>
<p>Just for fun, I wrote a few quick queries against this dataset, centered around the R tag.  Here are a handful of findings &#8211; data is through 31-Oct-2009.  Some of this data is already presented in the StackOverflow site but bear with me here.</p>
<p>The most common tags associated with R are:<br />
statistics &#8211; 46<br />
ggplot2 &#8211; 20<br />
plot &#8211; 13<br />
graphics &#8211; 10<br />
vector &#8211; 9<br />
emacs &#8211; 8<br />
matrix &#8211; 8</p>
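<p>A tag co-occurrence count like the one above can be sketched in Python as follows. This is not the query that was actually run: it assumes each question&#8217;s tags arrive as a single string of angle-bracketed names such as <code>&lt;r&gt;&lt;ggplot2&gt;</code>, and the function name is made up for the example.</p>

```python
import re
from collections import Counter

# Pull the tag names out of a string like "<r><ggplot2><plot>".
TAG_RE = re.compile(r"<([^<>]+)>")


def cooccurring_tags(tag_strings, target="r"):
    """Count the tags that appear alongside `target` on the same question.

    tag_strings: iterable of per-question tag strings (assumed format,
                 see the lead-in above).
    Returns a Counter mapping tag name -> number of questions.
    """
    counts = Counter()
    for raw in tag_strings:
        tags = TAG_RE.findall(raw)
        if target in tags:
            counts.update(t for t in tags if t != target)
    return counts


if __name__ == "__main__":
    questions = ["<r><ggplot2>", "<r><statistics><plot>",
                 "<python><statistics>", "<r><statistics>"]
    for tag, n in cooccurring_tags(questions).most_common():
        print("{} - {}".format(tag, n))
```

<p>Questions that don&#8217;t carry the target tag are skipped entirely, so a tag such as statistics is only counted when it shares a question with R.</p>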
<p>We all know that Dirk, Shane, and Hadley lead the way in terms of questions answered, but who knew that chris_dubois leads the pack when it comes to answering their own questions, with 10?  </p>
<p>And finally, out of 20 posts totaling 32 answers <a href="http://stackoverflow.com/questions/tagged/ggplot2">tagged with ggplot2</a> (at the time), <a href="http://www.had.co.nz/">Hadley Wickham</a>, the <a href="http://www.had.co.nz/ggplot2/">package&#8217;s</a> author, has only contributed three answers.  The fact that the rest of the questions were answered by users speaks <strong>volumes</strong> about the community behind ggplot2. Excellent work, Hadley!</p>
<p>Here is my version of the leaderboard as of the end of October, 2009.</p>
<div id="attachment_63" class="wp-caption alignnone" style="width: 674px"><img src="http://www.neilkodner.com/wp-content/uploads/2009/11/r_october_leaderboard.jpg" alt="r stackoverflow leaderboard" title="r_october_leaderboard" width="664" height="1094" class="size-full wp-image-63" /><p class="wp-caption-text">r stackoverflow leaderboard</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

