<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>neilkodner.com</title>
	<atom:link href="http://www.neilkodner.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.neilkodner.com</link>
	<description>Oracle, Python, R, Data, Cycling/Multisport, you name it.</description>
	<lastBuildDate>Mon, 23 Aug 2010 18:38:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>performing a grouped top-n query in pig</title>
		<link>http://www.neilkodner.com/2010/08/performing-a-grouped-top-n-query-in-pig/</link>
		<comments>http://www.neilkodner.com/2010/08/performing-a-grouped-top-n-query-in-pig/#comments</comments>
		<pubDate>Mon, 23 Aug 2010 18:38:22 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=338</guid>
		<description><![CDATA[Over the course of generating a large item-item similarity matrix, I need to reduce the amount of data I&#8217;m returning to the calling program.  In short, I&#8217;m computing the similarity between over 20,000 different &#8216;items&#8217; and that results in a gigantic dataset, to the tune of about 3-4 million elements.  I now need to reduce [...]]]></description>
			<content:encoded><![CDATA[<p>Over the course of generating a large item-item similarity matrix, I need to reduce the amount of data I&#8217;m returning to the calling program.  In short, I&#8217;m computing the similarity between over 20,000 different &#8216;items&#8217; and that results in a gigantic dataset, to the tune of about 3-4 million elements.  I now need to reduce my dataset down to the nearest neighbors for each item and prune irrelevant data.</p>
<p>The real problem is:</p>
<p>Given a list of 20,000 items, where each row pairs an item with an &#8216;other&#8217; item and the Jaccard/Tanimoto similarity between the two, show me the k closest items for each item in my list.</p>
<pre class="brush: plain;">

1   3   .00321
1   4   .00256
1   5   .01019
1   6   .00732
2   1   .02136
....
</pre>
<p>I thought of doing this in pig but wasn&#8217;t really sure how to limit and sort grouped data.  I submitted a question to <a href="http://hadoop.apache.org/pig/mailing_lists.html#Users">pig-user</a> and the members were helpful.  Since I learned a new trick, I thought I&#8217;d document it here in case anyone else is looking to do the same.</p>
<p>Rather than bore everyone with item ids and similarity scores (and violating an NDA and losing lots of friends), I&#8217;ll use example data from one of Oracle&#8217;s demo tables, emp:</p>
<pre class="brush: plain;">

SQL&gt; select empno,ename,job,sal,deptno from emp order by deptno,sal desc;

EMPNO ENAME      JOB              SAL     DEPTNO
---------- ---------- --------- ---------- ----------
7839 KING       PRESIDENT       5000         10
7782 CLARK      MANAGER         2450         10
7934 MILLER     CLERK           1300         10
7788 SCOTT      ANALYST         3000         20
7902 FORD       ANALYST         3000         20
7566 JONES      MANAGER         2975         20
7876 ADAMS      CLERK           1100         20
7369 SMITH      CLERK            800         20
7698 BLAKE      MANAGER         2850         30
7499 ALLEN      SALESMAN        1600         30
7844 TURNER     SALESMAN        1500         30
7654 MARTIN     SALESMAN        1250         30
7521 WARD       SALESMAN        1250         30
7900 JAMES      CLERK            950         30
</pre>
<p>From here, I want to limit my data set to the employees with the three highest salaries in each department.  The part that was foreign to me was running multiple statements on my data through each iteration of a FOREACH command.  In my brief career using hadoop and pig, I&#8217;ve never paid much attention to grouping commands together.  After reading pig-user and also looking at some of <a href="http://www.twitter.com/thedatachef">@TheDataChef</a>&#8217;s examples in <a href="http://github.com/Ganglion/sounder">sounder</a>, I now recognize the value in doing so.</p>
<p>The code to generate the top-n (in this case, top 3) salaries per department:<br />
<script src="http://gist.github.com/546013.js?file=top3salariesbydepartment.pig"></script></p>
<p>The output produced :</p>
<pre class="brush: plain;">
(5000,KING,7839,10)
(2450,CLARK,7782,10)
(1300,MILLER,7934,10)
(3000,SCOTT,7788,20)
(3000,FORD,7902,20)
(2975,JONES,7566,20)
(2850,BLAKE,7698,30)
(1600,ALLEN,7499,30)
(1500,TURNER,7844,30)
</pre>
<p>This example is easily extended to solve my original problem: reducing the number of similar items found for each item.  All I&#8217;d need to do is group by the first itemid, sort the items in descending order by their Jaccard/Tanimoto score, and then limit to the top-n highest-scoring items for each original itemid.</p>
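<p>For readers without pig handy, here is a rough Python sketch of the same group, sort, and limit idea.  It uses the five sample rows shown at the top of this post; the function name top_n_per_item and the choice of n=2 are made up for illustration and are not part of the pig script.</p>

```python
from collections import defaultdict

# (item, other_item, similarity) rows, copied from the sample at the top of the post
rows = [
    (1, 3, 0.00321),
    (1, 4, 0.00256),
    (1, 5, 0.01019),
    (1, 6, 0.00732),
    (2, 1, 0.02136),
]

def top_n_per_item(rows, n):
    """Group by the first item id, sort by score descending, keep n per group."""
    groups = defaultdict(list)
    for item, other, score in rows:
        groups[item].append((other, score))
    result = {}
    for item, neighbors in groups.items():
        # sorted() is stable, so ties keep their input order
        ranked = sorted(neighbors, key=lambda pair: pair[1], reverse=True)
        result[item] = ranked[:n]
    return result

nearest = top_n_per_item(rows, 2)
for item in sorted(nearest):
    print(item, nearest[item])
```

<p>With n=2, item 1 keeps items 5 and 6 as its closest neighbors and item 2 keeps its single row.</p>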
<p>Now that my data will be sufficiently limited after all of the similarity scores have been calculated, I can let this process run, generate lots of scores, and not worry about polluting my database with extraneous data.  Now time to focus on generating item-item similarity without generating so much useless data in the first place!  Would love to hear your suggestions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/08/performing-a-grouped-top-n-query-in-pig/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Twifficiency scores, analyzed and visualized</title>
		<link>http://www.neilkodner.com/2010/08/twifficiency-scores-analyzed-and-visualized/</link>
		<comments>http://www.neilkodner.com/2010/08/twifficiency-scores-analyzed-and-visualized/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 11:40:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=311</guid>
		<description><![CDATA[While I&#8217;ve had some success getting a few celebrities (Fred Durst, Talib Kweli) to respond to or show off @TheBotLebowski, yesterday Twifficiency one-upped me and took twitter and then the national media by storm.  Fortunately for you, @jamescun, not too many people I know read your little Time Magazine. (I really hope you&#8217;re old [...]]]></description>
			<content:encoded><![CDATA[<p>While I&#8217;ve had some success getting a few celebrities (<a href="http://twitter.com/freddurst/statuses/11238246156">Fred Durst</a>, <a href="http://twitter.com/RealTalibKweli/status/12212218052">Talib Kweli</a>) to respond to or show off <a href="http://www.twitter.com/thebotlebowski">@TheBotLebowski</a>, yesterday <a href="http://www.twifficiency.com">Twifficiency</a> one-upped me and took twitter and then the national media by storm.  Fortunately for you, <a href="http://www.twitter.com/jamescun">@jamescun</a>, not too many people I know read your little <a href="http://newsfeed.time.com/2010/08/17/twifficiency-by-james-cunningham-better-than-a-college-diploma/">Time Magazine</a>. (I really hope you&#8217;re old enough to get that!)</p>
<p>As some of you may or may not know, I aggregate Twitter data and then use tools such as python, hadoop, pig, and R to play with the results.  Today&#8217;s task was easy &#8211; look through yesterday&#8217;s tweets, grab the Twifficiency auto-tweets (eeew), extract the scores, and then see if there are any interesting results.</p>
<p>After all of yesterday&#8217;s tweets ran through my <a href="http://github.com/neilkod/tweetParser">parser</a>, I filtered the input data down to tweets that looked like Twifficiency scores (20100817.txt is the file of yesterday&#8217;s parsed tweets that gets loaded into HDFS):</p>
<pre class="brush: plain;">grep &quot;My Twifficiency score is [0-9]*%. Whats yours? http://twifficiency.com/$&quot; 20100817.txt &gt;twif.out</pre>
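<p>For anyone who would rather stay in Python, here is a rough equivalent of the grep filter above.  It is only a sketch: the helper name and the two sample tweets are invented for illustration, and the pattern assumes the auto-tweet text is exactly as shown in the grep command.</p>

```python
import re

# same shape as the grep pattern, with the score captured as a group
TWIF = re.compile(r"My Twifficiency score is ([0-9]+)%\. Whats yours\? http://twifficiency\.com/$")

def twifficiency_score(tweet):
    """Return the score as an int, or None if this is not an auto-tweet."""
    match = TWIF.search(tweet)
    return int(match.group(1)) if match else None

# two made-up tweets, not real data
print(twifficiency_score("My Twifficiency score is 39%. Whats yours? http://twifficiency.com/"))
print(twifficiency_score("just had coffee"))
```

<p>Capturing the digits in a group means the score comes out of the same match that does the filtering, so there is no second regex to keep in sync.</p>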
<p>I loaded twif.out through some horrible-before-the-coffee-is-even-made python code to produce a few summary statistics and then a file containing just the raw twifficiency scores:</p>
<p><span id="more-311"></span></p>
<pre class="brush: python;">
#!/usr/bin/python
import re
import numpy

# one counter per possible score, 0 through 100 inclusive
scores = dict((i, 0) for i in range(101))
scorelist = []

# read the grep output and write the bare scores, one per line
with open('twif.out') as tweets, open('scores.txt', 'w') as f:
  for line in tweets:
    (id, ts, user, tweet) = line.strip().split('\t')

    # extract the numeric score that precedes the percent sign
    thescore = re.search('([0-9]+)%', tweet)
    scoreval = int(thescore.group(1))
    scorelist.append(scoreval)

    # eventually do something with the timestamp, etc.
    # print('%s\t%s\t%s' % (user, ts, scoreval))
    scores[scoreval] += 1

    # write the score to the raw data file
    f.write('%s\n' % scoreval)

# summary statistics: count, standard deviation, mean
a = numpy.array(scorelist)
print(numpy.size(a))
print(numpy.std(a))
print(numpy.average(a))
</pre>
<p>Now, the good stuff.  Out of 7,089 twifficiency tweets from yesterday (the gardenhose has been severely throttled lately), the scores range from 0 to 99.  I think I remember @jamescun mentioning the max score is 100%, but I haven&#8217;t seen it yet.  The mean Twifficiency score is 38.5285 and the standard deviation is 11.1036.  The score at the 25th percentile is 32, the median score is 39, and the score at the 75th percentile is 47.</p>
<p>After loading into R and plotting a histogram, the scores seem to follow a pretty normal distribution (<strong>update</strong>: they don&#8217;t; check out <a href="http://www.twitter.com/johnmyleswhite">@johnmyleswhite</a>&#8217;s comment below):</p>
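<p>Summary numbers like these (mean, standard deviation, quartiles) take only a few lines to compute.  Here is a minimal sketch using Python&#8217;s standard statistics module (quantiles needs Python 3.8 or later); the five scores are invented sample values, not the real data set.</p>

```python
import statistics

# five made-up scores for illustration only; the real file has 7,089
sample = [25, 32, 39, 47, 50]

mean = statistics.mean(sample)
# population standard deviation, the same quantity numpy.std reports by default
spread = statistics.pstdev(sample)
# method="inclusive" interpolates linearly between the min and max of the sample
q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")

print("mean", mean)
print("std ", round(spread, 4))
print("quartiles", q1, median, q3)
```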
<div id="attachment_322" class="wp-caption alignnone" style="width: 730px"><a href="http://www.neilkodner.com/wp-content/uploads/2010/08/twifficiencyfreq.png"><img class="size-full wp-image-322 " title="twifficiencyfreq" src="http://www.neilkodner.com/wp-content/uploads/2010/08/twifficiencyfreq.png" alt="" width="720" height="587" /></a><p class="wp-caption-text">Distribution of Twifficiency scores</p></div>
<p>Although I consider myself a prominent auto-tweeter (<a href="http://www.twitter.com/thebotlebowski">@TheBotLebowski</a> <a href="http://www.twitter.com/hellooooonewman">@HelloooooNewman</a> <a href="http://www.twitter.com/acenterforants">@ACenterForAnts</a>), I&#8217;m not crazy about the idea of having an app send a tweet without permission.  Let&#8217;s hope @jamescun fixes this or at least explains it more clearly.  Having said that, I love seeing this type of story and wish James the best even if he is emo (just kidding, James!!).</p>
<p>edit: the raw twifficiency score data may be found <a href="http://www.neilkodner.com/twifficiencyscores.txt">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/08/twifficiency-scores-analyzed-and-visualized/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/</link>
		<comments>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 16:58:55 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hashtag]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=287</guid>
		<description><![CDATA[Through the magic of hadoop, pig, over 300 million (and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I&#8217;d take a look at all of the hashtags containing the beloved f-word. Let&#8217;s get the technical details out of the way.  Since the middle of June, I&#8217;ve been saving as many tweets [...]]]></description>
			<content:encoded><![CDATA[<p>Through the magic of <a href="http://hadoop.apache.org/">hadoop</a>, <a href="http://hadoop.apache.org/pig">pig</a>, over 300 million (and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I&#8217;d take a look at all of the hashtags containing the beloved f-word.</p>
<p>Let&#8217;s get the technical details out of the way.  Since the middle of June, I&#8217;ve been saving as many tweets as I can to local storage, using Twitter&#8217;s streaming API and my gardenhose access.  Sorry, <a href="http://www.cloudera.com">Cloudera</a> guys, I&#8217;m not yet using <a href="http://github.com/cloudera/flume">flume</a>, but it&#8217;s high on the to-do list.  Using a 3-node cluster, I&#8217;m able to search through these tweets and extract valuable(?) data in a matter of minutes.</p>
<p>The pig script (sorry, looks like gist.github.com doesn&#8217;t auto-format pig):<br />
<script src="http://gist.github.com/515641.js"></script></p>
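<p>For readers without a cluster, here is a small Python sketch of the same general idea: pull hashtags out of tweet text, lowercase them, keep the ones containing a given substring, and count.  It is not the pig script itself; the function name, the sample tweets, and the search term are all made up for illustration.</p>

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def count_hashtags(tweets, needle):
    """Count lowercased hashtags that contain the substring `needle`."""
    counts = Counter()
    for text in tweets:
        for tag in HASHTAG.findall(text.lower()):
            if needle in tag:
                counts[tag] += 1
    return counts

# invented sample tweets, not taken from the real data set
tweets = [
    "Game seven tonight #GoLakers",
    "what a finish #golakers #nba",
    "#TeamLakers all day",
]
for tag, n in count_hashtags(tweets, "lakers").most_common():
    print(tag, n)
```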
<p>And now the fun stuff.  I found over 31,000 different hashtags containing the f-word.  Bonus to the first person who can tell me what GFW is.<br />
The top-ten results and the frequency of their mentions are:</p>
<pre class="brush: plain;">
#fuck	10406
#fuckouttahere	3172
#fuckinfollow	3062
#fuckit	2970
#fuckyou	2303
#fuckgfw	1573
#fuckyeah	1551
#fuckery	1436
#fucking	1273
#fuckoff	988
</pre>
<p>Let&#8217;s move on to what&#8217;s really important, namely celebrities, sports figures, and other important American topics:</p>
<p>Lady Gaga</p>
<pre class="brush: plain;">
#dontfuckwiththegaga	11
#fuckthegagahaters	4
#fuckgaga	3
#fuckladygaga	3
#dontfuckwithgaga	2
#fuckmegaga	2
#fucktrannygaga	1
</pre>
<p><span id="more-287"></span></p>
<p>Obama</p>
<pre class="brush: plain;">
#fuckobama	7
#dontfuckwithobama	2
#fuckyouobama	1
#fuckbarackobama	1
#fucking_twat_obama	1
#obamabrieflymadefuckedmeover	1
</pre>
<p>taxes</p>
<pre class="brush: plain;">
#fucktaxes	5
#blackpeopleneverpaybillsfuckinuptaxescredit	1
#fuckingtaxes	1
</pre>
<p>The NY Yankees</p>
<pre class="brush: plain;">
#fucktheyankees	31
#fuckyankees	3
#fuckdayankees	1
#fuckyouyankees	1
</pre>
<p>The Red Sox</p>
<pre class="brush: plain;">
#fucktheredsox	2
#fuckredsox	1
#fuckredsoxfans	1
#fucktheredsoxs	1
#ifuckinghaaaateredsoxanddavidortizandhisslowfatassshouldveputarodinmaybewouldvewon	1
#fucktheredsoxandanybodywhoplaysforthem	1
</pre>
<p>Lakers</p>
<pre class="brush: plain;">
#fuckthelakers	279
#fucklakers	65
#teamfuckthelakers	48
#fuckdalakers	39
#fuckteamlakers	14
#teamfuckdalakers	9
#teamfuckinglakers	8
#teamfucklakers	7
#fuckyoulakers	6
...
#itsstillfucklakersalldayeverydaytillkoberetires	1
#teamidontgiveafuckaboutlakersorcelticskickrockd	1
#fuckdaflakers	1
#fuckyealakers	1
#fuckalakersfan	1
#fucklakersssss	1
#fucktheflakers	1
#fuckyourlakers	1
#fuckeverylakersfanonthegotdamnplanetcuztheyaintshitforeal	1
</pre>
<p>Celtics</p>
<pre class="brush: plain;">
#fucktheceltics	43
#fuckceltics	39
#teamfuckceltics	16
#fuckteamceltics	8
#teamfucktheceltics	5
#teammmmfuckingceltics	4
#fuckdaceltics	3
#teamfuckthecelticsandlakers	3
#fuckthemceltics	3
#fuckyouceltics	2
...
#fuckthelakersandceltics	1
#fuckyourfeelingsceltics	1
#teamcelticsallfuckingday	1
#fuckthecelticsandthehaters	1
#teamifuckinghatetheceltics	1
#fuckthelakersfucktheceltics	1
#teamfuckthecelticswith10dicks	1
</pre>
<p>Lebron</p>
<pre class="brush: plain;">
#fucklebron	366
#teamfucklebron	45
#fucklebronjames	32
#fuckyoulebron	18
#fucklebronandhisdecision	16
#fuckalebron	4
#teamfucklebronjames	4
#newyorksaysfucklebron	4
#fuckyoutolebron	3
#fucklebronforlife	2
#fucklebronbitchass	2

(many more)
...
#teamgetthefuckofflebrondickhalfofyalgotsummerschooldoyourhomeworkyoudickriders	1
#fuckouttaherelebrons	1
#fucklebronheabitchassniggaheisnotarealmanfaggotassbitchassdickridinassthatswhyhesecondtodwade	1
</pre>
<p>Haters</p>
<pre class="brush: plain;">
#fuckthehaters	67
#fuckhaters	25
#fuckyouhaters	15
#fuckinhaters	7
#fuckjlshaters	6
#fuckdahaters	6
#demihatersfuck	4
#fuckdemihaters	4
</pre>
<p>Snitches</p>
<pre class="brush: plain;">
#fucksnitches	3
#fuckinsnitches	1
</pre>
<p>and finally, who could forget J-Bieb</p>
<pre class="brush: plain;">
#fuckyoubieberisafag	126
#dontfuckwithjustinbieber	105
#fuckjustinbieber	10
#fuckbieber	10
#bieberisafagshouldshutthefuckup	6
#fuckyoubieber	6
#teamfuckbieber	5
#fuckoffbieberarmy	5
#dontfuckwithbieber	5
#biebersnewhaircutisfuckinsexysostfuitsjusthairitwillgrowbackgetafuckinlifebitches	4
#whothefuckisjustinbieber	3
#fuckingunfollowbieberarmy	3
#dontfuckwithjustinbieberslegalbeliebers	3
#ifuckbieber	2
(many many more....)
#fuckyeahjustinbiebermix	1
#fuckinunfollowbieberarmy	1
#ohmyfuckinjustinbiebergasm	1
#fuckthattinylittlebieberfag	1
#fuckyoubiebertyzasranyklamco	1
#justinbieberisafuckingpussyshit	1
#biebersafagshouldgetafuckinglife	1
#fuckyouallthehatersofjustinbieber	1
#ilovejustindrewbieberfuckwatucare	1
#fuckjustinbieberbringfalloutboyback	1
#whothefuckisstilltrendingjustinbieber	1
#youstupidbieberhatersneedafuckinglife	1
#fuckjustinbieberinhisstupidlookingbangs	1
</pre>
<p>And how about those who can&#8217;t spell Bieber?</p>
<pre class="brush: plain;">
#fuckbeiber	1
#fuckyoubeiber	1
#teamfuckbeiber	1
#fuckjustinbeiber	1
#dontfuckwithjustinbeiber	1
#fuckinwiththatjustbeiber	1
#justinbeibershouldgofuckhimself	1
</pre>
<p>We&#8217;ve got geographic locations covered as well<br />
Jersey/New York/Philly</p>
<pre class="brush: plain;">
#fuckjerseyshore	22
#fuckphilly	7
#fucknewjersey	4
#newyorksaysfucklebron	4
#fuckthegirlswhomadejustincriedandrolledoffbackstageinnewjersey	3
#teamfuckjerseyshore	3
#jerseyfuckingshore	2
#ifuckinglovenewyork	2
#jerseyshoreisfulloffucks	2
#fuckmarryorkill	2
#fuckjersey	1
#fatitalianmufuckawitthebeardfromsouthphilly	1
#fucknewyork	1
#fuckyouphilly	1
#fuckmaryorkill	1
#fuckyounewjersey	1
#fuckjerseytransit	1
#fuckthejerseyshore	1
#ifuckinglovejersey	1
#jerseyshorefuckery	1
#fuckyounewyorkstate	1
#fuckyouphillypeople	1
</pre>
<p>France</p>
<pre class="brush: plain;">
#fuckfrance	13
#fuckyoufrance	1
</pre>
<p>Spain</p>
<pre class="brush: plain;">
#fuckspain	31
#fuckyouspain	4
#gofuckyourselfspain	3
#doublefuckspain	3
#teamfuckcaresboutspain	2
#teamfuckinspain	2
#fuckuspain	1
#fuckingspain	1
</pre>
<p>Work</p>
<pre class="brush: plain;">
#fuckwork	73
#whenthefuckamiworkingnextcauseireallyneedsomemoneyasap	6
#fuckyouwork	6
#teamfuckwork	5
#fuckfireworks	4
#fuckcoworkers	3
#fuckworking	3
#fuckbuyinfireworksbuybullets	3
#fuckingwork	3
#fuckworkofart	3
#fuckworktomorrow	3
#fuckyocoworker	2
#putthemfuckinheelsonandworkitgirl	2
#fuckhomework	2
#teamfucksleepgotoworktiredandstillgetthemoneyswag	2
#teamboredasfucktonightcauseireallydontcareaboutfireworks	2
#fuckgoingtowork	2
#fuckinwork	1
#fuckanetwork	1
#fuckfirworks	1
</pre>
<p>School</p>
<pre class="brush: plain;">
#fuckschool	80
#fucksummerschool	11
#teamfucksummerschool	10
#teamfuckschool	5
#schoolisfuckery	4
#fuckhighschool	2
#schoolfucksmylife	2
#fuckhighschoolconfessions	2
#fuckyouschool	2
#fuckkkschool	1
</pre>
<p>A great one suggested by my friend Ken</p>
<p>Police</p>
<pre class="brush: plain;">
#fuckthepolice	388
#fuckdapolice	102
#fuckthapolice	26
#teamfuckthepolice	8
#fuckpolice	4
#teamfuckdapolice	3
#fuckdapolicetweet	2
#fuckthepolice2010	2
#policesayfuckofftomedia	2
#fuck_the_police	1
#fuckthepolicex3	1
#fuckgrammarpolice	1
#fuckthapoliceyeah	1
</pre>
<p>Other observations:<br />
Many more mentions of math than science OR homework<br />
A few mentions of Lance but none of Contador</p>
<p>The full dataset can be downloaded <a href="http://www.neilkodner.com/fwordhashtags.txt">here</a>.  The top ten thousand most frequently occurring hashtags can be found <a href="http://www.neilkodner.com/toptenkfwordhashtags.txt">here</a>.</p>
<p>To-do: modify the pig script to catch spelling variants of the f-word, such as multiple u&#8217;s.  Maybe even a visualization.</p>
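<p>One possible way to approach that to-do, sketched in Python rather than pig: build a regular expression that lets every letter of a word repeat.  The helper name is hypothetical, and the neutral example word is only there to keep the sample readable.</p>

```python
import re

def stretchy_pattern(word):
    """Build a regex in which each letter of `word` may repeat one or more times."""
    return re.compile("".join(re.escape(ch) + "+" for ch in word), re.IGNORECASE)

pattern = stretchy_pattern("cool")
# invented hashtags for illustration
for tag in ["#cool", "#coooool", "#CoolStory", "#col"]:
    print(tag, bool(pattern.search(tag)))
```

<p>This only handles repeated letters; other misspellings (swapped or dropped letters) would need a different approach, such as an edit-distance check.</p>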
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Meta:  Please excuse the ads!</title>
		<link>http://www.neilkodner.com/2010/07/meta-please-excuse-the-ads/</link>
		<comments>http://www.neilkodner.com/2010/07/meta-please-excuse-the-ads/#comments</comments>
		<pubDate>Thu, 22 Jul 2010 11:51:09 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=284</guid>
		<description><![CDATA[As part of testing for another project, I&#8217;m experimenting with Google Ads on the site. They shouldn&#8217;t be around for more than a day or two.]]></description>
			<content:encoded><![CDATA[<p>As part of testing for another project, I&#8217;m experimenting with Google Ads on the site.  They shouldn&#8217;t be around for more than a day or two.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/07/meta-please-excuse-the-ads/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>RT @MyloGang: IT WAS JUST A FUCKIN EARTHQUAKE WTF</title>
		<link>http://www.neilkodner.com/2010/07/rt-mylogang-it-was-just-a-fuckin-earthquake-wtf/</link>
		<comments>http://www.neilkodner.com/2010/07/rt-mylogang-it-was-just-a-fuckin-earthquake-wtf/#comments</comments>
		<pubDate>Fri, 16 Jul 2010 11:40:52 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=280</guid>
		<description><![CDATA[I woke up to @wahalulu asking me to check my tweets for mentions of the DC Earthquake and am happy to oblige:  I&#8217;ll re-run my numbers throughout the day, but here are the mentions since about 7:30 this morning EST.    Raw data here.]]></description>
			<content:encoded><![CDATA[<p>I woke up to <a href="http://twitter.com/wahalulu/">@wahalulu</a> asking me to <a href="http://twitter.com/wahalulu/statuses/18673565690">check my tweets for mentions of the DC Earthquake</a> and am happy to oblige:  I&#8217;ll re-run my numbers throughout the day, but here are the mentions since about 7:30 this morning EST.  Raw data <a href="http://www.neilkodner.com/earthquaketweets.txt">here</a>.</p>
<p><a href="http://www.neilkodner.com/images/littlesnapper/dc_earthquake_wordle.png"><img class="alignnone" title="earthquake tweets" src="http://www.neilkodner.com/images/littlesnapper/dc_earthquake_wordle.png" alt="earthquake" width="1179" height="701" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/07/rt-mylogang-it-was-just-a-fuckin-earthquake-wtf/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>World Cup Country hashtag mentions through 190gb of tweets</title>
		<link>http://www.neilkodner.com/2010/06/world-cup-country-hashtag-mentions-through-190gb-of-tweets/</link>
		<comments>http://www.neilkodner.com/2010/06/world-cup-country-hashtag-mentions-through-190gb-of-tweets/#comments</comments>
		<pubDate>Mon, 28 Jun 2010 15:41:35 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=266</guid>
		<description><![CDATA[Since the beginning of the 2010 World Cup, I&#8217;ve been saving tweets from the twitter gardenhose and trying to find interesting things in the data.  Here is a histogram showing the count of mentions for each country&#8217;s hashtag.  With apologies for my lack of effort in ggplot2: Even though my raw data was sorted by [...]]]></description>
			<content:encoded><![CDATA[<p>Since the beginning of the 2010 World Cup, I&#8217;ve been saving tweets from the twitter gardenhose and trying to find interesting things in the data.  Here is a histogram showing the count of mentions for each country&#8217;s hashtag.  With apologies for my lack of effort in ggplot2:</p>
<p><a href="http://www.neilkodner.com/images/littlesnapper/country_hashtags.png"><img class="alignnone" title="World Cup country hashtags" src="http://www.neilkodner.com/images/littlesnapper/country_hashtags.png" alt="" width="1145" height="650" /></a></p>
<p>Even though my raw data was sorted by the counts, it appears that the default behavior of ggplot2 (or at least qplot) is an alphabetical sort.  Maybe one of you could help me with this.  Source data is below.</p>
<p>Since my magnificent 2-node hadoop cluster, consisting of my MacBook Pro, an old beat-up MacBook, and a wireless connection, isn&#8217;t quite mighty enough, I generated these numbers the old-fashioned way: through the command line.  I&#8217;m sitting on too much unprocessed data to send to Amazon S3 for EMR.  After I preprocess the tweets, the size will shrink drastically and I can then send the data to Amazon for further processing.</p>
<pre class="brush: bash;">
cat *.json|grep -iPo &quot;#(usa|mex|hon|bra|par|chi|arg|uru|alg|civ|gha|nga|cmr|rsa|prk|jpn|kor|aus|nzl|eng|fra|esp|por|ned|den|ger|sui|ita|svk|svn|srb|gre)\b&quot;|tr '[:upper:]' '[:lower:]'|sort|uniq -c|sort -rg
</pre>
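<p>The same tally can be sketched in Python, which may be easier to extend than the one-liner.  This is an illustration under assumptions, not what I actually ran: the function name and the sample lines are invented, and it reads plain text lines rather than parsing the JSON.</p>

```python
import re
from collections import Counter

# the same 32 country codes as in the grep pattern above
CODES = ("usa mex hon bra par chi arg uru alg civ gha nga cmr rsa prk jpn "
         "kor aus nzl eng fra esp por ned den ger sui ita svk svn srb gre").split()
TAG = re.compile(r"#(" + "|".join(CODES) + r")\b", re.IGNORECASE)

def count_country_tags(lines):
    """Tally lowercased country hashtags across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for code in TAG.findall(line):
            counts[code.lower()] += 1
    return counts

# invented sample lines standing in for the raw JSON files
sample = ["Vamos! #BRA #bra", "come on #ENG", "#brazil does not match", "#usa! #USA"]
for code, n in count_country_tags(sample).most_common():
    print(code, n)
```

<p>As in the grep version, the trailing word boundary keeps longer tags such as #brazil from being counted as #bra.</p>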
<pre><span id="more-266"></span>
<pre class="brush: plain;">
country count
bra 517577
arg 157524
usa 153661
mex 144108
eng 126073
ger 123713
esp 86458
chi 85301
jpn 79508
por 72340
ita 52830
gha 52193
uru 49694
kor 39671
civ 36957
ned 33231
fra 31511
prk 29758
rsa 24705
sui 24194
nzl 21690
den 21273
hon 20621
par 18518
aus 16247
gre 16243
srb 14552
alg 14017
svk 12867
cmr 12864
svn 12350
nga 10224
</pre>
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/06/world-cup-country-hashtag-mentions-through-190gb-of-tweets/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Words mentioned in 23-Jun-2010 Canadian Earthquake tweets</title>
		<link>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/</link>
		<comments>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/#comments</comments>
		<pubDate>Thu, 24 Jun 2010 16:15:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[earthquake]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=256</guid>
		<description><![CDATA[Using twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping, some reducing, and voila! The most frequently-occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words. [...]]]></description>
			<content:encoded><![CDATA[<div class="wp-caption alignnone" style="width: 1011px"><img class=" " title="words mentioned in earthquake tweets 23-jun-2010" src="http://www.neilkodner.com/images/littlesnapper/words%20mentioned%20in%20yesterdays%20earthquake%20tweets.png" alt="" width="1001" height="599" /><p class="wp-caption-text">words mentioned in earthquake tweets 23-jun-2010 </p></div>
<p>Using twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping, some reducing, and voila! The most frequently-occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words.  I find it amazing that the most frequently occurring &#8216;word&#8217; is RT.</p>
<p>Also, wordle seemed to strip out numeric &#8216;words&#8217;, which is a shame because people tweeted the magnitude left and right.  See the data below for the most frequent words and their counts.</p>
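<p>The recipe itself (strip punctuation, drop stopwords, map, reduce) fits in a few lines of Python.  This is a toy sketch rather than the job I ran: the stopword list is a tiny hypothetical one, the function names are made up, and the two tweets are invented.</p>

```python
import string
from collections import Counter

# a tiny, hypothetical stopword list; a real run would use a much longer one
STOPWORDS = {"a", "an", "the", "was", "that", "in", "i", "just"}

def map_words(tweet):
    """Map step: emit (word, 1) for each non-stopword after stripping punctuation."""
    for raw in tweet.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOPWORDS:
            yield word, 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

# invented tweets for illustration
tweets = ["Was that an earthquake?!", "RT: just felt the earthquake in Toronto"]
pairs = (pair for tweet in tweets for pair in map_words(tweet))
for word, count in reduce_counts(pairs).most_common():
    print(word, count)
```

<p>Stripping punctuation only from the ends of each token keeps magnitudes such as 5.5 intact as words.</p>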
<p><span id="more-256"></span></p>
<pre>
<div id="_mcePaste">
<pre>
survey:79
hey:79
seriously:80
preliminary:81
info:81
strikes:81
hell:82
4.5:82
gta:82
geological:83
magnitude50:83
check:83
call:84
service:86
globeandmail:86
triggered:87
experience:88
video:89
earthquakes:90
guess:91
caused:92
pm:92
fuck:93
ground:94
bad:94
5.7:94
move:96
minutes:98
2.3:98
damn:99
mini:99
eastern:100
scared:100
will:100
philippec:101
cool:103
northern:104
live:106
struck:107
city:112
work:112
floor:113
epicenter:113
pretty:113
nyc:113
seconds:113
todays:113
afternoon:114
feeling:116
safe:116
haha:117
central:117
tremor:117
whoa:118
downtown:120
rattles:121
warning:121
damage:122
separating:125
guys:126
god:126
tweets:129
heard:129
tremors:130
north:131
rochester:131
earth:134
small:135
california:136
missed:137
cp24:138
finally:139
minor:139
fake:141
scary:141
detroit:143
big:146
breaking:147
coming:151
good:152
weird:156
area:159
experienced:161
lake:161
buffalo:163
happened:163
2010:164
survived:168
hope:172
thing:172
evacuated:173
canadian:176
cleveland:176
tsunami:177
region:178
office:182
shake:184
reported:185
ontarioquebec:187
buildings:193
hits:194
ago:200
house:200
york:200
ohio:205
going:207
border:213
shit:214
usgs:215
quake:216
michigan:217
5.0:222
omg:238
time:240
reports:241
holy:241
southern:242
shook:249
twitter:252
day:253
crazy:255
shakes:270
people:276
wtf:308
building:309
ny:355
thought:362
tornado:375
montreal:375
hit:388
lol:406
quebec:414
wow:418
shaking:423
news:433
g20:490
today:577
magnitude:612
ontario:781
5.5:988
ottawa:1046
canada:1373
feel:1439
toronto:2086
felt:2146
rt:4046
earthquake:14918</pre>
</div>
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/06/words-mentioned-in-23-jun-2010-earthquake-tweets/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A free, simple way to backup and search your tweets</title>
		<link>http://www.neilkodner.com/2010/06/a-free-simple-way-to-backup-and-search-your-tweets/</link>
		<comments>http://www.neilkodner.com/2010/06/a-free-simple-way-to-backup-and-search-your-tweets/#comments</comments>
		<pubDate>Wed, 16 Jun 2010 17:35:23 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=212</guid>
		<description><![CDATA[Edit: I&#8217;m not sure why wordpress decided to resize some of my pictures, please let me know how to avoid this. It&#8217;s not like Twitter is going to go all Ma.gnolia on us and lose all of our data but here&#8217;s a way to back up and search through your prior tweets. I&#8217;ve been using [...]]]></description>
			<content:encoded><![CDATA[<p>Edit: I&#8217;m not sure why wordpress decided to resize some of my pictures, please <a href="http://www.twitter.com/neilkod">let me know</a> how to avoid this.</p>
<p>It&#8217;s not like Twitter is going to <a href="http://www.wired.com/epicenter/2009/01/magnolia-suffer/">go all Ma.gnolia on us and lose all of our data</a>, but here&#8217;s a way to back up and search through your prior tweets. I&#8217;ve been using this since September of last year and it seems to work really well.  You can even save others&#8217; tweets, provided their timeline is public.</p>
<p><a title="Yahoo Alerts" href="http://alerts.yahoo.com/">Yahoo Alerts</a>, a free email and SMS notification service, has a useful option to send an alert whenever an RSS feed is updated.  Not surprisingly, each of our Twitter streams has its own RSS feed.  Combining the two, we can create a process that, behind the scenes, sends each tweet to our Gmail (or wherever) accounts.</p>
<p>First, a new alert has to be created.  Make sure you choose <strong>Feed/Blog</strong> as the alert type:</p>
<p><a href="http://www.neilkodner.com/images/littlesnapper/create%20feed/Blog%20alert.png"><img class="alignnone" title="Creating a blog alert Feed" src="http://www.neilkodner.com/images/littlesnapper/create%20feed/Blog%20alert.png" alt="" width="832" height="521" /></a></p>
<p>Then, for the feed&#8217;s URL, enter</p>
<pre class="brush: plain;">http://twitter.com/statuses/user_timeline/neilkod.rss</pre>
<p><strong>replacing my Twitter username (neilkod) with your own.</strong> I chose to have the alert sent to my Gmail account and use the <strong>+twitterbackup suffix</strong> as an identifier.  I chose to have the alerts sent as they&#8217;re created.</p>
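<p>If you want to generate the two strings for several accounts, here is a minimal Python sketch.  It only follows the URL pattern and the plus-suffix idea described above; the function names and the example.com address are made up for illustration.</p>

```python
from urllib.parse import quote


def timeline_feed_url(username):
    """Build the timeline RSS URL using the pattern shown in this post."""
    return "http://twitter.com/statuses/user_timeline/" + quote(username, safe="") + ".rss"


def tagged_address(address, tag="twitterbackup"):
    """Insert +tag before the @ so a mail filter can match on it."""
    local, domain = address.split("@", 1)
    return local + "+" + tag + "@" + domain


# Hypothetical address; substitute your own account.
print(timeline_feed_url("neilkod"))
print(tagged_address("you@example.com"))
```

<p>The first value goes into the alert&#8217;s feed URL field and the second into its delivery address.</p>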
<div class="wp-caption alignnone" style="width: 958px"><a href="http://www.neilkodner.com/images/littlesnapper/createAlert.png"><img title="filter options" src="http://www.neilkodner.com/images/littlesnapper/createAlert.png" alt="" width="948" height="970" /></a><p class="wp-caption-text">filter options</p></div>
<p>We&#8217;re halfway there.  Now to organize the filters.  We can create a custom gmail filter that looks for messages sent from Yahoo that use our special suffix, twitterbackup in this example.</p>
<p><a href="http://www.neilkodner.com/images/littlesnapper/create%20gmail%20filter.png"><img class="alignnone" title="create a gmail filter" src="http://www.neilkodner.com/images/littlesnapper/create%20gmail%20filter.png" alt="create a gmail filter" width="1095" height="343" /></a></p>
<p>In my case, I chose to have the label TwitterBackup applied, mark the message as read, and archive it.</p>
<p><img class="alignnone" title="filter settings" src="http://www.neilkodner.com/images/littlesnapper/filter%20settings.png" alt="gmail filter settings" width="1092" height="395" /></p>
<p>So now, all of our tweets are silently loaded into our gmail account, easily retrieved if we ever need them.  This is helpful because Twitter&#8217;s search only goes back so far.</p>
<p>From there, searching is just a matter of looking for keywords within a certain label.  Let&#8217;s search for skate:</p>
<p><img class="alignnone" title="skate search" src="http://www.neilkodner.com/images/littlesnapper/skate%20search.png" alt="skate search" width="861" height="627" /></p>
<p>And here is the message detail.  You get some of the typical Yahoo &#8216;noise&#8217; at the end of the email, but hey, it&#8217;s free!</p>
<p><img class="alignnone" title="skate search result" src="http://www.neilkodner.com/images/littlesnapper/skatesearchresult.png" alt="skate search result" width="1059" height="818" /></p>
<p>And finally, back to the original tweet, with apologies to <a href="http://www.twitter.com/iamjameshall">@IamJamesHall</a></p>
<p><a href="http://twitter.com/neilkod/statuses/4835222255">http://twitter.com/neilkod/statuses/4835222255</a></p>
<p>Which is especially nice because Twitter search has no recollection of me ever mentioning skate, especially not way back in October of 2009 (but hey, at least it&#8217;s fast!).</p>
<p><img class="alignnone" title="twitterskatesearch" src="http://www.neilkodner.com/images/littlesnapper/twittersearch.png" alt="twitterskatesearch" width="791" height="248" /></p>
<p>I&#8217;d eventually like to port this to use Yahoo Pipes rather than Yahoo Alerts to receive the data in a more concise format, but this method gets the job done.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/06/a-free-simple-way-to-backup-and-search-your-tweets/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hacking Seinfeld Tweets with Apache Pig &#8211; A work in progress</title>
		<link>http://www.neilkodner.com/2010/04/hacking-seinfeld-tweets-with-apache-pig-a-work-in-progress/</link>
		<comments>http://www.neilkodner.com/2010/04/hacking-seinfeld-tweets-with-apache-pig-a-work-in-progress/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 17:41:57 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=190</guid>
		<description><![CDATA[As some of you know, my twitter bot @hellooooonewman responds to every tweet containing the word/hashtag &#8216;Seinfeld&#8217;.  Using Python and the Twitter Search REST API, it looks for mentions and then replies to the original author with a random Seinfeld quote.  People seem to get a kick out of it, judging by its 2,000 followers [...]]]></description>
			<content:encoded><![CDATA[<p>As some of you know, my twitter bot <a href="http://www.twitter.com/hellooooonewman">@hellooooonewman</a> responds to every tweet containing the word/hashtag &#8216;Seinfeld&#8217;.  Using Python and the Twitter Search REST API, it looks for mentions and then replies to the original author with a random Seinfeld quote.  People seem to get a kick out of it, judging by its 2,000 followers in about a month&#8217;s time.</p>
<p>What most don&#8217;t realize is that I&#8217;m capturing data each time my program finds a search result.  I capture:</p>
<ul>
<li>Time of day the tweet was received (in California time, currently PDT)</li>
<li>Tweet ID of the tweet</li>
<li>The author of the tweet</li>
<li>The tweet itself</li>
<li>The reply sent back to the original author</li>
</ul>
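<p>As a rough illustration of what one captured record looks like in code, here is a Python sketch that parses a single log line.  It assumes a tab-separated file whose six column names match the schema in the Pig LOAD statement in this post (the post does not say what the bot column holds); the class name, function name and sample line are invented.</p>

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One captured tweet; field names follow the Pig schema in this post."""
    timestamp: str
    bot: str
    id: str
    author: str
    tweet: str
    response: str


def parse_log_line(line):
    """Split one tab-separated log line into a LogRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LogRecord._fields):
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    return LogRecord(*fields)


# Hypothetical sample line, not real captured data.
sample = "2010-04-23 09:15:02\thellooooonewman\t12345\tsome_user\tWatching Seinfeld\tHello, Newman.\n"
record = parse_log_line(sample)
print(record.author, record.tweet)
```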
<p>In a quest to become a former Oracle DBA (not that there&#8217;s anything wrong with that!) and move into working with big data, Hadoop, etc., I&#8217;ve been spending a lot of time working with <a title="Apache Pig" href="http://hadoop.apache.org/pig/" target="_blank">Apache Pig</a>, the tool I have currently chosen to analyze large data sets.   After running through the tutorial programs, I thought I would try my hand at some of my own queries.  Here are a few results.</p>
<p>Warning: I went after some seriously low-hanging fruit for these experiments</p>
<p>Using data collected 21-Mar-2010 through 23-Apr-2010, I&#8217;ve captured 21,000* tweets about Seinfeld.</p>
<p>*Yes I know this isn&#8217;t &#8220;Big Data&#8221;, I&#8217;m just screwing around.</p>
<p>The first thing I wanted to find out was what time of day, specifically what hour, people were tweeting about Seinfeld.    Using SQL, this would have been a no-brainer:</p>
<pre class="brush: sql;">
  -- HH24 returns the 24-hour clock hour (00-23); plain HH is the 12-hour value
  select  to_char(timestamp,'HH24') hour
       ,  count(*)
    from  seinfeld_tweets
group by  to_char(timestamp,'HH24')
order by  to_char(timestamp,'HH24');
</pre>
<p>Easy enough, right?  Fortunately, doing the same in Pig isn&#8217;t all that bad.  The big whoa-moment (think Keanu Reeves in The Matrix) when using Pig is realizing that the grouping and aggregation take place in separate steps.  Coming from a strong SQL background, that&#8217;s just weird.</p>
<p>Since I haven&#8217;t really mastered Pig&#8217;s date-handling, I cheated by just using SUBSTRING() to capture the hour of the tweet.  A few helpful folks in the Cloudera IRC channel told me that Pig&#8217;s date-handling functionality is still pretty rough, so I decided to take the easy way out.  That way I can stick with the core learning instead of fighting with built-in functions that don&#8217;t work entirely as they should.</p>
<p>The equivalent Pig code is</p>
<pre class="brush: plain;">
A = load '/Users/neil/seinfeld.log' using PigStorage('\t') as (timestamp:chararray,bot:chararray,id:chararray,author:chararray,tweet:chararray,response:chararray);
B = FOREACH A GENERATE author, org.apache.pig.piggybank.evaluation.string.SUBSTRING(timestamp,11,13) as hr;
grpd = GROUP B by hr;
cntd = FOREACH grpd GENERATE $0,COUNT(B);
</pre>
<p>Not too bad.  Line 1 loads the Seinfeld log into A, and line 2 extracts the author of the tweet and the hour.  Line 3 groups by the value of hour, and line 4 computes the tweets per hour.  I didn&#8217;t have to sort the data because the output was already sorted in the order I wanted.  This was unexpected.  I don&#8217;t know if I got lucky or this is desired behavior.  It is a stark difference from Oracle, where you can never really expect any order to your data, even aggregates, unless you include an ORDER BY clause.</p>
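<p>For comparison, here is the same group-then-aggregate idea as a small Python sketch, with the two steps kept separate the way Pig does it.  It assumes timestamps shaped like &#8220;YYYY-MM-DD HH:MM:SS&#8221;, so that characters 11 and 12 hold the hour, and that the piggybank SUBSTRING takes a start and an end index like Java&#8217;s substring.  The function names and sample rows are made up.</p>

```python
from collections import defaultdict


def group_by_hour(rows):
    """Step 1 (GROUP): bucket (timestamp, author) rows by hour of day."""
    grouped = defaultdict(list)
    for timestamp, author in rows:
        hour = timestamp[11:13]  # mirrors SUBSTRING(timestamp,11,13)
        grouped[hour].append(author)
    return grouped


def count_groups(grouped):
    """Step 2 (FOREACH ... COUNT): count the rows in each bucket."""
    # Sort explicitly instead of relying on whatever order the buckets arrive in.
    return sorted((hour, len(members)) for hour, members in grouped.items())


# Hypothetical rows, not real captured tweets.
rows = [
    ("2010-04-23 09:15:02", "alice"),
    ("2010-04-23 17:01:33", "carol"),
    ("2010-04-23 09:47:10", "bob"),
]
print(count_groups(group_by_hour(rows)))
```

<p>The explicit sort is deliberate: as far as I know, Pig only promises an order when you ask for one with ORDER, so the sorted output I saw is probably not something to depend on.</p>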
<p><a href="http://www.neilkodner.com/wp-content/uploads/2010/04/Seinfeld-Tweets1.png"><img class="alignnone size-large wp-image-205" title="Seinfeld Tweets" src="http://www.neilkodner.com/wp-content/uploads/2010/04/Seinfeld-Tweets1-1024x768.png" alt="" width="1024" height="768" /></a></p>
<p>Unscientifically, people tend to tweet more about Seinfeld as the day goes on.  The times in the raw data are the times I captured each Seinfeld tweet, not when the tweet was actually sent, although I estimate the capture is never more than ten or fifteen seconds behind.  I would like to modify my program to incorporate time zones.</p>
<p>The second test was to see just who is tweeting about Seinfeld.  Out of about 216e2 tweets (a nod to my hero <a href="http://www.twitter.com/mat_kelcey">@mat_kelcey</a>) we had 15,260 distinct users mentioning Seinfeld.</p>
<p>The SQL code to generate this data is similar to the hour-of-the-day query above:</p>
<pre class="brush: sql;">
  select  author
       ,  count(*)
    from  seinfeld_tweets
group by  author
order by  count(*);
</pre>
<p>The Pig code is just as similar:</p>
<pre class="brush: plain;">
usrs = GROUP A by author;
cntd = FOREACH usrs GENERATE $0 as user,COUNT(A) as cnt;
srtd = ORDER cntd BY cnt;
</pre>
<p>This time, I explicitly wanted to sort by the number of tweets for each user, so I added the ORDER command.</p>
<p>According to R, that&#8217;s not terribly exciting data:</p>
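<p>The same count-and-sort can be sketched in a few lines of Python.  This is only an illustration with an invented function name and invented authors; ties are broken alphabetically, which is my own choice and not something the Pig script specifies.</p>

```python
from collections import Counter


def tweets_per_author(authors):
    """Count tweets per author and sort ascending by count, then by name."""
    counts = Counter(authors)
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


# Hypothetical authors, one entry per tweet.
authors = ["alice", "bob", "alice", "carol", "alice", "bob"]
for user, cnt in tweets_per_author(authors):
    print(user, cnt)
```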
<pre class="brush: plain;">
&gt; summary(srtd$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.417   1.000  97.000
</pre>
<p>Fine, let&#8217;s take a look at the top 100 Seinfeld-tweeters:</p>
<pre class="brush: plain;">
&gt; last&lt;-tail(srtd,100)
&gt; summary(last$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   10.00   14.00   18.23   20.00   97.00
</pre>
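<p>If R isn&#8217;t handy, a similar six-number summary can be approximated with Python&#8217;s standard library.  This is a sketch under assumptions: the function names and the counts are made up, and I am using the &#8220;inclusive&#8221; quantile method, which to the best of my understanding interpolates the same way as R&#8217;s default.</p>

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean and max, like the six columns R prints."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def top_n(values, n):
    """Largest n values (n must be positive), similar to tail() on sorted data."""
    return sorted(values)[-n:]


# Made-up per-user tweet counts, only to exercise the functions.
counts = [1, 1, 1, 1, 2, 3, 9]
print(summarize(counts))
print(summarize(top_n(counts, 3)))
```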
<p>Still not terribly interesting, maybe to me, anyway.  We know that most people have just one tweet, so let&#8217;s see who the big Seinfeld-tweeters were for the last month or so:</p>
<pre class="brush: plain;">
&gt; tail(last)
               author frequency
15255          nuoptv        45
15256 OctavioCalegari        48
15257    AceCostaRica        51
15258       sony_prog        58
15259       ZDFneo_TV        60
15260     kaptainmyke        97
</pre>
<p>I saved you the work of looking at the profile pages, and saved myself the work of hyperlinking to each profile.  Looks like @kaptainmyke is trying to sell us something, so he doesn&#8217;t count.  @ZDFneo_TV and @sony_prog are both bots, so the winner of the March-April Seinfeld-tweet-off is <a href="http://www.twitter.com/AceCostaRica">@AceCostaRica</a> with 51 tweets.</p>
<p>In order to advance my Pig skills, I will continue the experimentation throughout the next few days.  One of the things I&#8217;m really liking about Pig is that since each step gets its own name, the steps can be reused as part of a larger overall process.  I&#8217;ll try to work in some examples.</p>
<p>More interesting results to come.  I spent way too much time futzing around with ggplot2 trying to produce charts that didn&#8217;t make this post; I could have been doing more with Pig.  Such is the nature of tinkering.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/04/hacking-seinfeld-tweets-with-apache-pig-a-work-in-progress/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Tiger Woods Apology word cloud and word-frequencies</title>
		<link>http://www.neilkodner.com/2010/02/tiger-woods-apology-word-cloud-and-word-frequencies/</link>
		<comments>http://www.neilkodner.com/2010/02/tiger-woods-apology-word-cloud-and-word-frequencies/#comments</comments>
		<pubDate>Fri, 19 Feb 2010 19:25:04 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=184</guid>
		<description><![CDATA[See that tiny little &#8216;s-o-r-r-y&#8217;? Right under the giant &#8216;p-e-o-p-l-e&#8217;. He spoke more about others&#8217; reactions to his infidelities than about taking responsibility for his own actions. But what did we expect? Word-frequencies: 45 &#8211; 1 Accenture &#8211; 1 Achievements &#8211; 1 And &#8211; 1 Buddhist &#8211; 1 California &#8211; 1 Center &#8211; 1 [...]]]></description>
			<content:encoded><![CDATA[<p>See that tiny little &#8216;s-o-r-r-y&#8217;?  Right under the giant &#8216;p-e-o-p-l-e&#8217;.  He spoke more about others&#8217; reactions to his infidelities than about taking responsibility for his own actions.  But what did we expect?</p>
<div class="wp-caption alignnone" style="width: 1010px"><a href="http://www.neilkodner.com/images/skitch/TigerWoodsApology-20100219-141243.jpg"><img title="Tiger Woods Apology" src="http://www.neilkodner.com/images/skitch/TigerWoodsApology-20100219-141243.jpg" alt="Tiger Woods Apology" width="1000" height="550" /></a><p class="wp-caption-text">Tiger Woods Apology</p></div>
<p>Word-frequencies:<br />
45 &#8211; 1<br />
Accenture &#8211; 1<br />
Achievements &#8211; 1<br />
And &#8211; 1<br />
Buddhist &#8211; 1<br />
California &#8211; 1<br />
Center &#8211; 1<br />
Character &#8211; 1<br />
Commissioner &#8211; 1<br />
DC &#8211; 1<br />
December &#8211; 1<br />
Despite &#8211; 1<br />
Earl &#8211; 1<br />
February &#8211; 1<br />
Finally &#8211; 1<br />
Finchem &#8211; 1<br />
From &#8211; 1<br />
Good &#8211; 1<br />
However &#8211; 1<br />
Instead &#8211; 1<br />
Learning &#8211; 1<br />
Now &#8211; 1<br />
Obviously &#8211; 1<br />
PGA &#8211; 1<br />
Parents &#8211; 1<br />
Part &#8211; 1<br />
Please &#8211; 1<br />
Southern &#8211; 1<br />
Starting &#8211; 1<br />
TOUR &#8211; 1<br />
Thank &#8211; 1<br />
Thanks &#8211; 1<br />
Thanksgiving &#8211; 1<br />
That &#8211; 1<br />
Thats &#8211; 1<br />
There &#8211; 1<br />
These &#8211; 1<br />
Thirteen &#8211; 1<br />
Today &#8211; 1<br />
Washington &#8211; 1<br />
We &#8211; 1<br />
What &#8211; 1<br />
Whatever &#8211; 1<br />
Woods &#8211; 1<br />
Your &#8211; 1<br />
above &#8211; 1<br />
acceptable &#8211; 1<br />
actions &#8211; 1<br />
actively &#8211; 1<br />
admired &#8211; 1<br />
admit &#8211; 1<br />
affairs &#8211; 1<br />
again &#8211; 1<br />
age &#8211; 1<br />
ago &#8211; 1<br />
alone &#8211; 1<br />
amends &#8211; 1<br />
angers &#8211; 1<br />
answers &#8211; 1<br />
any &#8211; 1<br />
atone &#8211; 1<br />
attacked &#8211; 1<br />
aware &#8211; 1<br />
because &#8211; 1<br />
before &#8211; 1<br />
believed &#8211; 1<br />
bitterly &#8211; 1<br />
board &#8211; 1<br />
born &#8211; 1<br />
business &#8211; 1<br />
calls &#8211; 1<br />
can &#8211; 1<br />
causes &#8211; 1<br />
centered &#8211; 1<br />
changed &#8211; 1<br />
chase &#8211; 1<br />
cheated &#8211; 1<br />
cheered &#8211; 1<br />
childhood &#8211; 1<br />
closest &#8211; 1<br />
commercial &#8211; 1<br />
completely &#8211; 1<br />
concerned &#8211; 1<br />
considerable &#8211; 1<br />
continues &#8211; 1<br />
convinced &#8211; 1<br />
core &#8211; 1<br />
count &#8211; 1<br />
couple &#8211; 1<br />
craving &#8211; 1<br />
critical &#8211; 1<br />
dad &#8211; 1<br />
daughter &#8211; 1<br />
days &#8211; 1<br />
decency &#8211; 1<br />
dedicate &#8211; 1<br />
dedicated &#8211; 1<br />
deeply &#8211; 1<br />
deserved &#8211; 1<br />
deserves &#8211; 1<br />
details &#8211; 1<br />
different &#8211; 1<br />
direction &#8211; 1<br />
directly &#8211; 1<br />
directors &#8211; 1<br />
disappointed &#8211; 1<br />
disappointment &#8211; 1<br />
discussing &#8211; 1<br />
doesnt &#8211; 1<br />
doing &#8211; 1<br />
domestic &#8211; 1<br />
dreams &#8211; 1<br />
drifted &#8211; 1<br />
drugs &#8211; 1<br />
early &#8211; 1<br />
education &#8211; 1<br />
emails &#8211; 1<br />
embarrassed &#8211; 1<br />
encouragement &#8211; 1<br />
end &#8211; 1<br />
endorsements &#8211; 1<br />
engaged &#8211; 1<br />
enjoy &#8211; 1<br />
enormous &#8211; 1<br />
entire &#8211; 1<br />
entitled &#8211; 1<br />
envisioned &#8211; 1<br />
episode &#8211; 1<br />
especially &#8211; 1<br />
example &#8211; 1<br />
expressing &#8211; 1<br />
fabricate &#8211; 1<br />
facing &#8211; 1<br />
failures &#8211; 1<br />
faith &#8211; 1<br />
false &#8211; 1<br />
fame &#8211; 1<br />
families &#8211; 1<br />
fans &#8211; 1<br />
fellow &#8211; 1<br />
field &#8211; 1<br />
first &#8211; 1<br />
focus &#8211; 1<br />
follow &#8211; 1<br />
form &#8211; 1<br />
game &#8211; 1<br />
grace &#8211; 1<br />
grow &#8211; 1<br />
guidance &#8211; 1<br />
happened &#8211; 1<br />
heard &#8211; 1<br />
heart &#8211; 1<br />
helping &#8211; 1<br />
her &#8211; 1<br />
here &#8211; 1<br />
hit &#8211; 1<br />
home &#8211; 1<br />
hope &#8211; 1<br />
however &#8211; 1<br />
hurting &#8211; 1<br />
husband &#8211; 1<br />
importance &#8211; 1<br />
importantly &#8211; 1<br />
impulse &#8211; 1<br />
including &#8211; 1<br />
inpatient &#8211; 1<br />
integrity &#8211; 1<br />
intend &#8211; 1<br />
joining &#8211; 1<br />
just &#8211; 1<br />
keeping &#8211; 1<br />
kept &#8211; 1<br />
knew &#8211; 1<br />
learn &#8211; 1<br />
letters &#8211; 1<br />
live &#8211; 1<br />
lives &#8211; 1<br />
location &#8211; 1<br />
long &#8211; 1<br />
looking &#8211; 1<br />
lost &#8211; 1<br />
maintain &#8211; 1<br />
man &#8211; 1<br />
married &#8211; 1<br />
matter &#8211; 1<br />
matters &#8211; 1<br />
media &#8211; 1<br />
millions &#8211; 1<br />
mistakes &#8211; 1<br />
model &#8211; 1<br />
mom &#8211; 1<br />
money &#8211; 1<br />
morning &#8211; 1<br />
move &#8211; 1<br />
needs &#8211; 1<br />
normal &#8211; 1<br />
now &#8211; 1<br />
once &#8211; 1<br />
ordeal &#8211; 1<br />
ourselves &#8211; 1<br />
outside &#8211; 1<br />
over &#8211; 1<br />
overcome &#8211; 1<br />
pain &#8211; 1<br />
paparazzi &#8211; 1<br />
part &#8211; 1<br />
partners &#8211; 1<br />
path &#8211; 1<br />
patience &#8211; 1<br />
peers &#8211; 1<br />
performanceenhancing &#8211; 1<br />
personal &#8211; 1<br />
personally &#8211; 1<br />
phone &#8211; 1<br />
photographs &#8211; 1<br />
plan &#8211; 1<br />
play &#8211; 1<br />
please &#8211; 1<br />
point &#8211; 1<br />
pointed &#8211; 1<br />
pointless &#8211; 1<br />
poise &#8211; 1<br />
position &#8211; 1<br />
practiced &#8211; 1<br />
praise &#8211; 1<br />
press &#8211; 1<br />
probably &#8211; 1<br />
proceed &#8211; 1<br />
process &#8211; 1<br />
professional &#8211; 1<br />
professionally &#8211; 1<br />
public &#8211; 1<br />
pursued &#8211; 1<br />
put &#8211; 1<br />
question &#8211; 1<br />
raised &#8211; 1<br />
ran &#8211; 1<br />
reach &#8211; 1<br />
reached &#8211; 1<br />
real &#8211; 1<br />
realize &#8211; 1<br />
reason &#8211; 1<br />
receive &#8211; 1<br />
received &#8211; 1<br />
receiving &#8211; 1<br />
recognize &#8211; 1<br />
regain &#8211; 1<br />
released &#8211; 1<br />
relying &#8211; 1<br />
remains &#8211; 1<br />
remarks &#8211; 1<br />
repeated &#8211; 1<br />
repeating &#8211; 1<br />
report &#8211; 1<br />
respectful &#8211; 1<br />
restraint &#8211; 1<br />
role &#8211; 1<br />
rule &#8211; 1<br />
said &#8211; 1<br />
sake &#8211; 1<br />
same &#8211; 1<br />
save &#8211; 1<br />
scholars &#8211; 1<br />
school &#8211; 1<br />
schools &#8211; 1<br />
search &#8211; 1<br />
security &#8211; 1<br />
seeing &#8211; 1<br />
seek &#8211; 1<br />
seeking &#8211; 1<br />
separate &#8211; 1<br />
setting &#8211; 1<br />
shame &#8211; 1<br />
shield &#8211; 1<br />
should &#8211; 1<br />
shown &#8211; 1<br />
simply &#8211; 1<br />
some &#8211; 1<br />
someday &#8211; 1<br />
somehow &#8211; 1<br />
space &#8211; 1<br />
special &#8211; 1<br />
speculated &#8211; 1<br />
spiritual &#8211; 1<br />
spotlight &#8211; 1<br />
staff &#8211; 1<br />
staked &#8211; 1<br />
start &#8211; 1<br />
started &#8211; 1<br />
starts &#8211; 1<br />
steps &#8211; 1<br />
stop &#8211; 1<br />
stopped &#8211; 1<br />
story &#8211; 1<br />
straight &#8211; 1<br />
supported &#8211; 1<br />
sure &#8211; 1<br />
taken &#8211; 1<br />
temptations &#8211; 1<br />
than &#8211; 1<br />
thats &#8211; 1<br />
they &#8211; 1<br />
think &#8211; 1<br />
thousands &#8211; 1<br />
throughout &#8211; 1<br />
times &#8211; 1<br />
today &#8211; 1<br />
together &#8211; 1<br />
tomorrow &#8211; 1<br />
track &#8211; 1<br />
treatment &#8211; 1<br />
true &#8211; 1<br />
truly &#8211; 1<br />
two &#8211; 1<br />
twoandahalfyearold &#8211; 1<br />
unchanged &#8211; 1<br />
unhappy &#8211; 1<br />
until &#8211; 1<br />
us &#8211; 1<br />
utterly &#8211; 1<br />
values &#8211; 1<br />
violence &#8211; 1<br />
wants &#8211; 1<br />
week &#8211; 1<br />
weeks &#8211; 1<br />
whatever &#8211; 1<br />
when &#8211; 1<br />
where &#8211; 1<br />
whether &#8211; 1<br />
which &#8211; 1<br />
why &#8211; 1<br />
wifes &#8211; 1<br />
wishes &#8211; 1<br />
words &#8211; 1<br />
worry &#8211; 1<br />
written &#8211; 1<br />
wrongdoings &#8211; 1<br />
year &#8211; 1<br />
Buddhism &#8211; 2<br />
But &#8211; 2<br />
In &#8211; 2<br />
It &#8211; 2<br />
My &#8211; 2<br />
People &#8211; 2<br />
The &#8211; 2<br />
This &#8211; 2<br />
When &#8211; 2<br />
achieve &#8211; 2<br />
also &#8211; 2<br />
always &#8211; 2<br />
apology &#8211; 2<br />
around &#8211; 2<br />
away &#8211; 2<br />
balance &#8211; 2<br />
become &#8211; 2<br />
better &#8211; 2<br />
blame &#8211; 2<br />
boundaries &#8211; 2<br />
brought &#8211; 2<br />
change &#8211; 2<br />
come &#8211; 2<br />
continue &#8211; 2<br />
course &#8211; 2<br />
damage &#8211; 2<br />
didnt &#8211; 2<br />
discuss &#8211; 2<br />
each &#8211; 2<br />
ever &#8211; 2<br />
far &#8211; 2<br />
felt &#8211; 2<br />
following &#8211; 2<br />
foolish &#8211; 2<br />
forward &#8211; 2<br />
foundation &#8211; 2<br />
get &#8211; 2<br />
go &#8211; 2<br />
golf &#8211; 2<br />
good &#8211; 2<br />
hard &#8211; 2<br />
hurt &#8211; 2<br />
important &#8211; 2<br />
involved &#8211; 2<br />
irresponsible &#8211; 2<br />
issue &#8211; 2<br />
issues &#8211; 2<br />
leave &#8211; 2<br />
like &#8211; 2<br />
living &#8211; 2<br />
look &#8211; 2<br />
making &#8211; 2<br />
marriage &#8211; 2<br />
means &#8211; 2<br />
most &#8211; 2<br />
mother &#8211; 2<br />
other &#8211; 2<br />
others &#8211; 2<br />
our &#8211; 2<br />
questions &#8211; 2<br />
recent &#8211; 2<br />
remain &#8211; 2<br />
rules &#8211; 2<br />
selfish &#8211; 2<br />
sponsors &#8211; 2<br />
still &#8211; 2<br />
students &#8211; 2<br />
support &#8211; 2<br />
teaches &#8211; 2<br />
through &#8211; 2<br />
time &#8211; 2<br />
tried &#8211; 2<br />
understanding &#8211; 2<br />
unfaithful &#8211; 2<br />
used &#8211; 2<br />
wanted &#8211; 2<br />
way &#8211; 2<br />
were &#8211; 2<br />
while &#8211; 2<br />
worked &#8211; 2<br />
world &#8211; 2<br />
would &#8211; 2<br />
wrong &#8211; 2<br />
years &#8211; 2<br />
your &#8211; 2<br />
youve &#8211; 2<br />
As &#8211; 3<br />
For &#8211; 3<br />
Im &#8211; 3<br />
Its &#8211; 3<br />
Many &#8211; 3<br />
Some &#8211; 3<br />
To &#8211; 3<br />
an &#8211; 3<br />
apply &#8211; 3<br />
as &#8211; 3<br />
ask &#8211; 3<br />
been &#8211; 3<br />
between &#8211; 3<br />
caused &#8211; 3<br />
day &#8211; 3<br />
down &#8211; 3<br />
every &#8211; 3<br />
everyone &#8211; 3<br />
find &#8211; 3<br />
had &#8211; 3<br />
its &#8211; 3<br />
learned &#8211; 3<br />
let &#8211; 3<br />
make &#8211; 3<br />
need &#8211; 3<br />
night &#8211; 3<br />
owe &#8211; 3<br />
person &#8211; 3<br />
players &#8211; 3<br />
private &#8211; 3<br />
really &#8211; 3<br />
return &#8211; 3<br />
right &#8211; 3<br />
sorry &#8211; 3<br />
taught &#8211; 3<br />
them &#8211; 3<br />
thought &#8211; 3<br />
understand &#8211; 3<br />
up &#8211; 3<br />
we &#8211; 3<br />
with &#8211; 3<br />
young &#8211; 3<br />
They &#8211; 4<br />
about &#8211; 4<br />
believe &#8211; 4<br />
but &#8211; 4<br />
children &#8211; 4<br />
did &#8211; 4<br />
dont &#8211; 4<br />
friends &#8211; 4<br />
how &#8211; 4<br />
lot &#8211; 4<br />
made &#8211; 4<br />
many &#8211; 4<br />
more &#8211; 4<br />
only &#8211; 4<br />
or &#8211; 4<br />
say &#8211; 4<br />
thank &#8211; 4<br />
their &#8211; 4<br />
therapy &#8211; 4<br />
there &#8211; 4<br />
these &#8211; 4<br />
those &#8211; 4<br />
at &#8211; 5<br />
by &#8211; 5<br />
could &#8211; 5<br />
done &#8211; 5<br />
help &#8211; 5<br />
kids &#8211; 5<br />
out &#8211; 5<br />
room &#8211; 5<br />
so &#8211; 5<br />
work &#8211; 5<br />
all &#8211; 6<br />
do &#8211; 6<br />
family &#8211; 6<br />
life &#8211; 6<br />
myself &#8211; 6<br />
never &#8211; 6<br />
not &#8211; 6<br />
one &#8211; 6<br />
what &#8211; 6<br />
wife &#8211; 6<br />
Ive &#8211; 7<br />
be &#8211; 7<br />
has &#8211; 7<br />
is &#8211; 7<br />
on &#8211; 7<br />
from &#8211; 8<br />
know &#8211; 8<br />
things &#8211; 8<br />
who &#8211; 8<br />
am &#8211; 9<br />
are &#8211; 9<br />
behavior &#8211; 9<br />
it &#8211; 9<br />
want &#8211; 9<br />
Elin &#8211; 10<br />
will &#8211; 10<br />
people &#8211; 11<br />
this &#8211; 11<br />
was &#8211; 11<br />
for &#8211; 19<br />
you &#8211; 19<br />
a &#8211; 22<br />
in &#8211; 23<br />
that &#8211; 24<br />
have &#8211; 28<br />
me &#8211; 29<br />
of &#8211; 29<br />
the &#8211; 38<br />
and &#8211; 48<br />
my &#8211; 53<br />
to &#8211; 77<br />
I &#8211; 105</p>
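<p>A list like the one above can be produced with a short script.  Here is a minimal Python sketch of one way to do it, not necessarily how this list was generated.  It assumes punctuation is stripped before counting (which would explain entries like &#8216;doesnt&#8217; and &#8216;twoandahalfyearold&#8217;), that counting is case-sensitive, and that rows are sorted by count and then by word.  The function name and the sample sentence are made up.</p>

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-sensitive words after stripping punctuation."""
    # Keep only letters, digits and whitespace, so "don't" becomes "dont".
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)
    counts = Counter(cleaned.split())
    # Ascending by count, then by word; capitals sort before lowercase.
    return sorted(counts.items(), key=lambda pair: (pair[1], pair[0]))


# Made-up sentence, not taken from the speech.
for word, count in word_frequencies("I am sorry. I don't know, I am."):
    print(word, "-", count)
```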
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2010/02/tiger-woods-apology-word-cloud-and-word-frequencies/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
