<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>neilkodner.com &#187; r</title>
	<atom:link href="http://www.neilkodner.com/tag/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.neilkodner.com</link>
	<description>Data Driven.  Since 1971.</description>
	<lastBuildDate>Sun, 23 Oct 2011 16:40:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Visualizations of Canabalt scores scraped from twitter</title>
		<link>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/</link>
		<comments>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/#comments</comments>
		<pubDate>Wed, 16 Feb 2011 22:56:46 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[canabalt]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=479</guid>
		<description><![CDATA[Canabalt, a ridiculously addictive web and iOS game, lets players post their high scores, and their not-so-high scores, to Twitter. Each of these tweets contains a bit of information &#8211; the score in meters, the method of death (hitting a wall and tumbling to my death) and the device (iPhone). Other useful information can [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.canabalt.com/">Canabalt</a>, a ridiculously addictive web and iOS game, lets players post their high scores, and their <a href="http://twitter.com/#!/neilkod/status/37964035903324160">not-so-high scores</a>, to Twitter.</p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore.png"><img class="alignnone size-medium wp-image-485" title="canabaltscore" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscore-300x105.png" alt="" width="300" height="105" /></a></p>
<p>Each of these tweets contains a bit of information &#8211; the score in meters, the method of death (hitting a wall and tumbling to my death) and the device (iPhone). Other useful information can easily be extracted, such as the date/time played and information about the user (name, location, friend count, follower count, etc). Over the next few weeks I aim to see which features, if any, have an influence on Canabalt scores.</p>
<p>The first thing I needed to do was capture the tweeted Canabalt scores. I have a process running on an EC2 micro instance that downloads tweets from the Twitter Streaming API based on certain keywords, one of them being canabalt. The process loads each matching tweet into a MongoDB instance hosted on <a href="http://www.mongohq.com">MongoHQ.com</a>.</p>
<pre class="brush: bash; title: ; notranslate">

curl -s -u $TWITTER_USERNAME:$TWITTER_PASSWORD -d @/home/ec2-user/trackingkeywords http://stream.twitter.com/1/statuses/filter.json |/home/ec2-user/mongodb/bin/mongoimport &amp;
</pre>
<p>Here, trackingkeywords is a file containing the comma-separated list of keywords that I track on Twitter. I left the connection details out of the mongoimport command; you&#8217;ll need to supply a host, port, database, and collection.</p>
<p>I then run some Python code to query the MongoDB instance and retrieve tweets mentioning Canabalt, based on a simple regular expression. I&#8217;m expecting the tweet to begin with &#8216;I ran&#8217; and contain the word canabalt. Pretty naive, but it worked fine. If it&#8217;s not a true Canabalt score, I&#8217;ll be able to tell in no time. From there, I use regular expressions to extract (for now) the score, the method of death, and the device name.</p>
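<p>The pipe works because the Streaming API sends newline-delimited JSON, one tweet per line, and mongoimport reads one JSON document per line. The stream may also contain blank keep-alive lines. The following is only an illustrative Python sketch of reading such a stream, not the code running on the EC2 instance; iter_tweets() and the sample lines are made up for the example.</p>
<pre class="brush: python; title: ; notranslate">
import json

def iter_tweets(lines):
    """Yield one dict per non-blank line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            # blank keep-alive line, nothing to parse
            continue
        yield json.loads(line)

# hypothetical sample of what the stream might look like
sample = [
    '{"text": "I ran 2860m before hitting a wall and tumbling to my death on my iPhone. #canabalt"}',
    '',
    '{"text": "just another tweet"}',
]
texts = [tweet['text'] for tweet in iter_tweets(sample)]
print(len(texts))
</pre>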
<pre class="brush: python; title: ; notranslate">
import re

# create_connection() and strip_text() are helper functions defined elsewhere

def canabalt_tweets():

	# connect to MongoDB
	tweets = create_connection(False)

	# regular expression to extract components of a canabalt score
	canabalt_regexp = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')

	# regular expression to match tweets that begin with I ran and mention canabalt
	regexp = re.compile(r'^I ran .*canabalt')

	# create a MongoDB cursor (query) that returns only the text field
	cur = tweets.conftweets.find({'text': regexp}, {'text': 1})

	# iterate through the cursor. If a tweet fits the pattern, print it.
	for item in cur:
		match = canabalt_regexp.search(item['text'])
		if match is None:
			# not a true Canabalt score, so skip it
			continue
		(score, death, device) = match.groups()
		print(','.join([strip_text(score), strip_text(death), strip_text(device)]))
</pre>
<p>The strip_text() function is part of my data-tools Bat-Utility Belt; it cleans text by removing leading and trailing spaces, CR/LF characters, tabs, and some other junk.</p>
<p>We now have some comma-separated data in this shape:</p>
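<p>To make the extraction step concrete, here is a small self-contained sketch. The sample tweet is made up to fit the pattern, parse_score() is a hypothetical wrapper, and this strip_text() is only a guess at what such a helper might do (collapse whitespace and trim), not the actual Bat-Utility Belt code.</p>
<pre class="brush: python; title: ; notranslate">
import re

canabalt_regexp = re.compile(r'I ran (\d{3,7})m before (.*) on my ([^.]+)\.')

def strip_text(text):
    # hypothetical stand-in: collapse runs of spaces, tabs and CR/LF, then trim
    return ' '.join(text.split())

def parse_score(tweet_text):
    """Return (score, death, device), or None if the tweet is not a score."""
    match = canabalt_regexp.search(tweet_text)
    if match is None:
        return None
    score, death, device = match.groups()
    return int(score), strip_text(death), strip_text(device)

# made-up tweet in the shape the regular expression expects
sample = 'I ran 2860m before hitting a wall and tumbling to my death on my iPhone. #canabalt'
print(parse_score(sample))
print(parse_score('I love canabalt'))
</pre>
<p>Returning None for a non-match keeps the caller simple: anything that is not a true score can be skipped without a try/except.</p>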
<pre class="brush: plain; title: ; notranslate">
score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
</pre>
<p>Now for some more fun &#8211; visualization and analysis. This is performed in R because, well, R is awesome. That, and I really need some more practice with R.</p>
<p>To date, I&#8217;ve collected just over 1200 Canabalt &#8216;events&#8217;. I will likely turn this into a web app if there&#8217;s enough interest.</p>
<p>A couple of summaries:</p>
<p>Scores by device type:</p>
<pre class="brush: plain; title: ; notranslate">
      device count mean stddev median   max min range
      iPhone   735 4491   3882 3419.0 36332 102 36230
        iPad   284 4723   3884 4041.5 40630 104 40526
  iPod touch   189 3734   3644 2713.0 28024 102 27922
</pre>
<p>Scores by type of death:</p>
<pre class="brush: plain; title: ; notranslate">
                                            death count mean stddev median   max  min range
          hitting a wall and tumbling to my death   684 4155   3481 3319.5 36332  102 36230
                           missing another window   243 5898   4981 4486.0 40630  409 40221
                         turning into a fine mist    86 3592   2698 2662.5 16441  614 15827
            colliding with some enormous obstacle    40 4768   4247 3256.5 16933  433 16500
                              falling to my death    37 4176   3160 3619.0 13573  567 13006
                       missing a crane completely    22 2950   1774 2923.5  7883  381  7502
                         knocking a building down    21 3399   2267 2849.0  8374  336  8038
                   not quite reaching a billboard    19 3098   1244 2980.0  5772  444  5328
              landing where a building used to be    17 4804   4970 3631.0 22685 1170 21515
          somehow hitting the edge of a billboard    14 5991   3827 5518.5 13547  566 12981
   just barely stumbling out of the first hallway    13  104      1  104.0   104  102     2
              somehow hitting the edge of a crane     7 5497   4835 4942.0 13275  510 12765
       riding a falling building all the way down     4 4278   2162 4195.5  6993 1727  5266
           completely  missing the entire hallway     1 1046     NA 1046.0  1046 1046     0
</pre>
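<p>The summaries above came out of R. For readers more at home in Python, roughly the same group-by summary can be sketched with the standard library alone. The summarize() helper is hypothetical, and the rows are just the nine sample lines shown earlier, so the numbers will not match the full tables.</p>
<pre class="brush: python; title: ; notranslate">
import csv
import io
import statistics
from collections import defaultdict

# the nine sample rows shown earlier, not the full data set
SAMPLE = """score,death,device
2860,hitting a wall and tumbling to my death,iPhone
3427,hitting a wall and tumbling to my death,iPad
4496,hitting a wall and tumbling to my death,iPad
3635,missing another window,iPhone
2040,colliding with some enormous obstacle,iPhone
6017,somehow hitting the edge of a billboard,iPhone
8374,knocking a building down,iPhone
2939,hitting a wall and tumbling to my death,iPad
2021,turning into a fine mist,iPad
"""

def summarize(rows, key):
    """Group scores by the given column and compute summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(int(row['score']))
    summary = {}
    for name, scores in groups.items():
        summary[name] = {
            'count': len(scores),
            'mean': statistics.mean(scores),
            # a single score has no sample standard deviation (R reports NA)
            'stddev': statistics.stdev(scores) if len(scores) != 1 else None,
            'median': statistics.median(scores),
            'max': max(scores),
            'min': min(scores),
            'range': max(scores) - min(scores),
        }
    return summary

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_device = summarize(rows, 'device')
for device, stats in sorted(by_device.items()):
    print(device, stats['count'], stats['median'], stats['max'])
</pre>
<p>Calling summarize(rows, 'death') gives the second kind of table. A death type seen only once gets None for stddev, which mirrors the NA in the last row above.</p>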
<p>And now, in the spirit of killing the almighty data-ink ratio, here are some pictures:<br />
<img class="alignnone size-full wp-image-503" title="overall plot of scores" src="http://www.neilkodner.com/wp-content/uploads/2011/02/canabaltscores.png" alt="plot of scores" width="619" height="630" /></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11.png"><img class="alignnone size-large wp-image-505" title="by death faceted by device type" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathfactedbytype11-1024x779.png" alt="by death faceted by device type" width="717" height="545" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png"><img class="alignnone size-full wp-image-507" title="scores by device" src="http://www.neilkodner.com/wp-content/uploads/2011/02/scores-by-device.png" alt="scores by device" width="534" height="539" /></a></p>
<p><a href="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype.png"><img class="alignnone size-large wp-image-509" title="bydeathtype" src="http://www.neilkodner.com/wp-content/uploads/2011/02/bydeathtype-1024x641.png" alt="by death type" width="614" height="385" /></a></p>
<p>What have we learned? So far, while my data set isn&#8217;t all that large (about 1200 events), we might have enough to make some basic observations and assumptions (corrections, please!). Going into this experiment I thought that iPad players would have generally higher scores, for two reasons: the larger screen, and the fact that iPad players wouldn&#8217;t necessarily be playing &#8216;on-the-go&#8217; as they would be (I know I am) on an iPhone or iPod touch. The data so far agrees: the iPad has higher median and mean scores (4041.5 and 4723) than the iPhone (3419 and 4491) or the iPod touch (2713 and 3734). I&#8217;d like to revisit this as I collect more data.</p>
<p>The leading cause of Canabalt death, by far, is hitting a wall and tumbling to one&#8217;s death, which accounts for 684 of the 1208 events. This surprised me, as I thought it would be falling to my death &#8211; that&#8217;s how my Canabalt games seem to end.</p>
<p>I&#8217;d like to hear your comments, suggestions for new analysis, and most of all, your corrections. You know who you are, and this is how I learn. The data and Python/R source can be found on <a href="https://github.com/neilkod/canabalt">GitHub</a>.</p>
<p>The stack: Twitter Streaming API, EC2, MongoDB, Python, Regular Expressions, R</p>
<p>Things I learned working on this: <a href="http://had.co.nz/plyr/">plyr</a> (group-by and aggregation in R), sorting data frames in R, and a couple of new <a href="http://had.co.nz/ggplot2/">ggplot2</a> tricks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A few quick observations on StackOverflow questions tagged R</title>
		<link>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/</link>
		<comments>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 15:10:42 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[r]]></category>
		<category><![CDATA[rstats]]></category>
		<category><![CDATA[stackoverflow]]></category>

		<guid isPermaLink="false">http://www.neilkodner.com/?p=59</guid>
		<description><![CDATA[While browsing through Pete Skomoroch&#8217;s delicious bookmarks (which is a full-time job in and of itself), I learned that StackOverflow.com makes their underlying q&#38;a data available. Just for fun, I wrote a few quick queries against this dataset, centered around the R tag. Here are a handful of findings &#8211; data is through 31-Oct-2009. Some of [...]]]></description>
			<content:encoded><![CDATA[<p>While browsing through <a href="http://www.datawrangling.com">Pete Skomoroch&#8217;s</a> <a href="http://delicious.com/pskomoroch/">delicious bookmarks</a> (which is a full-time job in and of itself), I learned that <a href="http://www.stackoverflow.com">StackOverflow.com</a> makes their underlying q&amp;a data <a href="http://blog.stackoverflow.com/2009/11/creative-commons-data-dump-nov-09/">available</a>.</p>
<p>Just for fun, I wrote a few quick queries against this dataset, centered around the R tag. Here are a handful of findings &#8211; data is through 31-Oct-2009. Some of this data is already presented on the StackOverflow site, but bear with me here.</p>
<p>The most common tags associated with R are:<br />
statistics &#8211; 46<br />
ggplot2 &#8211; 20<br />
plot &#8211; 13<br />
graphics &#8211; 10<br />
vector &#8211; 9<br />
emacs &#8211; 8<br />
matrix &#8211; 8</p>
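<p>These counts come from looking at which other tags appear on questions that are also tagged r. Here is a minimal Python sketch of that co-occurrence count, assuming each question has already been parsed into a list of tag names. The questions below are invented, and co_tags() is a hypothetical helper rather than the query actually run against the data dump.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter

def co_tags(questions, tag):
    """Count the tags that appear alongside the given tag."""
    counts = Counter()
    for tags in questions:
        if tag in tags:
            counts.update(t for t in tags if t != tag)
    return counts

# invented questions, each reduced to its list of tags
questions = [
    ['r', 'statistics'],
    ['r', 'ggplot2', 'plot'],
    ['r', 'statistics', 'matrix'],
    ['python', 'statistics'],
]
for name, count in co_tags(questions, 'r').most_common(2):
    print(name, count)
</pre>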
<p>We all know that Dirk, Shane, and Hadley lead the way in terms of questions answered, but who knew that chris_dubois leads the pack when it comes to answering their own questions, with 10?</p>
<p>And finally, out of 20 posts totaling 32 answers <a href="http://stackoverflow.com/questions/tagged/ggplot2">tagged with ggplot2</a> (at the time), <a href="http://www.had.co.nz/">Hadley Wickham</a>, the <a href="http://www.had.co.nz/ggplot2/">package&#8217;s</a> author, has contributed only three answers. The fact that the rest of the questions were answered by users speaks <strong>volumes</strong> about the community behind ggplot2. Excellent work, Hadley!</p>
<p>Here is my version of the leaderboard as of the end of October, 2009.</p>
<div id="attachment_63" class="wp-caption alignnone" style="width: 674px"><img src="http://www.neilkodner.com/wp-content/uploads/2009/11/r_october_leaderboard.jpg" alt="r stackoverflow leaderboard" title="r_october_leaderboard" width="664" height="1094" class="size-full wp-image-63" /><p class="wp-caption-text">r stackoverflow leaderboard</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.neilkodner.com/2009/11/a-few-quick-observations-on-stackoverflow-questions-tagged-r/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

