<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for neilkodner.com</title>
	<atom:link href="http://www.neilkodner.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.neilkodner.com</link>
	<description>Oracle, Python, R, Data, Cycling/Multisport, you name it.</description>
	<lastBuildDate>Wed, 18 Aug 2010 12:02:37 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>Comment on Twifficiency scores, analyzed and visualized by admin</title>
		<link>http://www.neilkodner.com/2010/08/twifficiency-scores-analyzed-and-visualized/comment-page-1/#comment-247</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Wed, 18 Aug 2010 12:02:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=311#comment-247</guid>
		<description>Thanks for the comment.  As soon as I hit post, I knew you would chime in!  I appreciate it.
The ks.test() returns


	One-sample Kolmogorov-Smirnov test

data:  scores$score 
D = 0.0616, p-value &lt; 2.2e-16
alternative hypothesis: two-sided</description>
		<content:encoded><![CDATA[<p>Thanks for the comment.  As soon as I hit post, I knew you would chime in!  I appreciate it.<br />
The ks.test() returns</p>
<p>	One-sample Kolmogorov-Smirnov test</p>
<p>data:  scores$score<br />
D = 0.0616, p-value &lt; 2.2e-16<br />
alternative hypothesis: two-sided</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Twifficiency scores, analyzed and visualized by John Myles White</title>
		<link>http://www.neilkodner.com/2010/08/twifficiency-scores-analyzed-and-visualized/comment-page-1/#comment-246</link>
		<dc:creator>John Myles White</dc:creator>
		<pubDate>Wed, 18 Aug 2010 11:58:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=311#comment-246</guid>
		<description>This is very cool. I think your distribution looks asymmetric enough that it&#039;s not normal. I&#039;d try a K-S test as follows in R:

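# load.data() is a placeholder for however you read in the scores (a numeric vector)
scores &lt;- load.data()

# Estimate the parameters of the candidate normal distribution
m &lt;- mean(scores)
s &lt;- sd(scores)

# One-sample K-S test against a normal with that mean and standard deviation
ks.test(scores, &#039;pnorm&#039;, m, s)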

See this page for more info: http://sekhon.berkeley.edu/stats/html/ks.test.html</description>
		<content:encoded><![CDATA[<p>This is very cool. I think your distribution looks asymmetric enough that it&#8217;s not normal. I&#8217;d try a K-S test as follows in R:</p>
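<p># load.data() is a placeholder for however you read in the scores (a numeric vector)<br />
scores &lt;- load.data()</p>
<p># Estimate the parameters of the candidate normal distribution<br />
m &lt;- mean(scores)<br />
s &lt;- sd(scores)</p>
<p># One-sample K-S test against a normal with that mean and standard deviation<br />
ks.test(scores, &#039;pnorm&#039;, m, s)</p>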
<p>See this page for more info: <a href="http://sekhon.berkeley.edu/stats/html/ks.test.html" rel="nofollow">http://sekhon.berkeley.edu/stats/html/ks.test.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER by admin</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/comment-page-1/#comment-244</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Mon, 09 Aug 2010 21:24:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=287#comment-244</guid>
		<description>I like where you&#039;re going with your comments.  Let&#039;s get the easy stuff out of the way: I intended to convert everything to lowercase but was having some trouble getting it to run in Pig and simply ran out of time.  After I came back to the program, I realized my error.  Long story short, I&#039;m converting them all to lowercase and am now re-running.

Huge +1 for GFW, I had no idea.

As far as frequency goes, you have to keep in mind that I&#039;m using a gardenhose connection that was seriously throttled (as was everyone&#039;s).  Around the 15th of July, my volume dropped from as many as 8-9 million tweets/day to as low as 2.5 million tweets/day.  Oh, such first-world problems!

This is a &lt;a href=&quot;http://groups.google.com/group/twitter-development-talk/browse_thread/thread/047365fe3cfa8a02&quot; rel=&quot;nofollow&quot;&gt;known and documented&lt;/a&gt; issue with Twitter&#039;s gardenhose feed.  I have since applied for limited firehose access, and my application is still pending.

I&#039;m basically at the whim of Twitter&#039;s streaming API rate limits as to how many tweets I can download per day.

As far as storing them, I parse the raw Twitter JSON using my handy-dandy &lt;a href=&quot;http://github.com/neilkod/tweetParser&quot; rel=&quot;nofollow&quot;&gt;twitter parser&lt;/a&gt;.  It&#039;s lightweight and built for speed.  It only extracts a few fields out of the entire tweet (id, date, screenname, and tweet), and writes to STDOUT.  I&#039;ve run a few hundred million tweets through this parser and it has not failed me once.  I then group the tweets by day and save them to HDFS.  I thought about storing them on AWS but I&#039;m crippled by lousy DSL upload speeds.  The days when I was grabbing 8-9 million tweets/day produced files that are 1.1 GB AFTER parsing.

Ideally, as my cluster and my experience mature, I will store the tweets in a database, most likely MongoDB.  That, in and of itself, is a new project; I&#039;m still trying to work on my Pig (and Hadoop)-fu in the meantime!

Thanks for the comments; they are great ones.  Watch for more data experiments to come.</description>
		<content:encoded><![CDATA[<p>I like where you&#8217;re going with your comments.  Let&#8217;s get the easy stuff out of the way: I intended to convert everything to lowercase but was having some trouble getting it to run in Pig and simply ran out of time.  After I came back to the program, I realized my error.  Long story short, I&#8217;m converting them all to lowercase and am now re-running.</p>
<p>Huge +1 for GFW, I had no idea.</p>
<p>As far as frequency goes, you have to keep in mind that I&#8217;m using a gardenhose connection that was seriously throttled (as was everyone&#8217;s).  Around the 15th of July, my volume dropped from as many as 8-9 million tweets/day to as low as 2.5 million tweets/day.  Oh, such first-world problems!</p>
<p>This is a <a href="http://groups.google.com/group/twitter-development-talk/browse_thread/thread/047365fe3cfa8a02" rel="nofollow">known and documented</a> issue with Twitter&#8217;s gardenhose feed.  I have since applied for limited firehose access, and my application is still pending.</p>
<p>I&#8217;m basically at the whim of Twitter&#8217;s streaming API rate limits as to how many tweets I can download per day.</p>
<p>As far as storing them, I parse the raw Twitter JSON using my handy-dandy <a href="http://github.com/neilkod/tweetParser" rel="nofollow">twitter parser</a>.  It&#8217;s lightweight and built for speed.  It only extracts a few fields out of the entire tweet (id, date, screenname, and tweet), and writes to STDOUT.  I&#8217;ve run a few hundred million tweets through this parser and it has not failed me once.  I then group the tweets by day and save them to HDFS.  I thought about storing them on AWS but I&#8217;m crippled by lousy DSL upload speeds.  The days when I was grabbing 8-9 million tweets/day produced files that are 1.1 GB AFTER parsing.</p>
<p>Ideally, as my cluster and my experience mature, I will store the tweets in a database, most likely MongoDB.  That, in and of itself, is a new project; I&#8217;m still trying to work on my Pig (and Hadoop)-fu in the meantime!</p>
<p>Thanks for the comments; they are great ones.  Watch for more data experiments to come.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER by Tim</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/comment-page-1/#comment-243</link>
		<dc:creator>Tim</dc:creator>
		<pubDate>Mon, 09 Aug 2010 21:07:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=287#comment-243</guid>
		<description>After looking at the dataset, 10406 occurrences of #fuck in 300 million tweets seems extremely low. It would mean one tweet containing #fuck roughly every 8 minutes (30 * 2 * 24 * 60 / 10406 is about 8.3)

Am I missing something?</description>
		<content:encoded><![CDATA[<p>After looking at the dataset, 10406 occurrences of #fuck in 300 million tweets seems extremely low. It would mean one tweet containing #fuck roughly every 8 minutes (30 * 2 * 24 * 60 / 10406 is about 8.3)</p>
<p>Am I missing something?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER by Tim</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/comment-page-1/#comment-242</link>
		<dc:creator>Tim</dc:creator>
		<pubDate>Mon, 09 Aug 2010 20:23:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=287#comment-242</guid>
		<description>Oh, and I forgot: GFW stands for Great Firewall (of China) ;)</description>
		<content:encoded><![CDATA[<p>Oh, and I forgot: GFW stands for Great Firewall (of China) <img src='http://www.neilkodner.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER by Tim</title>
		<link>http://www.neilkodner.com/2010/08/and-you-thought-you-were-the-first-to-use-dontfuckwithjustinbieber/comment-page-1/#comment-241</link>
		<dc:creator>Tim</dc:creator>
		<pubDate>Mon, 09 Aug 2010 20:13:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=287#comment-241</guid>
		<description>300 million tweets, talk about a dataset!
I&#039;ve taken your output and converted/combined results to lowercase: http://gist.github.com/516017

May I ask which DB stack you are using to store that many tweets?</description>
		<content:encoded><![CDATA[<p>300 million tweets, talk about a dataset!<br />
I&#8217;ve taken your output and converted/combined results to lowercase: <a href="http://gist.github.com/516017" rel="nofollow">http://gist.github.com/516017</a></p>
<p>May I ask which DB stack you are using to store that many tweets?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Meta:  Please excuse the ads! by Josue Rodriguez</title>
		<link>http://www.neilkodner.com/2010/07/meta-please-excuse-the-ads/comment-page-1/#comment-234</link>
		<dc:creator>Josue Rodriguez</dc:creator>
		<pubDate>Fri, 23 Jul 2010 00:45:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=284#comment-234</guid>
		<description>I demand my money back!!! Oh wait, I mean I demand more content.</description>
		<content:encoded><![CDATA[<p>I demand my money back!!! Oh wait, I mean I demand more content.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on A free, simple way to backup and search your tweets by admin</title>
		<link>http://www.neilkodner.com/2010/06/a-free-simple-way-to-backup-and-search-your-tweets/comment-page-1/#comment-229</link>
		<dc:creator>admin</dc:creator>
		<pubDate>Wed, 14 Jul 2010 13:29:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=212#comment-229</guid>
		<description>I&#039;m not sure whether Yahoo has changed its verification policies to exclude email addresses with the plus sign.  You&#039;re always free to use your regular Gmail address and then create a filter that checks for a Yahoo alerts email address as the sender.</description>
		<content:encoded><![CDATA[<p>I&#8217;m not sure whether Yahoo has changed its verification policies to exclude email addresses with the plus sign.  You&#8217;re always free to use your regular Gmail address and then create a filter that checks for a Yahoo alerts email address as the sender.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on A free, simple way to backup and search your tweets by Matt Gavenda</title>
		<link>http://www.neilkodner.com/2010/06/a-free-simple-way-to-backup-and-search-your-tweets/comment-page-1/#comment-228</link>
		<dc:creator>Matt Gavenda</dc:creator>
		<pubDate>Wed, 14 Jul 2010 13:21:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=212#comment-228</guid>
		<description>I did everything you said, but I can&#039;t seem to get Yahoo to accept my email address mattgavenda+twitterbackup@gmail.com because it says that I need to verify it. It won&#039;t verify because it&#039;s not a real email address. Any thoughts? 

Thanks for the tutorial! 
Matt</description>
		<content:encoded><![CDATA[<p>I did everything you said, but I can&#8217;t seem to get Yahoo to accept my email address <a href="mailto:mattgavenda+twitterbackup@gmail.com">mattgavenda+twitterbackup@gmail.com</a> because it says that I need to verify it. It won&#8217;t verify because it&#8217;s not a real email address. Any thoughts? </p>
<p>Thanks for the tutorial!<br />
Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on World Cup Country hashtag mentions through 190gb of tweets by mat kelcey</title>
		<link>http://www.neilkodner.com/2010/06/world-cup-country-hashtag-mentions-through-190gb-of-tweets/comment-page-1/#comment-224</link>
		<dc:creator>mat kelcey</dc:creator>
		<pubDate>Mon, 28 Jun 2010 20:35:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.neilkodner.com/?p=266#comment-224</guid>
		<description>brazil dominates! in terms of sending to EC2, i&#039;ve always preprocessed and sent only the fields i wanted in a tab-separated file. sending a file with just the tweet, if that&#039;s all you&#039;re interested in, might be feasible after some hardcore compression?</description>
		<content:encoded><![CDATA[<p>brazil dominates! in terms of sending to EC2, i&#8217;ve always preprocessed and sent only the fields i wanted in a tab-separated file. sending a file with just the tweet, if that&#8217;s all you&#8217;re interested in, might be feasible after some hardcore compression?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
