Category Archives: Uncategorized

Performing a grouped top-N query in Pig

23-Aug-10

While generating a large item-item similarity matrix, I need to reduce the amount of data I’m returning to the calling program. In short, I’m computing the similarity between over 20,000 different ‘items’, and that results in a gigantic dataset, to the tune of about 3-4 million elements. I now need to reduce [...]

Twifficiency scores, analyzed and visualized

18-Aug-10

While I’ve had some success with getting a few celebrities (Fred Durst, Talib Kweli) to respond or show off @TheBotLebowski to others, yesterday Twifficiency one-upped me and took Twitter and then the national media by storm. Fortunately for you, @jamescun, not too many people I know read your little Time Magazine. (I really hope you’re old [...]

And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER

09-Aug-10

Through the magic of Hadoop, Pig, over 300 million (and counting) tweets, and the never-ending creativity of my fellow Twitter users, I thought I’d take a look at all of the hashtags containing the beloved f-word. Let’s get the technical details out of the way. Since the middle of June, I’ve been saving as many tweets [...]

Meta: Please excuse the ads!

22-Jul-10

As part of testing for another project, I’m experimenting with Google Ads on the site. They shouldn’t be around for more than a day or two.

RT @MyloGang: IT WAS JUST A FUCKIN EARTHQUAKE WTF

16-Jul-10

I woke up to @wahalulu asking me to check my tweets for mentions of the DC earthquake, and I am happy to oblige. I’ll re-run my numbers throughout the day, but here are the mentions since about 7:30 this morning EST. Raw data here.

World Cup country hashtag mentions through 190 GB of tweets

28-Jun-10

Since the beginning of the 2010 World Cup, I’ve been saving tweets from the Twitter gardenhose and trying to find interesting things in the data. Here is a histogram showing the count of mentions for each country’s hashtag. With apologies for my lack of effort in ggplot2: Even though my raw data was sorted by [...]

Words mentioned in 23-Jun-2010 Canadian Earthquake tweets

24-Jun-10

Using Twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping and some reducing, and voilà! The most frequently occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words. [...]

A free, simple way to back up and search your tweets

16-Jun-10

Edit: I’m not sure why WordPress decided to resize some of my pictures; please let me know how to avoid this. It’s not like Twitter is going to go all Ma.gnolia on us and lose all of our data, but here’s a way to back up and search through your prior tweets. I’ve been using [...]

Hacking Seinfeld Tweets with Apache Pig – A work in progress

23-Apr-10

As some of you know, my Twitter bot @hellooooonewman responds to every tweet containing the word/hashtag ‘Seinfeld’. Using Python and the Twitter Search REST API, it looks for mentions and then replies to the original author with a random Seinfeld quote. People seem to get a kick out of it, judging by its 2,000 followers [...]

Tiger Woods Apology word cloud and word-frequencies

19-Feb-10

See that tiny little ‘s-o-r-r-y’? Right under the giant ‘p-e-o-p-l-e’. He spoke more about others’ reactions to his infidelities than about taking responsibility for his own actions. But what did we expect? Word-frequencies: 45 – 1 Accenture – 1 Achievements – 1 And – 1 Buddhist – 1 California – 1 Center – 1 [...]