
porn, coke, hotties, and divorce: What 23,000 Charlie Sheen tweets look like.

08-Nov-10

For whatever reason, I started capturing tweets that mention Charlie Sheen beginning late on 01-Nov.  A week later, I have 23,000 tweets.  After removing stopwords and filtering out URLs, the result is exactly what you’d expect!
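The cleaning step described above (drop URLs, drop stopwords, count what is left) can be sketched in a few lines of Python. This is only an illustration: the stopword list, the regular expressions, and the sample tweets are assumptions made for the example, not the ones used to produce the results in this post.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "rt", "with", "his"}

URL_RE = re.compile(r"https?://\S+")   # anything that looks like a link
WORD_RE = re.compile(r"[a-z0-9']+")    # runs of letters, digits, apostrophes


def word_counts(tweets):
    """Count words across tweets after dropping URLs and stopwords."""
    counts = Counter()
    for tweet in tweets:
        text = URL_RE.sub(" ", tweet.lower())
        counts.update(w for w in WORD_RE.findall(text) if w not in STOPWORDS)
    return counts


# Made-up tweets, for illustration only.
sample = [
    "Charlie Sheen and the divorce http://example.com/a",
    "RT the Charlie Sheen divorce is in the news http://example.com/b",
]
counts = word_counts(sample)
print(counts.most_common(3))
```

Removing the URLs before tokenizing matters: otherwise fragments such as "http" and "com" end up near the top of the counts.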

A few other interesting(?) findings:

The top 6 hashtags used were

  • celebrity
  • an8balladay
  • gossip
  • funny
  • astrology
  • lenomono (cute – I had to look that one up)

The most often mentioned or quoted users in the Charlie Sheen tweets are

I’m purposely omitting the top Charlie Sheen tweeters because they seem to be mostly spammy accounts and people trying to game the trending topics.

fun with nltk and Zoolander, part 1 – concordance

21-Oct-10

I combined a few of my favorite things, namely hacking, python, natural language, and, of course, Zoolander, and the result makes for some fun output.

One of my favorite features of nltk is the concordance function, which shows the context in which a given term is used.  It is introduced early in the nltk book and never fails to provide entertaining output (hint: see how the word kill lines up?):

A quick Google search yielded the Zoolander script here.  From there, I copied and pasted the text and saved it as zoolander.txt.
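To make the idea of a concordance concrete, here is a minimal standard-library sketch of what such a function does: find every occurrence of a term and print it with a fixed amount of context on either side, so the term lines up in a column. It is not nltk's implementation (in nltk the work is done by the `concordance` method of a `Text` object); the function name, the `width` parameter, and the sample sentence are assumptions for illustration.

```python
import re


def concordance(text, term, width=30):
    """Return one line per occurrence of `term`, with up to `width` characters
    of context on each side, padded so the term starts at the same column."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append("%s%s%s" % (left.rjust(width), match.group(0), right))
    return lines


# Placeholder text; in the post the input would be the contents of zoolander.txt.
sample = "They will kill him. Why would anyone kill a male model? Skill is not a match."
for line in concordance(sample, "kill", width=20):
    print(line)
```

The word-boundary anchors keep "Skill" from matching, and right-justifying the left context is what makes the search term line up from row to row.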

More…

performing a grouped top-n query in pig

23-Aug-10

While generating a large item-item similarity matrix, I found that I need to reduce the amount of data I’m returning to the calling program.  In short, I’m computing the similarity between over 20,000 different ‘items’, and that results in a gigantic dataset of about 3-4 million elements.  I now need to reduce my dataset to the nearest neighbors for each item and prune irrelevant data.

The real problem is:

Given a list of 20,000 items, where each row pairs an item with an ‘other’ item and the Jaccard/Tanimoto similarity between the two, show me the k closest items for each item in my list.


1   3   .00321
1   4   .00256
1   5   .01019
1   6   .00732
2   1   .02136
....
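For readers who have not met it, the Jaccard (Tanimoto) similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch follows, assuming each item is represented by the set of users associated with it; that representation and the ids below are hypothetical, since the real data is not described here.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical data: item id -> set of user ids associated with the item.
items = {
    1: {"u1", "u2", "u3"},
    2: {"u2", "u3", "u4"},
    3: {"u9"},
}
print(jaccard(items[1], items[2]))  # 2 shared users out of 4 distinct users
```

Computing this for every pair of items is what produces the long list of (item, other item, score) rows shown above.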

I thought of doing this in pig but wasn’t really sure how to limit and sort grouped data.  I submitted a question to pig-user and the members were helpful.  Since I learned a new trick, I thought I’d document it here in case anyone else is looking to do the same.

Rather than bore everyone with item ids and similarity scores (and violate an NDA and lose lots of friends), I’ll use example data from one of Oracle’s demo tables, emp:


SQL> select empno,ename,job,sal,deptno from emp order by deptno,sal desc;

     EMPNO ENAME      JOB              SAL     DEPTNO
---------- ---------- --------- ---------- ----------
      7839 KING       PRESIDENT       5000         10
      7782 CLARK      MANAGER         2450         10
      7934 MILLER     CLERK           1300         10
      7788 SCOTT      ANALYST         3000         20
      7902 FORD       ANALYST         3000         20
      7566 JONES      MANAGER         2975         20
      7876 ADAMS      CLERK           1100         20
      7369 SMITH      CLERK            800         20
      7698 BLAKE      MANAGER         2850         30
      7499 ALLEN      SALESMAN        1600         30
      7844 TURNER     SALESMAN        1500         30
      7654 MARTIN     SALESMAN        1250         30
      7521 WARD       SALESMAN        1250         30
      7900 JAMES      CLERK            950         30

From here, I want to limit my data set to the employees with the three highest salaries in each department.  The part that was foreign to me was running multiple statements on my data through each iteration of a FOREACH command.  In my brief career using hadoop and pig, I had never paid much attention to grouping commands together.  After reading pig-user and also looking at some of @TheDataChef’s examples in sounder, I now recognize the value in doing so.

The code to generate the top-n (in this case, the top 3) salaries for each department:

The output produced:

(5000,KING,7839,10)
(2450,CLARK,7782,10)
(1300,MILLER,7934,10)
(3000,SCOTT,7788,20)
(3000,FORD,7902,20)
(2975,JONES,7566,20)
(2850,BLAKE,7698,30)
(1600,ALLEN,7499,30)
(1500,TURNER,7844,30)

This example is easily extended to solve my original problem: reducing the number of similar items found for each item. All I’d need to do is group by the first itemid, sort the items in descending order by their Jaccard/Tanimoto score, and then limit to the top-n similarly-scored items for each original itemid.
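For anyone who wants to experiment with the group, sort, limit pattern without a pig installation, the same logic can be sketched in plain Python. This is an illustrative equivalent, not the pig script used for this post; the function name is made up, and the rows are copied from the emp listing above.

```python
from collections import defaultdict

# (empno, ename, job, sal, deptno), copied from the emp listing above.
EMP = [
    (7839, "KING", "PRESIDENT", 5000, 10),
    (7782, "CLARK", "MANAGER", 2450, 10),
    (7934, "MILLER", "CLERK", 1300, 10),
    (7788, "SCOTT", "ANALYST", 3000, 20),
    (7902, "FORD", "ANALYST", 3000, 20),
    (7566, "JONES", "MANAGER", 2975, 20),
    (7876, "ADAMS", "CLERK", 1100, 20),
    (7369, "SMITH", "CLERK", 800, 20),
    (7698, "BLAKE", "MANAGER", 2850, 30),
    (7499, "ALLEN", "SALESMAN", 1600, 30),
    (7844, "TURNER", "SALESMAN", 1500, 30),
    (7654, "MARTIN", "SALESMAN", 1250, 30),
    (7521, "WARD", "SALESMAN", 1250, 30),
    (7900, "JAMES", "CLERK", 950, 30),
]


def top_n_by_group(rows, n=3):
    """Return (sal, ename, empno, deptno) for the n best-paid rows per department."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[4]].append(row)  # group by deptno
    result = []
    for deptno in sorted(groups):
        # Sort by salary, highest first; Python's sort is stable, so ties
        # keep their input order.
        ranked = sorted(groups[deptno], key=lambda r: r[3], reverse=True)
        for empno, ename, job, sal, dept in ranked[:n]:  # limit to n rows
            result.append((sal, ename, empno, dept))
    return result


for row in top_n_by_group(EMP, 3):
    print(row)
```

For the similarity problem, the grouping key would be the first itemid and the sort key the Jaccard/Tanimoto score. Note that when scores tie at the cutoff, which row survives the limit depends on input order.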

Now that my data will be sufficiently limited after all of the similarity scores have been calculated, I can let this process run, generate lots of scores, and not worry about polluting my database with extraneous data.  Now time to focus on generating item-item similarity without generating so much useless data in the first place!  Would love to hear your suggestions.

Twifficiency scores, analyzed and visualized

18-Aug-10

While I’ve had some success with getting a few celebrities (Fred Durst, Talib Kweli) to respond or show off @TheBotLebowski to others, yesterday Twifficiency one-upped me and took twitter and then the national media by storm.  Fortunately for you, @jamescun, not too many people I know read your little Time Magazine. (I really hope you’re old enough to get that!)

As some of you may or may not know, I aggregate Twitter data and then use tools such as python, hadoop, pig, and R to play with the results.  Today’s task was easy: look through yesterday’s tweets, grab the Twifficiency auto-tweets (eeew), extract the scores, and then see if there are any interesting results.

After all of yesterday’s tweets ran through my parser, I filtered the input data to tweets that looked like they were Twifficiency scores (where 20100817.txt is the file of yesterday’s parsed tweets that gets loaded into HDFS):

grep "My Twifficiency score is [0-9]*%. Whats yours? http://twifficiency.com/$" 20100817.txt >twif.out

I loaded twif.out through some horrible-before-the-coffee-is-even-made python code to produce a few summary statistics and then a file containing just the raw twifficiency scores:
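The script itself is not shown in this excerpt, so the following is only a sketch of the kind of processing described: pull the number out of each matching line and compute a few summary statistics. The function names, the tab-separated line layout, and the sample lines are assumptions, not the real contents of twif.out.

```python
import re
import statistics

# Same shape as the grep pattern above, with the score captured and the
# literal "." and "?" escaped.
SCORE_RE = re.compile(
    r"My Twifficiency score is (\d+)%\. Whats yours\? http://twifficiency\.com/$"
)


def extract_scores(lines):
    """Return the integer score from every line that matches the auto-tweet text."""
    scores = []
    for line in lines:
        match = SCORE_RE.search(line.rstrip("\n"))
        if match:
            scores.append(int(match.group(1)))
    return scores


def summarize(scores):
    """A few summary statistics for a non-empty list of scores."""
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }


# Made-up lines standing in for twif.out; they are not real tweets.
sample = [
    "alice\tMy Twifficiency score is 42%. Whats yours? http://twifficiency.com/\n",
    "bob\tMy Twifficiency score is 7%. Whats yours? http://twifficiency.com/\n",
    "carol\tjust had lunch\n",
    "dave\tMy Twifficiency score is 56%. Whats yours? http://twifficiency.com/\n",
]
scores = extract_scores(sample)
print(scores)
print(summarize(scores))
```

Writing `scores` out one per line would give the raw-scores file mentioned above, ready to load into R.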

More…

And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER

09-Aug-10

Through the magic of hadoop, pig, over 300 million (and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I’d take a look at all of the hashtags containing the beloved f-word.

Let’s get the technical details out of the way.  Since the middle of June, I’ve been saving as many tweets as I can to local storage, using Twitter’s streaming API and my gardenhose access.  Sorry, Cloudera guys, I’m not yet using flume, but it’s high on the to-do list.  Using a 3-node cluster, I’m able to search through these tweets and extract valuable(?) data in a matter of minutes.

The pig script (sorry, looks like gist.github.com doesn’t auto-format pig):

And now the fun stuff. I found over 31,000 different hashtags containing the f-word.  Bonus to the first person who can tell me what GFW is.
The top-ten results and the frequency of their mentions are:

#fuck	10406
#fuckouttahere	3172
#fuckinfollow	3062
#fuckit	2970
#fuckyou	2303
#fuckgfw	1573
#fuckyeah	1551
#fuckery	1436
#fucking	1273
#fuckoff	988
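Counts like these come from a simple filter-and-count over hashtags. The sketch below is not the pig script from this post; it is a small Python illustration of the same idea, using a harmless search term and made-up tweets. The function name and the hashtag regular expression are assumptions.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#\w+")


def hashtags_containing(tweets, needle):
    """Count lowercased hashtags whose text contains `needle`."""
    needle = needle.lower()
    counts = Counter()
    for tweet in tweets:
        for tag in HASHTAG_RE.findall(tweet.lower()):
            if needle in tag:
                counts[tag] += 1
    return counts


# Made-up tweets with a harmless search term, for illustration only.
sample = [
    "Monday again #coffee #needcoffee",
    "#Coffee first, then email",
    "no hashtags here",
    "#coffeetime with friends #weekend",
]
for tag, n in hashtags_containing(sample, "coffee").most_common():
    print("%s\t%d" % (tag, n))
```

Lowercasing before counting is what merges variants such as #Coffee and #coffee into one row.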

Let’s move on to what’s really important: celebrities, sports figures, and other important American topics.

Lady Gaga

#dontfuckwiththegaga	11
#fuckthegagahaters	4
#fuckgaga	3
#fuckladygaga	3
#dontfuckwithgaga	2
#fuckmegaga	2
#fucktrannygaga	1

More…

Meta: Please excuse the ads!

22-Jul-10

As part of testing for another project, I’m experimenting with Google Ads on the site. They shouldn’t be around for more than a day or two.

RT @MyloGang: IT WAS JUST A FUCKIN EARTHQUAKE WTF

16-Jul-10

I woke up to @wahalulu asking me to check my tweets for mentions of the DC Earthquake and am happy to oblige:  I’ll re-run my numbers throughout the day, but here are the mentions since about 7:30 this morning EST.    Raw data here.


World Cup Country hashtag mentions through 190gb of tweets

28-Jun-10

Since the beginning of the 2010 World Cup, I’ve been saving tweets from the twitter gardenhose and trying to find interesting things in the data.  Here is a histogram showing the count of mentions for each country’s hashtag.  With apologies for my lack of effort in ggplot2:

Even though my raw data was sorted by the counts, it appears that the default behavior of ggplot2 (or at least qplot) is an alphabetical sort.  Maybe one of you could help me with this.  Source data is below.

Since my magnificent 2-node hadoop cluster, consisting of my MacBook Pro, an old beat-up MacBook, and a wireless connection, isn’t quite mighty enough, I generated these numbers the old-fashioned way: through the command line.  I’m sitting on too much unprocessed data to send to Amazon S3 for EMR.  After I preprocess the tweets, the size will shrink drastically and I can then send the data to Amazon for further processing.

cat *.json|grep -iPo "#(usa|mex|hon|bra|par|chi|arg|uru|alg|civ|gha|nga|cmr|rsa|prk|jpn|kor|aus|nzl|eng|fra|esp|por|ned|den|ger|sui|ita|svk|svn|srb|gre)\b"|tr '[:upper:]' '[:lower:]'|sort|uniq -c|sort -rg
 More…

Words mentioned in 23-Jun-2010 Canadian Earthquake tweets

24-Jun-10

words mentioned in earthquake tweets 23-jun-2010

Using twitter gardenhose access, remove stopwords and punctuation, sprinkle in a little bit of mapping, some reducing, and voila! The most frequently occurring words in tweets that mentioned earthquake on June 23, 2010. I left earthquake out of the image itself because, being in every tweet, it overwhelmed the rest of the words.  I find it amazing that the most frequently occurring ‘word’ is RT.

Also, wordle seemed to strip out numeric ‘words’, which is a shame because people tweeted the magnitude left and right.  See the data below for the top 100 words.

More…

A free, simple way to backup and search your tweets

16-Jun-10

Edit: I’m not sure why wordpress decided to resize some of my pictures; please let me know how to avoid this.

It’s not like Twitter is going to go all Ma.gnolia on us and lose all of our data, but here’s a way to back up and search through your prior tweets. I’ve been using this since September of last year and it seems to work really well.  You can even save others’ tweets, provided their timeline is public.

Yahoo Alerts, a free email and SMS notification service, has a useful option to send an alert whenever an RSS feed is updated.  Not surprisingly, each of our twitter streams has its own RSS feed.  Combining the two, we can create a process that, behind the scenes, sends each tweet to our gmail (or wherever) accounts.

First, a new alert has to be created.  Make sure you choose Feed/Blog as the alert type:

Then, for the feed’s URL, enter

http://twitter.com/statuses/user_timeline/neilkod.rss

replacing my twitter username (neilkod) with your own. I chose to have the alert sent to my gmail account, using the +twitterbackup suffix as an identifier, and to have the alerts sent as they’re created.

filter options

We’re halfway there.  Now to organize the filters.  We can create a custom gmail filter that looks for messages sent from Yahoo that use our special suffix, twitterbackup in this example.

create a gmail filter

In my case, I chose to have the label TwitterBackup applied, mark the message as read, and archive it.

gmail filter settings

So now, all of our tweets are silently loaded into our gmail account, easily retrieved if we ever need them.  This is helpful because Twitter’s search only goes back so far.

From there, searching is just a matter of looking for keywords within a certain label.  Let’s search for skate:

skate search

And here is the message detail.  You get some of the typical yahoo ‘noise’ at the end of the email, but hey, it’s free!

skate search result

And finally, back to the original tweet, with apologies to @IamJamesHall

http://twitter.com/neilkod/statuses/4835222255

Which is especially nice because twitter search has no recollection of me ever mentioning skate, especially not way back in October of 2009! (But hey, at least it’s fast!)

twitterskatesearch

I’d eventually like to port this to use Yahoo Pipes rather than Yahoo Alerts to receive the data in a more concise format, but this method gets the job done.