Skip to content

Hacking Seinfeld Tweets with Apache Pig – A work in progress

As some of you know, my twitter bot @hellooooonewman responds to every tweet containing the word/hashtag ‘Seinfeld’.  Using Python and the Twitter Search REST API, it looks for mentions and then replies to the original author with a random Seinfeld quote.  People seem to get a kick out of it, judging by its 2,000 followers in about a month’s time.

What most don’t realize is that I’m capturing data each time my program finds a search result.  I capture:

  • Time of day the tweet was received (in California time, currently PDT)
  • Tweet ID of the tweet
  • The author of the tweet
  • The tweet itself
  • The reply sent back to the original author

In a quest to become a former Oracle DBA(not that there’s anything wrong with that!) and move into working with big data, Hadoop, etc, I’ve been spending a lot of time working with Apache Pig, the tool I have currently chosen to analyze large data sets.   After running through the tutorial programs, I thought I would try my hand at some of my own queries.  Here are a few results.

Warning: I went after some seriously low-hanging fruit for these experiments

Using data collected 21-Mar-2010 through 23-Apr-2010, I’ve captured 21,000* tweets about Seinfeld.

*Yes I know this isn’t “Big Data”, I’m just screwing around.

First thing I wanted to find out was what time of the day, specifically what hour were people tweeting about Seinfeld.    Using SQL, this would have been a no-brainer

  select  to_char(timestamp,'HH') hour
       ,  count(*)
    from  seinfeld_tweets
group by  to_char(timestamp,'HH')
order by  to_char(timestamp,'HH');

Easy enough, right? Fortunately, doing the same in Pig isn’t all that bad. The big whoa-moment (Think Keanu Reeves in The Matrix) when using Pig is realizing that the grouping and aggregation take place in different steps. Coming from a strong SQL background, that’s just weird.

Since I haven’t really mastered Pig’s date-handling, I cheated by just using SUBSTRING() to capture the hour of the tweet. A few helpful folks in the Cloudera irc channel told me that Pig’s date-handling functionality is still pretty rough, I decided to take the easy way out so that I can stick with the core learning, and not be fighting with built-in functions that don’t work entirely as they should.

The equivalent Pig code is

A = load '/Users/neil/seinfeld.log' using PigStorage('\t') as (timestamp:chararray,bot:chararray,id:chararray,author:chararray,tweet:chararray,response:chararray);
B = FOREACH A GENERATE author, org.apache.pig.piggybank.evaluation.string.SUBSTRING(timestamp,11,13) as hr;
grpd = GROUP B by hr;
cntd = FOREACH grpd GENERATE $0,COUNT(B);

Not too bad, Line 1 loads the Seinfeld log into A, Line 2 extracts the author of the tweet and the Hour. Line 3 groups by the value of hour, line 4 computes the tweets per hour. I didn’t have to sort the data because the output was already sorted in the order I wanted. This was unexpected. I dont know if I got lucky or this is desired behavior. A stark difference from Oracle where you can never really expect any order to your data, even aggregates, unless you include an ORDER BY clause.

Unscientifically, people tend to tweet more about Seinfeld as the day goes on. The times used in the raw data is the time I captured each seinfeld tweet and not really when the tweet was sent, although I estimate to never be more than ten or 15 seconds behind. I would like to modify my program to incorporate time-zones.

The second test was to see just who is tweeting about Seinfeld. Out of about 216e2 tweets (a nod to my hero @mat_kelcey) we had 15,260 distinct users mentioning Seinfeld.

The SQL code to generate this data is similar to the hour-of-the-day query above:

  select  author
       ,  count(*)
    from  seinfeld_tweets
group by  author
order by  count(*);

The Pig code used is equally similar:

usrs = GROUP A by author;
cntd = FOREACH usrs GENERATE $0 as user,COUNT(A) as cnt;
srtd= ORDER cntd BY cnt;

This time, I explicitly wanted to sort by the # of tweets for each user so I added the ORDER command.

According to R, that’s not terribly exciting data

> summary(srtd$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.417   1.000  97.000

Fine, lets take a look at the top 100 Seinfeld-tweeters

> last<-tail(srtd,100)
> summary(last$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   10.00   14.00   18.23   20.00   97.00

Still not terribly interesting, maybe to me, anyway. We know that most people have about 1 tweet, lets see who the big Seinfeld-tweeters were for the last month or so:

> tail(last)
               author frequency
15255          nuoptv        45
15256 OctavioCalegari        48
15257    AceCostaRica        51
15258       sony_prog        58
15259       ZDFneo_TV        60
15260     kaptainmyke        97

I saved you the work of looking at the profile pages, and saved myself the work of hyperlinking to each profile. Looks like @kaptainmyke is trying to sell us something, so he doesn’t count. @ZDFneo_TV and @sony_prof are both bots, so the winner of the March-April Seinfeld-tweet-off is @AceCostaRica with 51 tweets.

In order to advance my Pig skills, I will continue the experimentation throughout the next few days. One of the things I’m really liking about Pig is that since each step gets its own name, they can be reused as part of an overall larger process, I’ll try and work in some examples.

More interesting results to come.  I spent way too much time futzing around with ggplot2 trying to produce charts that didn’t make this post; I could have been doing more with Pig.  Such is the nature of tinkering.

7 Comments

  1. unknown

    your automatic twitter posting is horrible and you should stop your unwanted posting. You have been turned in for spamming!

    Posted on 26-Apr-10 at 6:18 pm | Permalink
  2. nice post neil,
    i understand your ggplot2 pain; it can produce such beautiful pics, i’ve just got no idea how to do it. :)

    Posted on 29-Apr-10 at 5:50 am | Permalink
  3. admin

    I do fairly well with ggplot2, it’s just that I’d rather be spending the time working on the queries themselves, and the information to be retrieved from the queries, than worry about tweaking visualizations. I’m sure you can relate.

    Posted on 29-Apr-10 at 6:30 am | Permalink
  4. admin

    Sorry you feel so strongly against my twitter bots. 2300 followers can’t be wrong though. I’ll gladly add you to the exclude list if you’d like. Too bad that you’re hiding behind a fake email address.

    Posted on 29-Apr-10 at 6:31 am | Permalink
  5. admin

    oh by no means do I find ggplot2 painful – I just don’t use it enough to remember the syntax and procedures!

    Posted on 28-May-10 at 10:38 am | Permalink
  6. Hello Neil,

    Cool program and analysis. I just came across this in a way that you might find funny. I too have a Twitter/Seinfeld mishmash, but mine works a little differently. My program is a tool to post a quote of the day using multiple accounts as “voices”. See here: http://twitter.com/fragconteur/seinfeld-quote-of-the-day

    Today I just tweeted a program update that had the word Seinfeld, which your program picked up on it and responded brilliantly: “Hello Newman” http://twitter.com/hellooooonewman/status/17196914097

    What a perfect way for one Seinfeld quote machine to greet another.

    (By the way, in your analysis shouldn’t you look at Seinfeld-tweets-per-hour divided by total-tweets-per-hour for a better view of when people are most likely to tweet about it?)

    Posted on 27-Jun-10 at 4:20 pm | Permalink
  7. admin

    Thanks for the comment – I love it when bots collide. A funny example is when a Star Wars bot replies to my Zoolander bot whenever @ACenterForAnts says “What say we settle this on the runway, Han-Solo?”.

    As far as the analysis, yes, more is coming. I’m the type of person who loves to take the time to perform the analysis but finds the writeup part painful. I’ll crunch a few more numbers and update the post with the result!

    Posted on 28-Jun-10 at 3:44 am | Permalink

3 Trackbacks/Pingbacks

  1. [...] This post was mentioned on Twitter by neil kodner, Multimedia Mixer. Multimedia Mixer said: @neilkod good article! http://bit.ly/9kMqAz [...]

  2. uberVU - social comments on 27-Apr-10 at 9:40 am

    Social comments and analytics for this post…

    This post was mentioned on Twitter by neilkod: Hacking Seinfeld Tweets with Apache Pig. More like hacking Pig with Seinfeld Tweets. http://bit.ly/9kMqAz

  3. [...] tweets.  For examples of some of the things I’ve done with the twitter data, check out this analysis of Seinfeld Tweets or this word cloud generated from 23,000 tweets about Charlie Sheen.  Please contact me if [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *
*
*