Skip to content

performing a grouped top-n query in pig

23-Aug-10

Over the course of generating a large item-item similarity matrix, I need to reduce the amount of data I’m returning to the calling program.  In short, i’m computing the similarity between over 20,000 different ‘items’ and that results in a gigantic dataset, to the tune of about 3-4 million elements.  I now need to reduce my dataset down to the nearest neighbors for each item and prune irrelevant data.

The real problem is:

Given a list of 20,000 items, each item has a corresponding ‘other’ item and the Jaccard/Tanimoto similarity between the two items, show me the k-closest items for each item in my list.


1   3   .00321
1   4   .00256
1   5   .01019
1   6   .00732
2   1   .02136
....

I thought of doing this in pig but wasn’t really sure how to limit and sort grouped data.  I submitted a question to pig-user and the members were helpful.  Since I learned a new trick, I thought I’d document it here in case anyone else is looking to do the same.

Rather than bore everyone with item ids and similarity scores(and violating an NDA and losing lots of friends), I’ll use example data from one of Oracle’s demo tables, emp:


SQL> select empno,ename,job,sal,deptno from emp order by deptno,sal desc;

EMPNO ENAME      JOB              SAL     DEPTNO
---------- ---------- --------- ---------- ----------
7839 KING       PRESIDENT       5000         10
7782 CLARK      MANAGER         2450         10
7934 MILLER     CLERK           1300         10
7788 SCOTT      ANALYST         3000         20
7902 FORD       ANALYST         3000         20
7566 JONES      MANAGER         2975         20
7876 ADAMS      CLERK           1100         20
7369 SMITH      CLERK            800         20
7698 BLAKE      MANAGER         2850         30
7499 ALLEN      SALESMAN        1600         30
7844 TURNER     SALESMAN        1500         30
7654 MARTIN     SALESMAN        1250         30
7521 WARD       SALESMAN        1250         30
7900 JAMES      CLERK            950         30

From here, I want to limit my data set to the employees with the top 3-highest salaries for each department.  The part that was foreign to me was running multiple statements on my data through each iteration of a FOREACH command.  Through my brief career using hadoop and pig, I’ve never paid much attention to grouping commands together.  After reading pig-user and also looking at some of @TheDataChef‘s examples in sounder, I now recognize the value in doing so.

The code to generate the top-n (in this case 3) top salaries:

The output produced :

(5000,KING,7839,10)
(2450,CLARK,7782,10)
(1300,MILLER,7934,10)
(3000,SCOTT,7788,20)
(3000,FORD,7902,20)
(2975,JONES,7566,20)
(2850,BLAKE,7698,30)
(1600,ALLEN,7499,30)
(1500,TURNER,7844,30)

This example is easily extended to solve my original problem-reducing the number of similar items found for each item. All I’d need to do is group by the first itemid, sort the items in descending order by their Jaccard/Tanimoto score, and then limit to the top-n similarly-scored items for each original itemid.

Now that my data will be sufficiently limited after all of the similarity scores have been calculated, I can let this process run, generate lots of scores, and not worry about polluting my database with extraneous data.  Now time to focus on generating item-item similarity without generating so much useless data in the first place!  Would love to hear your suggestions.

Twifficiency scores, analyzed and visualized

18-Aug-10

While I’ve had some success with getting a few celebrities to respond or show off @TheBotLebowski to others(fred durst, taleb kweli), Yesterday, Twifficiency one-upped me and took twitter and then the national media by storm.  Fortunately for you, @jamescun, Not too many people I know read your little Time Magazine. (I really hope you’re old enough to get that!)

As some of you may or may not know, I aggregate Twitter data and then use tools such as python, hadoop, pig, and R to play with the results.  Today’s task was easy – Look through yesterday’s tweets, grab the Twifficiency auto-tweets(eeew), extract the scores, and then see if there are any interesting results.

After all of yesterday’s tweets ran through my parser, I then filtered the input data to tweets that looked like they were Twifficiency scores(where 20100817.txt is the file of yesterday’s parsed tweets that gets loaded into HDFS)

grep "My Twifficiency score is [0-9]*%. Whats yours? http://twifficiency.com/$" 20100817.txt >twif.out

I loaded twif.out through some horrible-before-the-coffee-is-even-made python code to produce a few summary statistics and then a file containing just the raw twifficiency scores:

More…

And you thought you were the first to use #DONTFUCKWITHJUSTINBIEBER

09-Aug-10

Through the magic of hadoop, pig, over 300 million(and counting) tweets, and the never-ending creativity of my fellow twitter users, I thought I’d take a look at all of the hashtags containing the beloved f-word.

Lets get the technical details out of the way.  Since the middle of June, I’ve been saving as many tweets as I can to local storage, using Twitter’s streaming API and my gardenhose access.  Sorry, Cloudera guys, I’m not yet using flume, but it’s high on the to-do list.  Using a 3-node cluster, I’m able to search through these tweets and extract valuable(?) data in a matter of minutes.

The pig script(Sorry, looks like gist.github.com doesn’t auto-format pig):

And now the fun stuff. I found over 31,000 different hashtags containing the f-word.  Bonus to the first person who can tell me what GFW is.
The top-ten results and the frequency of their mentions are:

#fuck	10406
#fuckouttahere	3172
#fuckinfollow	3062
#fuckit	2970
#fuckyou	2303
#fuckgfw	1573
#fuckyeah	1551
#fuckery	1436
#fucking	1273
#fuckoff	988

Lets move on to what’s really important-celebrities, sports figures, and other important American topics:

Lady Gaga

#dontfuckwiththegaga	11
#fuckthegagahaters	4
#fuckgaga	3
#fuckladygaga	3
#dontfuckwithgaga	2
#fuckmegaga	2
#fucktrannygaga	1

More…

Meta: Please excuse the ads!

22-Jul-10

As part of testing for another project, I’m experimenting with Google Ads on the site. They shouldn’t be around for more than a day or two.

RT @MyloGang: IT WAS JUST A FUCKIN EARTHQUAKE WTF

16-Jul-10

I woke up to @wahalulu asking me to check my tweets for mentions of the DC Earthquake and am happy to oblige:  I’ll re-run my numbers throughout the day, but here are the mentions since about 7:30 this morning EST.    Raw data here.

earthquake

World Cup Country hashtag mentions through 190gb of tweets

28-Jun-10

Since the beginning of the 2010 World Cup, I’ve been saving tweets from the twitter gardenhose and trying to find interesting things in the data.  Here is a histogram showing the count of mentions for each country’s hashtag.  With apologies for my lack of effort in ggplot2:

Even though my raw data was sorted by the counts, it appears that the default behavior of ggplot2 (or at least qplot) is an alphabetical sort.  Maybe one of you could help me wi this.  Source data is below.

Since my magnficient 2-node hadoop cluster consisting of my MacBook Pro, an old beat-up MacBook and a wireless connection isn’t quite mighty enough, I generated these numbers the old-fashioned way- through the command line.  I’m sitting on too much unprocessed data to send to Amazon S3 for EMR.  After I preprocess the tweets, the size will drastically reduce and I can then send the data to Amazon for further processing.

cat *.json|grep -iPo "#(usa|mex|hon|bra|par|chi|arg|uru|alg|civ|gha|nga|cmr|rsa|prk|jpn|kor|aus|nzl|eng|fra|esp|por|ned|den|ger|sui|ita|svk|svn|srb|gre)\b"|tr '[:upper:]' '[:lower:']|sort|uniq -c|sort -rg
 More…

Words mentioned in 23-Jun-2010 Canadian Earthquake tweets

24-Jun-10

words mentioned in earthquake tweets 23-jan-2010

Using twitter gardenhose access, remove stopwords and punctuation sprinkle in a little bit of mapping, some reducing, and voila! The most frequently-occurring words in tweets that mentioned earthquake from June 23, 2010. I left earthquake out of the image itself because being that it was in every tweet, it overwhelmed the rest of the words.  I find it amazing that the most frequently occurring ‘word’ is RT.

Also, wordle seemed to strip out numeric ‘words’ which is a shame because people tweeted the magnitude left-and-right.  See the data below for the top 100 words.

More…

A free, simple way to backup and search your tweets

16-Jun-10

Edit: I’m not sure why wordpress decided to resize some of my pictures, please let me know how to avoid this.

It’s not like Twitter is going to go all Ma.gnolia on us and lose all of our data but here’s a way to back up and search through your prior tweets. I’ve been using this September of last year and it seems to work really well.  You can even save others’ tweets, provided their timeline is public.

Yahoo Alerts, a free email and SMS notification service has a useful option send an alert whenever a RSS Feed is updated.  Not surprisingly, each of our twitter streams has its own RSS feed.  Combining the two, we can create a process that, behind-the-scenes, sends each tweet to our gmail(or wherever) accounts.

First, a new alert has to be created.  Make sure you choose Feed/Blog as the alert type:

Then, for the feed’s URL, enter

http://twitter.com/statuses/user_timeline/neilkod.rss

replacing my twitter username(neilkod) with your own. I chose to have the alert sent to my gmail account and use the +twitterbackup suffix as an identifier.  I chose to have the alerts sent as they’re created.

filter options

We’re halfway there.  Now to organize the filters.  We can create a custom gmail filter that looks for messages sent from Yahoo that use our special suffix, twitterbackup in this example.

create a gmail filter

In my case, I chose to have the label TwitterBackup applied, mark the message as read, and archive it.

gmail filter settings

So now, all of our tweets are silently loaded into our gmail account, easily retrieved if we ever need them.  This is helpful because Twitter’s search only goes back so far.

From there, searching is just a matter of looking for keywords within a certain label.  Let’s search for skate:

skate search

And here is the message detail.  You get some of the typical yahoo ‘noise’ at the end of the email, but hey, it’s free!

skate search result

And finally, back to the original tweet, with apologies to @IamJamesHall

http://twitter.com/neilkod/statuses/4835222255

Which is especially nice because twitter search has no recollection of me ever mentioning skate, especially not way back in October of 2009!(but hey, at least its fast!).

twitterskatesearch

I’d eventually like to port this to use Yahoo Pipes rather than Google Alerts to receive the data in a more concise format, but this method gets the job done.

Hacking Seinfeld Tweets with Apache Pig – A work in progress

23-Apr-10

As some of you know, my twitter bot @hellooooonewman responds to every tweet containing the word/hashtag ‘Seinfeld’.  Using Python and the Twitter Search REST API, it looks for mentions and then replies to the original author with a random Seinfeld quote.  People seem to get a kick out of it, judging by its 2,000 followers in about a month’s time.

What most don’t realize is that I’m capturing data each time my program finds a search result.  I capture:

  • Time of day the tweet was received (in California time, currently PDT)
  • Tweet ID of the tweet
  • The author of the tweet
  • The tweet itself
  • The reply sent back to the original author

In a quest to become a former Oracle DBA(not that there’s anything wrong with that!) and move into working with big data, Hadoop, etc, I’ve been spending a lot of time working with Apache Pig, the tool I have currently chosen to analyze large data sets.   After running through the tutorial programs, I thought I would try my hand at some of my own queries.  Here are a few results.

Warning: I went after some seriously low-hanging fruit for these experiments

Using data collected 21-Mar-2010 through 23-Apr-2010, I’ve captured 21,000* tweets about Seinfeld.

*Yes I know this isn’t “Big Data”, I’m just screwing around.

First thing I wanted to find out was what time of the day, specifically what hour were people tweeting about Seinfeld.    Using SQL, this would have been a no-brainer

  select  to_char(timestamp,'HH') hour
       ,  count(*)
    from  seinfeld_tweets
group by  to_char(timestamp,'HH')
order by  to_char(timestamp,'HH');

Easy enough, right? Fortunately, doing the same in Pig isn’t all that bad. The big whoa-moment (Think Keanu Reeves in The Matrix) when using Pig is realizing that the grouping and aggregation take place in different steps. Coming from a strong SQL background, that’s just weird.

Since I haven’t really mastered Pig’s date-handling, I cheated by just using SUBSTRING() to capture the hour of the tweet. A few helpful folks in the Cloudera irc channel told me that Pig’s date-handling functionality is still pretty rough, I decided to take the easy way out so that I can stick with the core learning, and not be fighting with built-in functions that don’t work entirely as they should.

The equivalent Pig code is

A = load '/Users/neil/seinfeld.log' using PigStorage('\t') as (timestamp:chararray,bot:chararray,id:chararray,author:chararray,tweet:chararray,response:chararray);
B = FOREACH A GENERATE author, org.apache.pig.piggybank.evaluation.string.SUBSTRING(timestamp,11,13) as hr;
grpd = GROUP B by hr;
cntd = FOREACH grpd GENERATE $0,COUNT(B);

Not too bad, Line 1 loads the Seinfeld log into A, Line 2 extracts the author of the tweet and the Hour. Line 3 groups by the value of hour, line 4 computes the tweets per hour. I didn’t have to sort the data because the output was already sorted in the order I wanted. This was unexpected. I dont know if I got lucky or this is desired behavior. A stark difference from Oracle where you can never really expect any order to your data, even aggregates, unless you include an ORDER BY clause.

Unscientifically, people tend to tweet more about Seinfeld as the day goes on. The times used in the raw data is the time I captured each seinfeld tweet and not really when the tweet was sent, although I estimate to never be more than ten or 15 seconds behind. I would like to modify my program to incorporate time-zones.

The second test was to see just who is tweeting about Seinfeld. Out of about 216e2 tweets (a nod to my hero @mat_kelcey) we had 15,260 distinct users mentioning Seinfeld.

The SQL code to generate this data is similar to the hour-of-the-day query above:

  select  author
       ,  count(*)
    from  seinfeld_tweets
group by  author
order by  count(*);

The Pig code used is equally similar:

usrs = GROUP A by author;
cntd = FOREACH usrs GENERATE $0 as user,COUNT(A) as cnt;
srtd= ORDER cntd BY cnt;

This time, I explicitly wanted to sort by the # of tweets for each user so I added the ORDER command.

According to R, that’s not terribly exciting data

> summary(srtd$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.417   1.000  97.000

Fine, lets take a look at the top 100 Seinfeld-tweeters

> last<-tail(srtd,100)
> summary(last$frequency)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   10.00   14.00   18.23   20.00   97.00

Still not terribly interesting, maybe to me, anyway. We know that most people have about 1 tweet, lets see who the big Seinfeld-tweeters were for the last month or so:

> tail(last)
               author frequency
15255          nuoptv        45
15256 OctavioCalegari        48
15257    AceCostaRica        51
15258       sony_prog        58
15259       ZDFneo_TV        60
15260     kaptainmyke        97

I saved you the work of looking at the profile pages, and saved myself the work of hyperlinking to each profile. Looks like @kaptainmyke is trying to sell us something, so he doesn’t count. @ZDFneo_TV and @sony_prof are both bots, so the winner of the March-April Seinfeld-tweet-off is @AceCostaRica with 51 tweets.

In order to advance my Pig skills, I will continue the experimentation throughout the next few days. One of the things I’m really liking about Pig is that since each step gets its own name, they can be reused as part of an overall larger process, I’ll try and work in some examples.

More interesting results to come.  I spent way too much time futzing around with ggplot2 trying to produce charts that didn’t make this post; I could have been doing more with Pig.  Such is the nature of tinkering.

Tiger Woods Apology word cloud and word-frequencies

19-Feb-10

See that tiny little ‘s-o-r-r-y’? Right under the giant ‘p-e-o-p-l-e’. He’s spoke more about others’ reactions to his infidelities than he did taking responsibility for his own actions. But what did we expect?

Tiger Woods Apology

Tiger Woods Apology

Word-frequencies:
45 – 1
Accenture – 1
Achievements – 1
And – 1
Buddhist – 1
California – 1
Center – 1
Character – 1
Commissioner – 1
DC – 1
December – 1
Despite – 1
Earl – 1
February – 1
Finally – 1
Finchem – 1
From – 1
Good – 1
However – 1
Instead – 1
Learning – 1
Now – 1
Obviously – 1
PGA – 1
Parents – 1
Part – 1
Please – 1
Southern – 1
Starting – 1
TOUR – 1
Thank – 1
Thanks – 1
Thanksgiving – 1
That – 1
Thats – 1
There – 1
These – 1
Thirteen – 1
Today – 1
Washington – 1
We – 1
What – 1
Whatever – 1
Woods – 1
Your – 1
above – 1
acceptable – 1
actions – 1
actively – 1
admired – 1
admit – 1
affairs – 1
again – 1
age – 1
ago – 1
alone – 1
amends – 1
angers – 1
answers – 1
any – 1
atone – 1
attacked – 1
aware – 1
because – 1
before – 1
believed – 1
bitterly – 1
board – 1
born – 1
business – 1
calls – 1
can – 1
causes – 1
centered – 1
changed – 1
chase – 1
cheated – 1
cheered – 1
childhood – 1
closest – 1
commercial – 1
completely – 1
concerned – 1
considerable – 1
continues – 1
convinced – 1
core – 1
count – 1
couple – 1
craving – 1
critical – 1
dad – 1
daughter – 1
days – 1
decency – 1
dedicate – 1
dedicated – 1
deeply – 1
deserved – 1
deserves – 1
details – 1
different – 1
direction – 1
directly – 1
directors – 1
disappointed – 1
disappointment – 1
discussing – 1
doesnt – 1
doing – 1
domestic – 1
dreams – 1
drifted – 1
drugs – 1
early – 1
education – 1
emails – 1
embarrassed – 1
encouragement – 1
end – 1
endorsements – 1
engaged – 1
enjoy – 1
enormous – 1
entire – 1
entitled – 1
envisioned – 1
episode – 1
especially – 1
example – 1
expressing – 1
fabricate – 1
facing – 1
failures – 1
faith – 1
false – 1
fame – 1
families – 1
fans – 1
fellow – 1
field – 1
first – 1
focus – 1
follow – 1
form – 1
game – 1
grace – 1
grow – 1
guidance – 1
happened – 1
heard – 1
heart – 1
helping – 1
her – 1
here – 1
hit – 1
home – 1
hope – 1
however – 1
hurting – 1
husband – 1
importance – 1
importantly – 1
impulse – 1
including – 1
inpatient – 1
integrity – 1
intend – 1
joining – 1
just – 1
keeping – 1
kept – 1
knew – 1
learn – 1
letters – 1
live – 1
lives – 1
location – 1
long – 1
looking – 1
lost – 1
maintain – 1
man – 1
married – 1
matter – 1
matters – 1
media – 1
millions – 1
mistakes – 1
model – 1
mom – 1
money – 1
morning – 1
move – 1
needs – 1
normal – 1
now – 1
once – 1
ordeal – 1
ourselves – 1
outside – 1
over – 1
overcome – 1
pain – 1
paparazzi – 1
part – 1
partners – 1
path – 1
patience – 1
peers – 1
performanceenhancing – 1
personal – 1
personally – 1
phone – 1
photographs – 1
plan – 1
play – 1
please – 1
point – 1
pointed – 1
pointless – 1
poise – 1
position – 1
practiced – 1
praise – 1
press – 1
probably – 1
proceed – 1
process – 1
professional – 1
professionally – 1
public – 1
pursued – 1
put – 1
question – 1
raised – 1
ran – 1
reach – 1
reached – 1
real – 1
realize – 1
reason – 1
receive – 1
received – 1
receiving – 1
recognize – 1
regain – 1
released – 1
relying – 1
remains – 1
remarks – 1
repeated – 1
repeating – 1
report – 1
respectful – 1
restraint – 1
role – 1
rule – 1
said – 1
sake – 1
same – 1
save – 1
scholars – 1
school – 1
schools – 1
search – 1
security – 1
seeing – 1
seek – 1
seeking – 1
separate – 1
setting – 1
shame – 1
shield – 1
should – 1
shown – 1
simply – 1
some – 1
someday – 1
somehow – 1
space – 1
special – 1
speculated – 1
spiritual – 1
spotlight – 1
staff – 1
staked – 1
start – 1
started – 1
starts – 1
steps – 1
stop – 1
stopped – 1
story – 1
straight – 1
supported – 1
sure – 1
taken – 1
temptations – 1
than – 1
thats – 1
they – 1
think – 1
thousands – 1
throughout – 1
times – 1
today – 1
together – 1
tomorrow – 1
track – 1
treatment – 1
true – 1
truly – 1
two – 1
twoandahalfyearold – 1
unchanged – 1
unhappy – 1
until – 1
us – 1
utterly – 1
values – 1
violence – 1
wants – 1
week – 1
weeks – 1
whatever – 1
when – 1
where – 1
whether – 1
which – 1
why – 1
wifes – 1
wishes – 1
words – 1
worry – 1
written – 1
wrongdoings – 1
year – 1
Buddhism – 2
But – 2
In – 2
It – 2
My – 2
People – 2
The – 2
This – 2
When – 2
achieve – 2
also – 2
always – 2
apology – 2
around – 2
away – 2
balance – 2
become – 2
better – 2
blame – 2
boundaries – 2
brought – 2
change – 2
come – 2
continue – 2
course – 2
damage – 2
didnt – 2
discuss – 2
each – 2
ever – 2
far – 2
felt – 2
following – 2
foolish – 2
forward – 2
foundation – 2
get – 2
go – 2
golf – 2
good – 2
hard – 2
hurt – 2
important – 2
involved – 2
irresponsible – 2
issue – 2
issues – 2
leave – 2
like – 2
living – 2
look – 2
making – 2
marriage – 2
means – 2
most – 2
mother – 2
other – 2
others – 2
our – 2
questions – 2
recent – 2
remain – 2
rules – 2
selfish – 2
sponsors – 2
still – 2
students – 2
support – 2
teaches – 2
through – 2
time – 2
tried – 2
understanding – 2
unfaithful – 2
used – 2
wanted – 2
way – 2
were – 2
while – 2
worked – 2
world – 2
would – 2
wrong – 2
years – 2
your – 2
youve – 2
As – 3
For – 3
Im – 3
Its – 3
Many – 3
Some – 3
To – 3
an – 3
apply – 3
as – 3
ask – 3
been – 3
between – 3
caused – 3
day – 3
down – 3
every – 3
everyone – 3
find – 3
had – 3
its – 3
learned – 3
let – 3
make – 3
need – 3
night – 3
owe – 3
person – 3
players – 3
private – 3
really – 3
return – 3
right – 3
sorry – 3
taught – 3
them – 3
thought – 3
understand – 3
up – 3
we – 3
with – 3
young – 3
They – 4
about – 4
believe – 4
but – 4
children – 4
did – 4
dont – 4
friends – 4
how – 4
lot – 4
made – 4
many – 4
more – 4
only – 4
or – 4
say – 4
thank – 4
their – 4
therapy – 4
there – 4
these – 4
those – 4
at – 5
by – 5
could – 5
done – 5
help – 5
kids – 5
out – 5
room – 5
so – 5
work – 5
all – 6
do – 6
family – 6
life – 6
myself – 6
never – 6
not – 6
one – 6
what – 6
wife – 6
Ive – 7
be – 7
has – 7
is – 7
on – 7
from – 8
know – 8
things – 8
who – 8
am – 9
are – 9
behavior – 9
it – 9
want – 9
Elin – 10
will – 10
people – 11
this – 11
was – 11
for – 19
you – 19
a – 22
in – 23
that – 24
have – 28
me – 29
of – 29
the – 38
and – 48
my – 53
to – 77
I – 105