Skip to content

performing a grouped top-n query in pig

Over the course of generating a large item-item similarity matrix, I need to reduce the amount of data I’m returning to the calling program.  In short, i’m computing the similarity between over 20,000 different ‘items’ and that results in a gigantic dataset, to the tune of about 3-4 million elements.  I now need to reduce my dataset down to the nearest neighbors for each item and prune irrelevant data.

The real problem is:

Given a list of 20,000 items, each item has a corresponding ‘other’ item and the Jaccard/Tanimoto similarity between the two items, show me the k-closest items for each item in my list.


1   3   .00321
1   4   .00256
1   5   .01019
1   6   .00732
2   1   .02136
....

I thought of doing this in pig but wasn’t really sure how to limit and sort grouped data.  I submitted a question to pig-user and the members were helpful.  Since I learned a new trick, I thought I’d document it here in case anyone else is looking to do the same.

Rather than bore everyone with item ids and similarity scores(and violating an NDA and losing lots of friends), I’ll use example data from one of Oracle’s demo tables, emp:


SQL> select empno,ename,job,sal,deptno from emp order by deptno,sal desc;

EMPNO ENAME      JOB              SAL     DEPTNO
---------- ---------- --------- ---------- ----------
7839 KING       PRESIDENT       5000         10
7782 CLARK      MANAGER         2450         10
7934 MILLER     CLERK           1300         10
7788 SCOTT      ANALYST         3000         20
7902 FORD       ANALYST         3000         20
7566 JONES      MANAGER         2975         20
7876 ADAMS      CLERK           1100         20
7369 SMITH      CLERK            800         20
7698 BLAKE      MANAGER         2850         30
7499 ALLEN      SALESMAN        1600         30
7844 TURNER     SALESMAN        1500         30
7654 MARTIN     SALESMAN        1250         30
7521 WARD       SALESMAN        1250         30
7900 JAMES      CLERK            950         30

From here, I want to limit my data set to the employees with the top 3-highest salaries for each department.  The part that was foreign to me was running multiple statements on my data through each iteration of a FOREACH command.  Through my brief career using hadoop and pig, I’ve never paid much attention to grouping commands together.  After reading pig-user and also looking at some of @TheDataChef‘s examples in sounder, I now recognize the value in doing so.

The code to generate the top-n (in this case 3) top salaries:

The output produced :

(5000,KING,7839,10)
(2450,CLARK,7782,10)
(1300,MILLER,7934,10)
(3000,SCOTT,7788,20)
(3000,FORD,7902,20)
(2975,JONES,7566,20)
(2850,BLAKE,7698,30)
(1600,ALLEN,7499,30)
(1500,TURNER,7844,30)

This example is easily extended to solve my original problem-reducing the number of similar items found for each item. All I’d need to do is group by the first itemid, sort the items in descending order by their Jaccard/Tanimoto score, and then limit to the top-n similarly-scored items for each original itemid.

Now that my data will be sufficiently limited after all of the similarity scores have been calculated, I can let this process run, generate lots of scores, and not worry about polluting my database with extraneous data.  Now time to focus on generating item-item similarity without generating so much useless data in the first place!  Would love to hear your suggestions.

One Comment

  1. when doing topN and N is small

    s = sort relation by $1 desc;
    h = limit s 10;

    is equivalent to

    g = group relation all;
    h = foreach g { i = TOP(10,1,relation); generate flatten(i); }

    but the second doesn’t require a sort (it’s all done in a single pass with a heap). this means no reduce step making it more composable by pig with other map tasks.

    not sure if it’s doable in a generate block though… i’ll have to try it out sometime.

    mat

    hat tip to @squarecog for teaching me this one :)

    Posted on 21-Feb-12 at 11:03 pm | Permalink

Post a Comment

Your email is never published nor shared. Required fields are marked *
*
*