While I’ve had some success getting a few celebrities to respond to or show off @TheBotLebowski to others (Fred Durst, Talib Kweli), yesterday Twifficiency one-upped me and took Twitter and then the national media by storm. Fortunately for you, @jamescun, not too many people I know read your little Time Magazine. (I really hope you’re old enough to get that!)
As some of you may or may not know, I aggregate Twitter data and then use tools such as Python, Hadoop, Pig, and R to play with the results. Today’s task was easy: look through yesterday’s tweets, grab the Twifficiency auto-tweets (eeew), extract the scores, and then see if there are any interesting results.
After all of yesterday’s tweets ran through my parser, I filtered the input data down to tweets that looked like Twifficiency scores (20100817.txt is the file of yesterday’s parsed tweets that gets loaded into HDFS):
grep "My Twifficiency score is [0-9]*%. Whats yours? http://twifficiency.com/$" 20100817.txt >twif.out
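For anyone without grep handy, roughly the same filter can be sketched in Python. This is only an illustration, not the code used for this post: the function name `is_twifficiency_tweet` and the sample lines are made up, and the pattern is a slightly stricter take on the grep expression above (the dots and question mark are escaped, and at least one digit is required).

```python
import re

# Stricter variant of the grep pattern: literal dots and question mark,
# and at least one digit before the percent sign.
PATTERN = re.compile(
    r"My Twifficiency score is [0-9]+%\. Whats yours\? http://twifficiency\.com/$"
)

def is_twifficiency_tweet(line):
    """Return True if the line ends with a Twifficiency auto-tweet."""
    return PATTERN.search(line.rstrip("\n")) is not None

# Made-up, tab-separated sample lines in the (id, timestamp, user, tweet) layout.
sample = [
    "1\t2010-08-17\tsomeuser\tMy Twifficiency score is 42%. Whats yours? http://twifficiency.com/",
    "2\t2010-08-17\totheruser\tjust had coffee",
]
matches = [line for line in sample if is_twifficiency_tweet(line)]
print(len(matches))
```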
I loaded twif.out through some horrible-before-the-coffee-is-even-made Python code to produce a few summary statistics and a file containing just the raw Twifficiency scores:
#!/usr/bin/env python3
import re
import numpy

scorelist = []
# tally of how many times each score from 0 to 100 (inclusive) appears
scores = {i: 0 for i in range(101)}

# open the raw-score output file and read the filtered tweets
with open('scores.txt', 'w') as f, open('twif.out') as tweets:
    for line in tweets:
        (tweet_id, ts, user, tweet) = line.strip().split('\t')
        # extract the numeric score from the tweet; anchoring on "score is"
        # skips digits elsewhere in the tweet and allows a three-digit 100
        thescore = re.search(r'score is ([0-9]+)%', tweet)
        scoreval = int(thescore.group(1))
        scorelist.append(scoreval)
        # eventually do something with the timestamp, etc.
        # print("%s\t%s\t%s" % (user, ts, scoreval))
        scores[scoreval] += 1
        # write the score to the raw data file
        f.write("%s\n" % scoreval)

a = numpy.array(scorelist)
print(numpy.size(a))     # number of scores
print(numpy.std(a))      # standard deviation
print(numpy.average(a))  # mean
Now, the good stuff. Out of 7,089 Twifficiency tweets from yesterday (the gardenhose has been severely throttled lately), the scores range from 0 to 99. I think I remember @jamescun mentioning that the max score is 100%, but I haven’t seen it yet. The mean Twifficiency score is 38.5285 and the standard deviation is 11.1036. The score at the 25th percentile is 32, the median score is 39, and the score at the 75th percentile is 47.
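If you want to compute the same kind of summary on your own list of scores, here is a minimal sketch using only the Python standard library. It is not the code that produced the numbers above: the helper name `summarize` and the example scores are made up, and the quartile method (`method="inclusive"`, which interpolates linearly between data points) is an assumption, so percentiles may differ slightly from other tools.

```python
import statistics

def summarize(scores):
    """Return count, mean, population std dev, and quartiles for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),  # population std dev, like numpy.std's default
        "q1": q1,
        "median": median,
        "q3": q3,
    }

# Made-up scores for illustration; real input would be the values in scores.txt.
example = [10, 20, 30, 40, 50]
print(summarize(example))
```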
After loading the scores into R and plotting a histogram, they seem to follow a pretty normal distribution (update: they don’t; check out @johnmyleswhite’s comment below).
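Eyeballing a histogram only goes so far, so here is a rough sketch of one way to put a number on "how normal is it": the Kolmogorov-Smirnov D statistic, which is the largest gap between the empirical distribution of the scores and a normal curve fitted to their mean and standard deviation. The function name `ks_statistic_normal` and the sample data are made up for illustration, and the sketch only computes D; it does not produce a p-value.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(scores):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    data = sorted(scores)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))  # sample std dev
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # Compare with the empirical CDF just before and just after this point.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Made-up data: evenly spaced normal quantiles versus a heavily lopsided sample.
bell_shaped = [NormalDist(0, 1).inv_cdf((i + 0.5) / 100) for i in range(100)]
lopsided = [1] * 90 + [100] * 10
print(ks_statistic_normal(bell_shaped))
print(ks_statistic_normal(lopsided))
```

A small D means the scores track the fitted normal curve closely; a large D means they do not. With thousands of tweets, even a fairly small D can be statistically significant.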
Although I consider myself a prominent auto-tweeter (@TheBotLebowski, @HelloooooNewman, @ACenterForAnts), I’m not crazy about the idea of having an app send a tweet without permission. Let’s hope @jamescun fixes this or at least explains it a little better. Having said that, I love seeing this type of story and wish James the best, even if he is emo (just kidding, James!!).
Edit: the raw Twifficiency score data may be found here.


2 Comments
This is very cool. I think your distribution looks asymmetric enough that it’s not normal. I’d try a K-S test as follows in R:
scores <- load.data()
m <- mean(scores)
s <- sd(scores)
ks.test(scores, 'pnorm', m, s)
See this page for more info: http://sekhon.berkeley.edu/stats/html/ks.test.html
Thanks for the comment. As soon as I hit post, I knew you would chime in! I appreciate it.
The ks.test() returns
One-sample Kolmogorov-Smirnov test
data: scores$score
D = 0.0616, p-value < 2.2e-16
alternative hypothesis: two-sided