Just playing around with some Freebase data in preparation for a ‘who died today’ Twitter bot.
First, get the data and find out on which date the most people died:
hadoop3:Downloads nkodner$ curl -O "http://download.freebase.com/datadumps/latest/browse/people/deceased_person.tsv"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.3M  100 16.3M    0     0   209k      0  0:01:19  0:01:19 --:--:--  248k
hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep "-"|sort|uniq -c|sort -n|tail -11|head
22 2008-01-03
22 2008-02-21
22 2008-05-20
23 1989-06-07
23 2009-01-13
24 2009-01-11
26 2009-04-03
27 1912-04-15
63 2001-09-11
65 1965-11-08
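The count-and-rank pipeline above comes up again for every question in this post, so it could be wrapped in a small shell function. This is only a sketch: the name top_counts and its arguments are made up for illustration, and it assumes a tab-separated file. Note two differences from the command above: it sorts in descending order and keeps the single highest count (tail -11|head prints ten of the last eleven lines and drops the final one), and it filters out only empty values, so the grep "-" step for dates is not included.

```shell
# top_counts FILE COLUMN [N]
# Hypothetical helper (not from the original commands): print the N most
# frequent non-empty values found in tab-separated column COLUMN of FILE.
top_counts() {
  file=$1
  col=$2
  n=${3:-10}   # default to the top 10 when N is omitted
  awk -F'\t' -v col="$col" '$col != "" {print $col}' "$file" |
    sort | uniq -c | sort -rn | head -n "$n"
}
```

Something like top_counts deceased_person.tsv 4 would then rank the values of column 4, which the commands here treat as the date of death.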
Surprised to see 1965-11-08 listed ahead of 2001-09-11. Why? Let's look at where people died on 1965-11-08:
hadoop3:Downloads nkodner$ grep "1965-11-08" deceased_person.tsv |awk -F'\t' '{print $5}' |sort|uniq -c|sort -n
1 Kenton County
1 Latium
1 Leicester
1 New York City
1 Toronto
3
57 American Airlines Flight 383 Crash Site
Upon further investigation, it looks as if Freebasers have set up a Victims of AA Flight 383 page, containing info on the deceased. Works for me.
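One caveat on the grep above: it matches the date anywhere on the line, so a row could in principle be picked up because of some other column. A sketch that compares only the date-of-death column, assuming (as the commands here do) that column 4 is the date and column 5 the place; the function name places_on_date is hypothetical:

```shell
# places_on_date FILE DATE
# Hypothetical helper: count places of death for rows whose date of death
# (column 4) equals DATE exactly. Column 5 is assumed to hold the place.
places_on_date() {
  awk -F'\t' -v d="$2" '$4 == d {print $5}' "$1" |
    sort | uniq -c | sort -rn
}
```

Usage would look like places_on_date deceased_person.tsv 1965-11-08. Rows with no place recorded still show up as a count with an empty label.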
How about the day of the year (month-day, regardless of year) on which the most people died?
hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep "-"|awk -F'-' '{print $2"-"$3}'|sort|uniq -c|sort -n|tail -11|head
668 02-08
668 03-06
672 01-06
673 02-11
676 01-28
677 01-10
683 01-04
692 12-31
702 01-22
752 02-02
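The month-day pipeline above splits on "-" and prints the second and third pieces, so a partial date such as 1965-11 would come out as "11-". Whether the dump actually contains such values was not checked here; assuming it might, a stricter sketch keeps only values shaped like a full YYYY-MM-DD date. The function name month_day_counts is made up for illustration:

```shell
# month_day_counts FILE [N]
# Hypothetical helper: rank month-day pairs (MM-DD) taken from column 4,
# using only values that look like a complete YYYY-MM-DD date.
month_day_counts() {
  awk -F'\t' '
    $4 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      print substr($4, 6)   # characters 6 onward are "MM-DD"
    }' "$1" |
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

The digit classes are spelled out rather than written as [0-9]{4} because not every awk supports interval expressions.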
Method of death?
hadoop3:Downloads nkodner$ awk -F'\t' '{print $3}' deceased_person.tsv|sort|uniq -c|sort -n|tail -11|head
505 Cardiovascular disease
603 Tuberculosis
742 Assassination
745 Stroke
799 Pneumonia
832 Lung cancer
913 Murder
1618 Suicide
1978 Cancer
2503 Myocardial infarction
And finally, the most common names of the deceased people listed on Freebase:
hadoop3:Downloads nkodner$ awk -F '\t' '{print $1}' deceased_person.tsv |sort|uniq -c|sort -n|tail -11|head
21 William Anderson
23 John White
25 John Campbell
25 John Wilson
29 George Smith
30 John Anderson
32 William Smith
34 John Williams
35 John Taylor
36 John Smith
Nothing too deep today, but this data might be worth a closer look in R someday.
