Fun with awk and dead people

Just playing around with some Freebase data in preparation for a ‘who died today’ Twitter bot.

First, get the data and find the dates on which the most people died:


hadoop3:Downloads nkodner$ curl -O "http://download.freebase.com/datadumps/latest/browse/people/deceased_person.tsv"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.3M  100 16.3M    0     0   209k      0  0:01:19  0:01:19 --:--:--  248k
hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep "-"|sort|uniq -c|sort -n|tail -11|head
  22 2008-01-03
  22 2008-02-21
  22 2008-05-20
  23 1989-06-07
  23 2009-01-13
  24 2009-01-11
  26 2009-04-03
  27 1912-04-15
  63 2001-09-11
  65 1965-11-08
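
For what it's worth, the same tally can be sketched as a single awk pass instead of sort|uniq -c. This assumes, as in the command above, that the date of death is column 4 of a tab-separated file; the function name count_field is made up for illustration and is not part of the original session.

```shell
# count_field FILE COLUMN  (hypothetical helper, not from the original post)
# Counts how often each non-empty value appears in one tab-separated column
# and prints "count<TAB>value", most frequent first.
count_field() {
  awk -F'\t' -v col="$2" '
    $col != "" { seen[$col]++ }                       # tally each non-empty value
    END { for (v in seen) printf "%d\t%s\n", seen[v], v }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2     # highest count first
}
```

Something like `count_field deceased_person.tsv 4 | head` would then list the busiest dates. Unlike the grep "-" filter, this keeps year-only values, so the two lists need not match exactly.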

Surprised to see 1965-11-08 listed ahead of 2001-09-11. Why? Let's look at where people died on 1965-11-08:

hadoop3:Downloads nkodner$ grep "1965-11-08" deceased_person.tsv |awk -F'\t' '{print $5}' |sort|uniq -c|sort -n
   1 Kenton County
   1 Latium
   1 Leicester
   1 New York City
   1 Toronto
   3
  57 American Airlines Flight 383 Crash Site
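
One caveat about the grep above: it matches the date anywhere on the line, so a row with 1965-11-08 in some other column would be counted too. A stricter sketch compares column 4 exactly, assuming column 4 is the date of death and column 5 the place of death as in the commands above; places_on_date is a hypothetical name.

```shell
# places_on_date FILE DATE  (hypothetical helper, not from the original post)
# Lists places of death (column 5) for rows whose date of death (column 4)
# equals DATE exactly, with a count per place.
places_on_date() {
  awk -F'\t' -v d="$2" '$4 == d { print $5 }' "$1" |
    sort | uniq -c | sort -n
}
```

Calling `places_on_date deceased_person.tsv 1965-11-08` should give the same kind of breakdown as the grep version, minus any rows that only mention the date elsewhere.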

Upon further investigation, it looks as if Freebasers have set up a Victims of AA Flight 383 page, containing info on the deceased. Works for me.

How about which day of the year (month and day, ignoring the year) saw the most deaths?

hadoop3:Downloads nkodner$ awk -F'\t' '{print $4}' deceased_person.tsv|grep "-"|awk -F'-' '{print $2"-"$3}'|sort|uniq -c|sort -n|tail -11|head
 668 02-08
 668 03-06
 672 01-06
 673 02-11
 676 01-28
 677 01-10
 683 01-04
 692 12-31
 702 01-22
 752 02-02
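
Note that grep "-" only requires a hyphen somewhere in the field, so if the dump contains partial dates (a year and month with no day, say) they would produce truncated keys. A slightly stricter sketch, assuming full dates are written YYYY-MM-DD in column 4, keeps only complete dates; month_day_counts is a made-up name.

```shell
# month_day_counts FILE  (hypothetical helper, not from the original post)
# Counts deaths per MM-DD, using only column-4 values that look like a
# complete YYYY-MM-DD date.
month_day_counts() {
  awk -F'\t' '
    $4 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      print substr($4, 6)        # characters 6 onward are MM-DD
    }
  ' "$1" | sort | uniq -c | sort -n
}
```

The digit classes are spelled out rather than written as [0-9]{4} because not every awk supports interval expressions.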

Cause of death?

hadoop3:Downloads nkodner$ awk -F'\t' '{print $3}' deceased_person.tsv|sort|uniq -c|sort -n|tail -11|head
 505 Cardiovascular disease
 603 Tuberculosis
 742 Assassination
 745 Stroke
 799 Pneumonia
 832 Lung cancer
 913 Murder
1618 Suicide
1978 Cancer
2503 Myocardial infarction
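
A note on the tail -11|head idiom used throughout: it prints the ten lines just below the single most frequent value and skips that top line. In an unfiltered column the skipped value is presumably the empty field, though that is a guess, not something checked here. A more explicit sketch drops empty values up front and then takes the top N directly; top_values is a hypothetical name.

```shell
# top_values FILE COLUMN N  (hypothetical helper, not from the original post)
# Prints the N most frequent non-empty values of a tab-separated column,
# least frequent of those first, in uniq -c format.
top_values() {
  awk -F'\t' -v col="$2" '$col != "" { print $col }' "$1" |
    sort | uniq -c | sort -n | tail -n "$3"
}
```

For example, `top_values deceased_person.tsv 3 10` would list ten causes of death without relying on the blank entry being the biggest one.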

And finally, the most common names of the deceased people listed on Freebase:

hadoop3:Downloads nkodner$ awk -F '\t' '{print $1}' deceased_person.tsv |sort|uniq -c|sort -n|tail -11|head
  21 William Anderson
  23 John White
  25 John Campbell
  25 John Wilson
  29 George Smith
  30 John Anderson
  32 William Smith
  34 John Williams
  35 John Taylor
  36 John Smith

Nothing too deep today, but this data might be worth a closer look in R someday.
