
Visualization of Frequently Occurring First Names and Surnames From the 1990 Census

A really quick visualization I did while researching data for another project. Census.gov has a link to the most frequently occurring first names and surnames from the 1990 census. Surely more current data must exist; I found this dataset by accident.

The original data is tab-delimited in the format:

  • Name
  • Frequency in percent
  • Cumulative Frequency in percent
  • Rank

The data was already sorted by rank, so it was easy to build lists of the top 500 names in each category (male first, female first, surname):

head -500 dist.female.first | awk '{print $1":"$2}'
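For anyone who would rather do this step in Python, here is a rough equivalent of the one-liner above. It is only a sketch: the function name `top_names_for_wordle` is made up for illustration, it assumes the columns are separated by whitespace, and unlike `head` it skips blank lines rather than counting them.

```python
from itertools import islice


def top_names_for_wordle(lines, limit=500):
    """Return 'NAME:frequency' strings for the first `limit` data rows.

    `lines` is any iterable of text lines in the census layout:
    name, frequency, cumulative frequency, rank.
    """
    # Skip blank lines (an assumption; `head` would count them).
    non_blank = (line for line in lines if line.strip())
    result = []
    for line in islice(non_blank, limit):
        fields = line.split()  # split on any run of whitespace
        result.append(f"{fields[0]}:{fields[1]}")
    return result


if __name__ == "__main__":
    # Made-up rows in the same layout, purely for illustration.
    sample = ["AAA 1.5 1.5 1", "BBB 0.5 2.0 2", "CCC 0.25 2.25 3"]
    print("\n".join(top_names_for_wordle(sample, limit=2)))
```

To run it against the real file, pass an open file handle instead of the sample list, e.g. `top_names_for_wordle(open("dist.female.first"))`.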

The data was then loaded into Wordle for a quick visualization. Thumbnails are linked to full-size versions. Where I’m headed with this data is to build a corpus of first names and surnames so that I can develop a spelling corrector, along the lines of Peter Norvig’s sublime spelling corrector. Think Google’s Did You Mean… rather than a spel checker. I plan on a proof-of-concept in Python, followed by an Oracle PL/SQL version. Another fun project would be to calculate the probability of a given first name + surname. I plan on spending some time searching for more current data.
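To make the two ideas a bit more concrete, here is a minimal sketch of what the Python proof of concept might look like. It is not the finished corrector: it only borrows the single-edit candidate generation from Norvig's approach, and the names `edits1`, `suggest`, and `full_name_probability` are hypothetical. It assumes the names are stored in uppercase, that the frequencies are the percent values from the census files, and (a big simplification) that first names and surnames are independent of each other.

```python
import string


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    letters = string.ascii_uppercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(name, freq):
    """Return the most frequent known name within one edit of `name`.

    `freq` maps uppercase names to their frequency in percent.
    A name that is already known, or has no known neighbor, is returned as-is.
    """
    name = name.upper()
    if name in freq:
        return name
    candidates = [w for w in edits1(name) if w in freq]
    if not candidates:
        return name
    return max(candidates, key=freq.get)


def full_name_probability(first, last, first_freq, last_freq):
    """Estimate P(first + last), assuming the two names are independent.

    Frequencies are in percent, so each is divided by 100 first.
    Unknown names get probability 0.
    """
    p_first = first_freq.get(first.upper(), 0.0) / 100.0
    p_last = last_freq.get(last.upper(), 0.0) / 100.0
    return p_first * p_last


if __name__ == "__main__":
    # Made-up frequencies, purely for illustration.
    surnames = {"SMITH": 1.0, "SMYTH": 0.01}
    first_names = {"JOHN": 3.0, "JON": 0.1}
    print(suggest("smiht", surnames))
    print(suggest("jonh", first_names))
    print(full_name_probability("John", "Smith", first_names, surnames))
```

Choosing the most frequent candidate is what gives this its Did You Mean… flavor: a misspelling is nudged toward the name people most likely meant, rather than just being flagged as wrong.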

500 most popular surnames

top 500 surnames

male first names

top male first names

female first names

top female first names
