A really quick visualization I did while researching data for another project. Census.gov has a link to the most frequently occurring first names and surnames from the 1990 census. Surely more current data must exist; I found this dataset by accident.
The original data is tab-delimited in the format:
- Name
- Frequency in percent
- Cumulative Frequency in percent
- Rank
The data was already sorted by rank, so it was easy to build lists of the top 500 names in each category (male first, female first, surname):
head -500 dist.female.first | awk '{print $1":"$2}'
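For anyone who would rather stay in Python, here is a rough equivalent of the one-liner above. It is only a sketch: the function names are mine, it assumes the columns are separated by whitespace (tabs or spaces), and the sample rows are made up to show the layout, not taken from the census files.

```python
from typing import Iterable, List, Tuple


def top_names(lines: Iterable[str], limit: int = 500) -> List[Tuple[str, float]]:
    """Return (name, frequency_percent) pairs for the first `limit` data rows.

    Relies on the input already being sorted by rank, as the census files are.
    """
    pairs: List[Tuple[str, float]] = []
    for line in lines:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0], float(fields[1])))
        if len(pairs) >= limit:
            break
    return pairs


def to_weighted_words(pairs: List[Tuple[str, float]]) -> List[str]:
    """Format pairs as name:weight, the same shape the awk command prints."""
    return [f"{name}:{freq}" for name, freq in pairs]


# Made-up rows in the name / frequency / cumulative frequency / rank layout.
sample = [
    "ALICE 1.500 1.500 1",
    "BETH 1.250 2.750 2",
    "CARA 0.750 3.500 3",
]
print("\n".join(to_weighted_words(top_names(sample, limit=2))))
```

To run it on a real file, pass an open file handle instead of the sample list, e.g. `top_names(open("dist.female.first"))`. Note that the weights go through `float`, so trailing zeros in the file are dropped, unlike the awk output.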
The data was then loaded into Wordle for a quick visualization. Thumbnails are linked to full-size versions.
Where I’m headed with this data is to build a corpus of first names and surnames so that I can develop a spelling corrector, along the lines of Peter Norvig’s sublime spelling corrector. Think Google’s Did You Mean… rather than a spel checker. I plan on a proof-of-concept in Python, followed by an Oracle PL/SQL version. Another fun project would be to calculate the probability of a given first name + surname. I plan on spending some time searching for more current data.
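To make the two ideas a little more concrete, here is a minimal sketch of what the Python proof-of-concept might look like. It borrows the candidate-generation idea from Norvig's corrector but only looks one edit away (his version also considers two edits), and it is not his code. The function names, the assumption that names are stored in upper case, and the frequencies in the example are all mine and purely illustrative. The probability function also assumes first names and surnames are independent, which is a simplification.

```python
import string
from typing import Dict, Set


def edits1(word: str) -> Set[str]:
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_uppercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(name: str, freqs: Dict[str, float]) -> str:
    """Return the most frequent known name within one edit, else the input."""
    word = name.upper()  # assumption: the corpus keys are upper case
    if word in freqs:
        return word
    candidates = [w for w in edits1(word) if w in freqs]
    if not candidates:
        return word  # nothing close enough; give the input back unchanged
    return max(candidates, key=freqs.get)


def full_name_probability(first: str, surname: str,
                          first_freqs: Dict[str, float],
                          surname_freqs: Dict[str, float]) -> float:
    """Estimate P(first, surname) as a product, assuming independence.

    Frequencies are percentages, so each one is divided by 100 first.
    """
    p_first = first_freqs.get(first.upper(), 0.0) / 100
    p_surname = surname_freqs.get(surname.upper(), 0.0) / 100
    return p_first * p_surname


# Made-up frequencies (percent), for illustration only.
first_freqs = {"JOHN": 3.0, "JOAN": 0.5}
surname_freqs = {"SMITH": 1.0, "SMYTH": 0.01}

print(suggest("smiht", surname_freqs))
print(suggest("joxn", first_freqs))
print(full_name_probability("John", "Smith", first_freqs, surname_freqs))
```

Because the male and female first-name lists are separate files, the first-name frequency here is a share within one list rather than within the whole population, so the product is only a rough estimate.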



