Friday, August 28, 2009

HN reader survey results

UPDATE: the charts below are smaller than I'd like, so I've posted full-sized versions in a Picasa Web Album if you'd prefer to view something larger.

I've been a fan of the Hacker News aggregation web site ever since I discovered it, and I was intrigued by the quick survey that Dave Lyon posted to HN in order to gather data for a class in machine learning algorithms. In a little more than a day, Dave collected more than 2000 responses, and posted a page pointing to the data collected. Jon von Gillem noticed that the standard charts generated by the Google Spreadsheets survey were fairly simplistic, and crunched the data to squeeze out some histogram and scatter plot goodness.

I've started learning more about R recently, and since I learn best by doing, I decided to take a crack at analyzing the data using R. I quickly abandoned the standard plotting package in favour of the excellent ggplot2 package by Hadley Wickham, which made even the somewhat complex colour scatter plots below easy to generate. Before crunching the data, I removed some of the more "suspect" submissions, and in the end decided to remove submissions with reported income > $200k to better highlight the majority of submissions in the scatter plots below.
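For anyone who wants to try the same kind of clean-up, here is a minimal sketch of the filtering step in base R. The column names, the toy rows, and the age cutoff used to flag "suspect" entries are all assumptions made up for illustration; they are not the real survey fields or the exact rules used for the charts in this post.

```r
# Toy stand-in for the survey responses. Column names and values are
# invented for illustration; this is not the real survey data.
survey <- data.frame(
  age    = c(24, 31, 45, 19, 230),
  income = c(55000, 120000, 250000, 0, 80000),
  hours  = c(40, 50, 60, 15, 40)
)

# Step 1: drop implausible rows. The 10-100 age range is an arbitrary
# assumption standing in for whatever makes a submission "suspect".
plausible <- subset(survey, age >= 10 & age <= 100)

# Step 2: drop reported incomes above $200k so the bulk of the
# responses is easier to see in the scatter plots.
filtered <- subset(plausible, income <= 200000)

print(nrow(filtered))
```

Keeping the two steps separate makes it easy to check how many rows each rule removes before committing to it.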

The following histograms provide a more detailed profile of HN survey participants by age, income, years in their industry, and hours worked each week. I find the age histogram particularly depressing, as I'm definitely in the long tail of the chart. I wonder why there are so few older geeks? Perhaps we disappear in some Logan's Run-esque fashion?

I was curious to see what relationship there might be between some of the variables captured by the survey, and decided to test how income relates to hours worked each week. The scatter plot below does suggest that those working 20 hours or less a week earn less than those working more hours, but working more than 40 hours a week doesn't appear to dramatically increase income.
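One way to put numbers behind an eyeballed comparison like this is to bucket the hours and compare median income per bucket. The sketch below uses base R only; the column names and the toy values are invented for illustration, so the output says nothing about the real survey.

```r
# Toy data with hypothetical column names; not the real survey responses.
survey <- data.frame(
  hours  = c(10, 20, 35, 40, 45, 60, 70),
  income = c(20000, 30000, 70000, 80000, 85000, 90000, 82000)
)

# Bucket weekly hours into the three ranges discussed above.
# cut() uses right-closed intervals by default: (0,20], (20,40], (40,Inf].
survey$bucket <- cut(survey$hours,
                     breaks = c(0, 20, 40, Inf),
                     labels = c("<=20", "21-40", ">40"))

# Median income per bucket; the median is less sensitive to a few
# very high earners than the mean.
median_income <- tapply(survey$income, survey$bucket, median)
print(median_income)
```

Run against the filtered survey data, a table like this would show whether the jump between the first two buckets is really larger than the one between the last two.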

I also wanted to test the relationship between age and income, but group the data by factors such as education and type of employment to see what impact such factors had. I used the spiffy capabilities of ggplot2 to quickly generate the two scatter plots below. To my (admittedly aging) eyes, no patterns immediately jump out - perhaps generating separate scatter plots by the factor elements would help highlight any patterns that may exist.
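The "separate scatter plots by factor" idea maps onto ggplot2's faceting. The following is a sketch only: it assumes the ggplot2 package is installed, and the data frame, column names, and factor levels are hypothetical placeholders rather than the real survey fields.

```r
library(ggplot2)

# Toy data with hypothetical column names and levels; not the real survey data.
survey <- data.frame(
  age        = c(22, 28, 35, 41, 26, 33, 48, 30, 24),
  income     = c(40000, 65000, 90000, 110000, 55000, 75000, 95000, 60000, 35000),
  education  = rep(c("Bachelors", "Masters", "None"), each = 3),
  employment = rep(c("Employee", "Founder", "Freelance"), times = 3)
)

# One panel per education level, with points coloured by employment type.
# facet_wrap() splits the data by the factor so each group gets its own plot.
p <- ggplot(survey, aes(x = age, y = income)) +
  geom_point(aes(colour = employment)) +
  facet_wrap(~ education)

# print(p) draws the plot on the active graphics device.
```

Swapping the faceting and colour variables gives the complementary view, one panel per employment type.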

I've made available the survey results data (filtered to remove both suspect entries and entries with income > $200k) that I used in my analysis if anyone is interested in crunching the data themselves. If you do find something interesting, be sure to drop a note in a comment to this post!
