
Friday, August 28, 2009

HN reader survey results

UPDATE: the charts below are smaller than I'd like, so I've posted full-sized versions in a Picasa Web Album if you'd prefer to view something larger.

I've been a fan of the Hacker News aggregation web site ever since I discovered it, and I was intrigued by the quick survey that Dave Lyon posted to HN in order to gather data for a class in machine learning algorithms. In a little more than a day, Dave collected more than 2000 responses, and posted a page pointing to the data collected. Jon von Gillem noticed that the standard charts generated by the Google Spreadsheets survey were fairly simplistic, and crunched the data to squeeze out some histogram and scatter plot goodness.

I've started learning more about R recently, and since I learn best by doing, I decided to take a crack at analyzing the data using R. I quickly abandoned the standard plotting package in favour of the excellent ggplot2 package by Hadley Wickham, which made even the somewhat complex colour scatter plots below easy to generate. Before crunching the data, I removed some of the more "suspect" submissions. In the end I also removed submissions with a reported income above $200k, so that the scatter plots better highlight the majority of submissions.
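For anyone who wants to try something similar, here is a minimal sketch of the income filter and a colour scatter plot in ggplot2. The data frame and its column names (age, income, employment) are made up for illustration and are not the actual survey columns, and this is not the exact code used for the charts in this post.

```r
library(ggplot2)

# Hypothetical stand-in for the survey responses; the column names and
# values are assumptions, not the real survey data.
survey <- data.frame(
  age        = c(22, 31, 45, 28, 37),
  income     = c(40000, 85000, 250000, 120000, 60000),
  employment = c("student", "employed", "founder", "employed", "founder")
)

# Drop submissions with a reported income above $200k so that a few
# outliers do not compress the rest of the points.
trimmed <- subset(survey, income <= 200000)

# Scatter plot of income against age, coloured by employment status.
p <- ggplot(trimmed, aes(x = age, y = income, colour = employment)) +
  geom_point()
print(p)
```

Mapping a column to colour inside aes() is what lets ggplot2 pick the colours and build the legend automatically, which is most of why the coloured scatter plots were so little work.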