I want to learn more about one of my favorite sites, Woot. My stats friend Drew had the idea that he might be able to predict a Woot-off if he had data about Woot items. Parsing a website is easy enough. After looking around Woot for a few minutes, I found that the blog holds all the data I needed. Luckily, the data goes back to May 2006.
Let's start.
require "rubygems"
require "hpricot"
require "open-uri"
uri_base = "http://www.woot.com/Forums/Default.aspx?p="
Hpricot.buffer_size = 262144
Each blog page holds about 20 Woot items. It turns out that there are 58 pages of items. The Woot-off days are the days with more than one item.
master_list = []
(0..58).each{|page|
  # Fetch and parse one page of the blog listing
  doc = Hpricot(open(uri_base+page.to_s))
  rows = (doc/'tr.itemRow')
  # Build a [date, title] pair for each item row
  data = rows.map{|row| ["#{(row/'div.saleMonth').text} #{(row/'div.saleDay').text} #{(row/'div.saleYear').text}","#{(row/'div.saleTitle/a').text}"] }
  master_list.push(*data)
  sleep(5) # wait 5 seconds between requests
}
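As a rough sketch of the Woot-off rule above (a day with more than one item), the dates can be counted once the list is filled. The helper name `woot_off_days` and the sample rows are made up for illustration; the only assumption is that each entry is a [date, title] pair like the ones in master_list.

```ruby
# Hypothetical helper: returns the dates that appear more than once.
# items is an array of [date, title] pairs, the same shape as master_list.
def woot_off_days(items)
  counts = Hash.new(0)
  items.each { |date, _title| counts[date] += 1 }
  counts.select { |_date, count| count > 1 }.keys
end

# Made-up sample rows, for illustration only.
sample = [
  ["May 1 2006", "Item A"],
  ["May 2 2006", "Item B"],
  ["May 2 2006", "Item C"],
  ["May 3 2006", "Item D"]
]

puts woot_off_days(sample).inspect
```

Calling woot_off_days(master_list) would give the candidate Woot-off dates from the scraped data.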
After this I create a CSV file with data. This allows me to re-parse the data later for anything.
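DATE, ITEM = 0, 1 # positions within each [date, title] pair
File.open("woot.data","w+"){|file|
  file<<"Date,Item\n"
  master_list.each{|woot_item|
    # Quote the item so commas in titles don't split the column
    file.puts woot_item[DATE]+",\""+woot_item[ITEM].gsub("\"","\"\"")+"\""
  }
}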
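Re-parsing the file later could look something like this. It is only a sketch: it uses Ruby's standard csv library on a made-up sample string in the same Date,Item layout, and it assumes item titles containing commas are wrapped in double quotes.

```ruby
require "csv"

# Made-up sample in the same Date,Item layout as woot.data.
text = <<~DATA
  Date,Item
  May 1 2006,"Widget, Blue"
  May 2 2006,Gadget
DATA

# headers: true uses the first line as column names.
rows = CSV.parse(text, headers: true)
rows.each { |row| puts "#{row["Date"]}: #{row["Item"]}" }
```

For the real file, CSV.read("woot.data", headers: true) would take the place of CSV.parse.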
My code and data files are in my SVN.