Using Hpricot to take steps to rid the world of abusive cut and paste
I’ve been working on a little script to pull data from the extremely cool and useful weather feeds from the NOAA. One XML element, <weather>, describes the general weather conditions with a short string like “Sunny” or “Thunderstorms”. I needed some sort of spec for this field: its maximum length and the set of possible strings. The closest thing I could find was this, which contains all the info I needed but in a completely useless format. In the good old days (even two months ago), I would have taken half an hour to copy and paste all the text into a flat file and then run some sort of macro over and over again to format it properly. No more.
Thanks to _why’s Hpricot, Ruby can do all the parsing and formatting for me, with very little code.
First, looking at the source of the document, I noticed that at least the formatting was consistent. All of the possible weather values were within <td> tags with the class “graybackground” and were delimited by ’|’. It’s this easy:
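require 'hpricot'
require 'open-uri'

# Fetch and parse the NOAA page that lists the weather condition strings
doc = Hpricot(open("http://www.weather.gov/data/current_obs/weather.php"))
types = []
# Each matching cell holds several conditions separated by '|'
doc.search('td.graybackground').each do |el|
  types.concat el.inner_html.gsub(/<p.+p>/,'').split('|')
end
types.collect! { |t| t.strip }
# Print each condition with its length, shortest first
types.sort { |a,b| a.length <=> b.length }.each do |t|
  puts t.length.to_s << " : " << t << "\n" unless t.empty?
end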
to get this output:
3 : Fog
4 : Haze
4 : Fair
4 : Snow
4 : Sand
4 : Rain
4 : Hail
4 : Dust
5 : Smoke
5 : Windy
5 : Clear
7 : Drizzle
8 : Rain Fog
8 : Overcast
8 : Fog/Mist
8 : Snow Fog
9 : Rain Snow
9 : Snow Rain
10 : Sand Storm
10 : Heavy Snow
. . .
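Since what I was really after was a spec for the field, the same cleanup can be boiled down to a de-duplicated list and a maximum length. Here is a minimal sketch of that idea in plain Ruby. The SAMPLE_CELLS strings and the weather_types helper are made up for illustration (they stand in for the inner HTML of the real table cells, which is not reproduced here), so treat it as a rough outline rather than the exact script I ran.

```ruby
# Hypothetical stand-ins for the inner HTML of two 'td.graybackground' cells.
SAMPLE_CELLS = [
  "<p>Fog group</p>Fog | Fog/Mist | Haze",
  "Rain | Rain Snow | Fog "
]

# Illustrative helper (not part of Hpricot): strip the <p>...</p> markup,
# split each cell on '|', trim whitespace, and drop blanks and duplicates.
def weather_types(cells)
  cells.flat_map { |html| html.gsub(/<p.+p>/, '').split('|') }
       .map { |t| t.strip }
       .reject { |t| t.empty? }
       .uniq
end

types = weather_types(SAMPLE_CELLS)
longest = types.max_by { |t| t.length }

puts "#{types.size} distinct values"
puts "longest is #{longest.length} characters: #{longest}"
```

With the real page, the array passed to weather_types would presumably be built from doc.search('td.graybackground') by collecting each element's inner_html.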
No cutting and pasting repetitive text from the web ever again!