Another Ruby Image Scraper
Posted by Ryan Baxter Thu, 08 Jan 2009 00:22:00 GMT
I’ve been poring over a lot of vintage Willys pictures since starting the restoration of my ’58 CJ-5, and anyone who has worked with me knows that I tend to obsess over detail. The scarcity of quality images has been driving me crazy, and I’m amazed at how much contradictory information I’ve found about a vehicle that is only 50 years old. Given my career in technology, I’m always surprised when a Google search returns little or nothing of value.
My hard drive is steadily filling with what I have found, and the old “Right-click, Save Image As…” has become tedious. Late last night I remembered a little image-scraping script I wrote back in August of 2007. I’ve since cleaned it up, added a nifty progress bar, and replaced scrAPI with the Hpricot HTML parser. Neat!
I plan on doing some web crawling with it soon. Stay tuned for that. Without further ado:
# RB
require 'rubygems'
require 'fileutils'
require 'hpricot'
require 'open-uri'
require 'progressbar'
attributes = ['href', 'src']
file_extensions = ['jpg', 'jpeg', 'gif', 'png', 'tiff']
# Return the lowercased text after the last dot, so "photo.JPG" gives "jpg".
def fetch_extension(url)
  url.split('.').last.to_s.downcase
end
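# Download uri and return its body, drawing a progress bar when the server
# reports a content length. On Ruby 3.0 and later, open-uri no longer patches
# Kernel#open; call URI.open instead.
def fetch_file(uri)
  progress_bar = nil
  open(uri, :proxy => nil,
    :content_length_proc => lambda { |length|
      if length && length > 0
        progress_bar = ProgressBar.new(uri.to_s, length)
      end
    },
    :progress_proc => lambda { |progress|
      progress_bar.set(progress) if progress_bar
    }) { |file| return file.read }
end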
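# Save the download in the current directory under a name derived from the
# URI, with slashes and colons replaced by underscores.
def save_file(file_uri)
  # gsub, not gsub!: the bang form returns nil when nothing is replaced.
  file_name = file_uri.to_s.gsub(/[\/:]/, '_')
  File.open(file_name, 'wb') { |file| file.write(fetch_file(file_uri)) }
  puts
end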
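# Yield the value of each listed attribute found on any tag in the page.
def scrape_urls(html, attributes)
  Hpricot.buffer_size = 262144
  doc = Hpricot(html) # parse once instead of once per attribute
  attributes.each { |attribute|
    doc.search("[@#{attribute}]").each { |tag|
      yield tag[attribute]
    }
  }
end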
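# Resolve url against the page's URI and normalize the result.
def to_absolute_uri(original_uri, url)
  # normalize already lowercases the scheme and host; lowercasing the whole
  # URL would break paths on case-sensitive servers.
  url = URI.parse(url)
  url = original_uri + url if url.relative?
  url.normalize
end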
puts 'Enter a URL:'
# chomp, not chomp!: the bang form returns nil when there is no newline to strip.
original_uri = URI.parse(gets.chomp)
html = nil
begin
  open(original_uri, :proxy => nil) { |source| html = source.read }
  scrape_urls(html, attributes) { |url|
    if file_extensions.include?(fetch_extension(url))
      save_file(to_absolute_uri(original_uri, url))
    end
  }
rescue => e
  puts e
end
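The URL handling in the script can be tried on its own, without Hpricot or a network connection. What follows is a minimal sketch, assuming only Ruby's standard uri library. The helper names image_url?, resolve, and local_name are hypothetical and not part of the script above, and the example.com URLs are made up. It shows one way to check the extension against the URL's path only (so a query string does not get in the way), resolve a relative link against the page it came from, and flatten the result into a file name.

```ruby
require 'uri'

# Hypothetical helpers for illustration; they are not part of the script above.
IMAGE_EXTENSIONS = %w[jpg jpeg gif png tiff]

# True when the path part of the URL ends in a known image extension.
# Parsing first keeps a query string such as "?size=large" out of the check.
def image_url?(url)
  extension = File.extname(URI.parse(url).path.to_s).delete('.').downcase
  IMAGE_EXTENSIONS.include?(extension)
end

# Resolve a possibly relative URL against the page it was found on.
# normalize lowercases the scheme and host but leaves the path's case alone.
def resolve(page_uri, url)
  URI.join(page_uri.to_s, url).normalize
end

# Flatten a URI into a file name by replacing slashes and colons.
def local_name(uri)
  uri.to_s.gsub(/[\/:]/, '_')
end

page = URI.parse('http://example.com/jeeps/index.html')
uri = resolve(page, '../images/CJ5.JPG')
puts uri                                # http://example.com/images/CJ5.JPG
puts local_name(uri)                    # http___example.com_images_CJ5.JPG
puts image_url?('photo.JPG?size=large') # true
puts image_url?('notes.txt')            # false
```

Splitting on the last dot, as the script does, is shorter, but it treats "photo.jpg?size=large" as having the extension "jpg?size=large", so parsing the URL first may be the safer choice for pages that append query strings to their image links.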

