<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>crunchlife: Another Ruby Image Scraper</title>
    <link>http://crunchlife.com/articles/2009/01/07/another-ruby-image-scraper</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description></description>
    <item>
      <title>Another Ruby Image Scraper</title>
      <description>&lt;p&gt;I&amp;#8217;ve been pouring over a lot of vintage Willys pictures since starting the restoration of &lt;a href="http://crunchlife.com/articles/2008/10/27/the-cj-5" target="_blank"&gt;my 58&amp;#8217; CJ-5&lt;/a&gt; and anyone that has worked with me knows that I tend to obsess over detail. The few quality images I&amp;#8217;ve found has been driving me crazy and I&amp;#8217;m amazed at how much contradicting information I&amp;#8217;ve found about a vehicle that is &lt;strong&gt;only&lt;/strong&gt; 50 years old. Given my career in technology, I&amp;#8217;m always surprised when a Google search returns little or nothing of value.&lt;/p&gt;

&lt;p&gt;My hard drive is steadily filling with what I have found and the old &lt;i&gt;&amp;#8220;Right-click, Save Image As&amp;#8230;&amp;#8221;&lt;/i&gt; has become tedious. Late last night I remembered a little &lt;a href="http://crunchlife.com/articles/2007/08/13/code-snippet-ruby-image-scraper" target="_blank"&gt;image scraping script&lt;/a&gt; I wrote back in August of 2007. I&amp;#8217;ve since cleaned it up, added a nifty progress bar, and replaced scrAPI with the &lt;a href="http://github.com/why/hpricot/tree/master" target="_blank"&gt;Hpricot&lt;/a&gt; HTML parser. Neat!&lt;/p&gt;

&lt;p&gt;I plan on doing some web crawling with it soon. Stay tuned for that. Without further ado:&lt;/p&gt;

&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="comment"&gt;# RB&lt;/span&gt;

&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;rubygems&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;fileutils&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;hpricot&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;open-uri&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;require&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;progressbar&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;

&lt;span class="ident"&gt;attributes&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;['&lt;/span&gt;&lt;span class="string"&gt;href&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;src&lt;/span&gt;&lt;span class="punct"&gt;']&lt;/span&gt;
&lt;span class="ident"&gt;file_extensions&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="punct"&gt;['&lt;/span&gt;&lt;span class="string"&gt;jpg&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;jpeg&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;gif&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;png&lt;/span&gt;&lt;span class="punct"&gt;',&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;tiff&lt;/span&gt;&lt;span class="punct"&gt;']&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;fetch_extension&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;      
  &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;split&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;.&lt;/span&gt;&lt;span class="punct"&gt;').&lt;/span&gt;&lt;span class="ident"&gt;last&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;fetch_file&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;progress_bar&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt; 
  &lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="symbol"&gt;:proxy&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt;
    &lt;span class="symbol"&gt;:content_length_proc&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="ident"&gt;lambda&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;length&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;length&lt;/span&gt; &lt;span class="punct"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="number"&gt;0&lt;/span&gt; &lt;span class="punct"&gt;&amp;lt;&lt;/span&gt; &lt;span class="ident"&gt;length&lt;/span&gt;
        &lt;span class="ident"&gt;progress_bar&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;ProgressBar&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;new&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;uri&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;to_s&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;length&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt; 
    &lt;span class="punct"&gt;},&lt;/span&gt;
    &lt;span class="symbol"&gt;:progress_proc&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="ident"&gt;lambda&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;progress&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="ident"&gt;progress_bar&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;set&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;progress&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;progress_bar&lt;/span&gt;
    &lt;span class="punct"&gt;})&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;}&lt;/span&gt;        
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;save_file&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;file_uri&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;  
  &lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;file_uri&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;to_s&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;gsub!&lt;/span&gt;&lt;span class="punct"&gt;(/&lt;/span&gt;&lt;span class="regex"&gt;[&lt;span class="escape"&gt;\/&lt;/span&gt;:]&lt;/span&gt;&lt;span class="punct"&gt;/,&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;_&lt;/span&gt;&lt;span class="punct"&gt;'),&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;wb&lt;/span&gt;&lt;span class="punct"&gt;')&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; 
    &lt;span class="ident"&gt;file&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;write&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;fetch_file&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;file_uri&lt;/span&gt;&lt;span class="punct"&gt;));&lt;/span&gt; &lt;span class="ident"&gt;puts&lt;/span&gt;
  &lt;span class="punct"&gt;}&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;scrape_urls&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;attributes&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;      
  &lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;buffer_size&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="number"&gt;262144&lt;/span&gt;
  &lt;span class="ident"&gt;attributes&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;attribute&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="constant"&gt;Hpricot&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;).&lt;/span&gt;&lt;span class="ident"&gt;search&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;[@&lt;span class="expr"&gt;#{attribute}&lt;/span&gt;]&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;).&lt;/span&gt;&lt;span class="ident"&gt;map&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;tag&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="keyword"&gt;yield&lt;/span&gt; &lt;span class="ident"&gt;tag&lt;/span&gt;&lt;span class="punct"&gt;[&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="expr"&gt;#{attribute}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;]&lt;/span&gt;
    &lt;span class="punct"&gt;}&lt;/span&gt;
  &lt;span class="punct"&gt;}&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;to_absolute_uri&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;original_uri&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;url&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;parse&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;downcase&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;     
  &lt;span class="ident"&gt;url&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;original_uri&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt; &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;relative?&lt;/span&gt;  
  &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;normalize&lt;/span&gt;        
&lt;span class="keyword"&gt;end&lt;/span&gt;

&lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="punct"&gt;'&lt;/span&gt;&lt;span class="string"&gt;Enter a URL:&lt;/span&gt;&lt;span class="punct"&gt;'&lt;/span&gt;
&lt;span class="ident"&gt;original_uri&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;URI&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;parse&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;gets&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;chomp!&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;

&lt;span class="ident"&gt;html&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;

&lt;span class="keyword"&gt;begin&lt;/span&gt;
  &lt;span class="ident"&gt;open&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;original_uri&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="symbol"&gt;:proxy&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="constant"&gt;nil&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;{|&lt;/span&gt;&lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt; &lt;span class="ident"&gt;html&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="ident"&gt;source&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;read&lt;/span&gt;&lt;span class="punct"&gt;()}&lt;/span&gt;

  &lt;span class="ident"&gt;scrape_urls&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;html&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;attributes&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt; &lt;span class="punct"&gt;{&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="ident"&gt;file_extensions&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;fetch_extension&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt; &lt;span class="keyword"&gt;then&lt;/span&gt;
      &lt;span class="ident"&gt;save_file&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;to_absolute_uri&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;original_uri&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;url&lt;/span&gt;&lt;span class="punct"&gt;))&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="punct"&gt;}&lt;/span&gt;
&lt;span class="keyword"&gt;rescue&lt;/span&gt; &lt;span class="punct"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="ident"&gt;e&lt;/span&gt;
  &lt;span class="ident"&gt;puts&lt;/span&gt; &lt;span class="ident"&gt;e&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
      <pubDate>Wed, 07 Jan 2009 16:22:00 -0800</pubDate>
      <guid isPermaLink="false">urn:uuid:32992b2b-63b9-4751-bdc2-cfc4aae59484</guid>
      <author>Ryan Baxter</author>
      <link>http://crunchlife.com/articles/2009/01/07/another-ruby-image-scraper</link>
      <category>Code Snippets</category>
      <category>CJ5</category>
      <category>Ruby</category>
    </item>
    <item>
      <title>"Another Ruby Image Scraper" by Jorge</title>
      <description>thanks !

please remove the downcase in


"def to_absolute_uri(original_uri, url)
  url = URI.parse(url.downcase) 
..."



"def to_absolute_uri(original_uri, url)
  url = URI.parse(url) 
..."


as in some sites filenames are case sensitive and it all what you will get is "page not found" error 404</description>
      <pubDate>Sat, 06 Nov 2010 11:50:23 -0700</pubDate>
      <guid isPermaLink="false">urn:uuid:a121aba7-a484-4005-90f1-3d61a420d951</guid>
      <link>http://crunchlife.com/articles/2009/01/07/another-ruby-image-scraper#comment-80942</link>
    </item>
    <item>
      <title>"Another Ruby Image Scraper" by Ryan Baxter</title>
      <description>Oops! I've had to fix a couple of bugs already. The save_file method uses the gsub! method of the String type to save the images using their URL as a file name. The colon character was causing problems as a file name in Windows. I replaced the first parameter of gsub! with a regular expression that handles both forward-slashes and colons.
&lt;br /&gt;&lt;br /&gt;
The scrape_urls method wasn't searching multiple tag attributes correctly. I fixed it by iterating an Array of attributes and searching for them one at a time.

</description>
      <pubDate>Thu, 08 Jan 2009 08:18:14 -0800</pubDate>
      <guid isPermaLink="false">urn:uuid:e66b29e7-bdf9-4bda-a0a9-faa3b9a6e381</guid>
      <link>http://crunchlife.com/articles/2009/01/07/another-ruby-image-scraper#comment-19441</link>
    </item>
  </channel>
</rss>

