Take a screenshot of a site from a URL

Hello!

I am working on an app where a user can post a URL, and I want to parse that URL the same way Facebook does: it takes a screen capture of the web page, grabs the title, and then presents a thumbnail to the user.

I would like to do this from Ruby, planning to use Sidekiq for the parsing.

Thanks!

And where is the question?

Facebook isn’t taking a screenshot. It makes a request to get the HTML of the page and parses the title from that. Then it pulls images from what it determines to be the main body of the article and lets the user choose one. You could simplify this by just picking the first image tag on the page.

If you do want to take a screenshot of the page, you can do so pretty easily using PhantomJS.

This is the example from their website:

// github_screenshot.js

var page = require('webpage').create();
page.open('http://github.com/', function () {
    page.render('github.png');
    phantom.exit();
});

Then from the command line:

phantomjs github_screenshot.js

For more info, see the wiki page on screen capture and the quick start guide.
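
Since the goal is to drive this from Ruby, one option is to shell out to the `phantomjs` binary. The sketch below assumes `phantomjs` is on the `PATH` and that the script above is saved as `github_screenshot.js`; `run_capture_script` is a hypothetical helper name, not part of PhantomJS.

```ruby
# Hypothetical helper: runs a PhantomJS script and reports whether it succeeded.
# The executable is a parameter so a full path can be supplied if needed.
def run_capture_script(script_path, executable: "phantomjs")
  # The multi-argument form of system bypasses the shell, so the path is not
  # subject to shell interpolation. system returns true on exit status 0,
  # false on a non-zero status, and nil if the command could not be started.
  system(executable, script_path) == true
end
```

Calling `run_capture_script("github_screenshot.js")` from a background job would return `true` only when the script exits cleanly, which gives the job a simple way to decide whether to retry.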

@javiercarballo, I had this old link lying around in my bookmarks. It describes building that sort of parsing. It uses jQuery and PHP, but maybe it can help get you started: Parse a link like Facebook

If you do want an actual screenshot, check out http://url2png.com/

Hey everybody!

Thank you so much for your help! This is how I ended up doing it, using PhantomJS’s screen capture feature:

class WebScreenCapture

  def initialize(url,file_name)
    upload_image_to_s3(url,file_name)
  end

  def self.get(file_name)
    s3 = AWS::S3.new
    o = s3.buckets[ENV["AWS_BUCKET_WEB_CAPTURES"]].objects["#{file_name}.png"]
    o.public_url
  end

private

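  def upload_image_to_s3(url,file_name)
    # Fetch the rendered PNG from the screenshot service.
    image = HTTParty.get("http://screenshot.etf1.fr/?url=#{url}")
    s3 = AWS::S3.new
    obj = s3.buckets[ENV["AWS_BUCKET_WEB_CAPTURES"]].objects["#{file_name}.png"]
    # Write the raw response body (the image bytes), not the HTTParty response wrapper.
    obj.write(image.body, acl: :public_read)
  end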

end
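
One thing to watch in `upload_image_to_s3` is that the target URL is interpolated into the query string as-is, so a URL containing `&`, `?`, or `#` can be misread by the screenshot service. Here is a minimal sketch of building that request URL with the standard library. `screenshot_request_url` is a hypothetical helper, and it assumes the service accepts a percent-encoded `url` parameter.

```ruby
require "uri"

# Hypothetical helper: returns the service URL with the target URL safely
# encoded as the "url" query parameter.
def screenshot_request_url(target_url, service: "http://screenshot.etf1.fr/")
  uri = URI.parse(service)
  uri.query = URI.encode_www_form(url: target_url)
  uri.to_s
end
```

The result can be passed straight to `HTTParty.get` in place of the interpolated string.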

and then added a worker with Sidekiq for the parsing:

class ShoutParserWorker
  include Sidekiq::Worker
  def perform(shout_id)
    # find_by_id returns nil for a missing record instead of raising.
    shout = TextShout.find_by_id(shout_id)
    return unless shout

    check_for_urls_and_take_a_screenshot(shout)
    tell_shout_to_rerender_in_the_playground(shout)
  end

private

  def check_for_urls_and_take_a_screenshot(shout)
    # The scheme is optional and may be http or https. The ^ and $ anchors mean
    # a URL is only picked up when it sits on a line of its own.
    url_pattern = /^(?:https?:\/\/)?[\da-z\.-]+\.[a-z\.]{2,6}[\/\w \.-]*\/?$/
    shout.body.scan(url_pattern) do |match|
      WebScreenCapture.new(match, "shout_#{shout.id}")
      shout.update_attributes(
        thumbnail_image_path: "shout_#{shout.id}",
        thumbnail_image_parsed_url: match,
        thumbnail_image_page_title: get_page_title_from(match)
      )
    end
  end

  def get_page_title_from(url)
    response = HTTParty.get(url)
    # Parse the raw HTML body and return the contents of its <title> tag.
    Nokogiri::HTML(response.body).title
  end

  def tell_shout_to_rerender_in_the_playground(shout)
    activity_id = Activity.where(trackable_type: "Shout", trackable_id: shout.id).
                    first.id

    Pusher['pg_activities'].trigger('rerender_activity', {
          id: "#{activity_id}"
        })
  end
end
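
The pattern in `check_for_urls_and_take_a_screenshot` is anchored with `^` and `$`, so it only matches a URL that sits on a line by itself. If shouts can contain a link in the middle of a sentence, a looser scan may be a better fit. The sketch below is an assumption about what is wanted, not a full URL grammar: `extract_urls` is a hypothetical helper that only looks for explicit `http://` or `https://` links.

```ruby
# Hypothetical helper: returns the distinct http(s) URLs found in a text.
def extract_urls(text)
  candidates = text.scan(%r{https?://[^\s<>"']+})
  # Drop punctuation that usually belongs to the sentence, not the URL.
  candidates.map { |url| url.sub(/[.,;:!?)]+\z/, "") }.uniq
end
```

Because every result carries its scheme, each one can be handed to `HTTParty.get` without further normalisation.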

@javiercarballo, Twitter and Facebook are using either Twitter Cards or the Open Graph protocol to build those snippets. Both are pretty easy to support (and I think sites will fall back to Open Graph if Twitter Cards aren’t present). You accomplish this by adding a few meta tags in the <head> section.

Here’s a good write-up on Open Graph: http://ogp.me/

Btw, I wrote a quick gem for parsing Twitter Cards in Ruby. I haven’t done the same for Open Graph/Facebook, but it’s almost exactly the same code just with different attribute names.