I am working on an app where a user can post a URL, and I want to parse that URL the same way Facebook does: they take a screen capture of the web page, grab the title, and then present a thumbnail to the user.
I would like to do this from Ruby, and I am planning to use Sidekiq for the parsing.
There’s no screenshot taking place. Facebook makes a request to get the HTML of the page and parses the title from that. Then it pulls images from what it determines to be the main body of the article and lets the user choose one. You could simplify this by picking the first image tag you see.
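As a rough illustration of that simplified approach, here is a minimal sketch that pulls the title and the first image out of a page's HTML. The `link_preview` helper is a hypothetical name, not something from Facebook or any library. It deliberately uses only the Ruby standard library and regexes so it runs without gems; a real HTML parser such as Nokogiri would be more robust on messy markup.

```ruby
require "cgi"
require "uri"

# Hypothetical helper (not from any library): returns the page title and the
# first <img> source found in the given HTML, resolved against the page URL.
def link_preview(html, page_url)
  # First <title> element; /m lets the title span several lines.
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1]
  # First <img> tag that carries a quoted src attribute.
  src = html[/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/im, 1]

  {
    title: title && CGI.unescapeHTML(title.strip),
    # Turn relative paths such as "/img/a.png" into absolute URLs.
    image: src && URI.join(page_url, src).to_s
  }
end

# Possible usage, assuming the page is reachable (not run here):
#   require "net/http"
#   html = Net::HTTP.get(URI("http://example.com/"))
#   link_preview(html, "http://example.com/")
```

Either value is `nil` when the page has no title or no image, so the caller can decide what to show instead.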
@javiercarballo, I had this old link lying around in my bookmarks. It describes creating that sort of parsing. It uses jQuery and PHP, but maybe it can help get you started: Parse a link like Facebook
Thank you so much for your help. This is how I ended up doing it, using PhantomJS’s screen capture feature:
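require "cgi"
# HTTParty and the AWS SDK (v1 API) are assumed to be loaded by Bundler.

# Captures a web page through a remote screenshot service and stores the
# resulting PNG on S3 with public read access.
class WebScreenCapture
  def initialize(url, file_name)
    upload_image_to_s3(url, file_name)
  end

  # Returns the public URL of a previously stored capture.
  def self.get(file_name)
    s3 = AWS::S3.new
    o = s3.buckets[ENV["AWS_BUCKET_WEB_CAPTURES"]].objects["#{file_name}.png"]
    o.public_url
  end

  private

  def upload_image_to_s3(url, file_name)
    # Escape the target URL so its own query string survives as one parameter.
    image = HTTParty.get("http://screenshot.etf1.fr/?url=#{CGI.escape(url)}")
    s3 = AWS::S3.new
    obj = s3.buckets[ENV["AWS_BUCKET_WEB_CAPTURES"]].objects["#{file_name}.png"]
    # Write the raw response body (the PNG bytes), not the response object.
    obj.write(image.body, acl: :public_read)
  end
end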
I then added a Sidekiq worker for the parsing:
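class ShoutParserWorker
  include Sidekiq::Worker

  # Matches only a line that consists of nothing but a URL.
  # Uses `https?` (not `http?`) so that https:// links are recognized too.
  URL_PATTERN = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

  def perform(shout_id)
    # `find` raises ActiveRecord::RecordNotFound for a missing record,
    # so no nil check is needed afterwards.
    shout = TextShout.find(shout_id)
    check_for_urls_and_take_a_screenshot(shout)
    tell_shout_to_rerender_in_the_playground(shout)
  end

  private

  def check_for_urls_and_take_a_screenshot(shout)
    # gsub is used only to iterate over the matches; its result is discarded.
    shout.body.gsub(URL_PATTERN) do |match|
      WebScreenCapture.new(match, "shout_#{shout.id}")
      shout.update_attributes(
        thumbnail_image_path: "shout_#{shout.id}",
        thumbnail_image_parsed_url: match,
        thumbnail_image_page_title: get_page_title_from(match)
      )
    end
  end

  def get_page_title_from(url)
    response = HTTParty.get(url)
    # Parse the response body explicitly rather than the response object.
    Nokogiri::HTML(response.body).title
  end

  def tell_shout_to_rerender_in_the_playground(shout)
    activity = Activity.where(trackable_type: "Shout", trackable_id: shout.id).first
    return unless activity # nothing to rerender if no activity was recorded

    Pusher['pg_activities'].trigger('rerender_activity', { id: activity.id.to_s })
  end
end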
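One thing to watch in the worker: because its pattern is anchored with `^` and `$`, it only fires when a line of the body is nothing but a URL. If links can appear in the middle of a sentence, something along these lines might work better. This is only a sketch; `URL_IN_TEXT` and `extract_urls` are made-up names, and the pattern is intentionally loose rather than a full URL grammar.

```ruby
# Hypothetical pattern: an http(s) scheme followed by anything that is not
# whitespace, an angle bracket, or a quote.
URL_IN_TEXT = %r{https?://[^\s<>"']+}i

# Returns every distinct http(s) URL found anywhere in the text.
def extract_urls(text)
  text.scan(URL_IN_TEXT)
      .map { |url| url.sub(/[.,;:!?)]+\z/, "") } # drop trailing punctuation
      .uniq
end
```

Each returned URL could then be handed to the screenshot and title steps, for example with `extract_urls(shout.body).first` if only one thumbnail per shout is wanted.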
@javiercarballo, Twitter and Facebook are using either Twitter Cards or the Open Graph Protocol to make those snippets. They are both pretty easy to do (and I think sites will fall back to Open Graph if Twitter Cards aren’t present). You accomplish this by adding a few meta tags in the <head> section.
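To make the meta tag idea concrete, here is a rough sketch of reading those tags from a page's HTML, preferring `twitter:*` values and falling back to `og:*` ones. The helper names `meta_tags` and `card_preview` are invented for this example, and the regex-based scanning is an assumption chosen to keep it dependency-free; it will miss unquoted attributes or a `>` inside an attribute value, which a proper HTML parser would handle.

```ruby
require "cgi"

# Hypothetical helper: collects <meta> tags into a hash keyed by their
# `property` (Open Graph) or `name` (Twitter Cards) attribute.
def meta_tags(html)
  html.scan(/<meta\b[^>]*>/i).each_with_object({}) do |tag, tags|
    # Parse quoted attributes; the backreference keeps the quote style matched.
    attrs = tag.scan(/([\w:-]+)\s*=\s*(["'])(.*?)\2/m)
               .map { |name, _quote, value| [name.downcase, value] }
               .to_h
    key = attrs["property"] || attrs["name"]
    # Keep the first occurrence of each key.
    tags[key] ||= CGI.unescapeHTML(attrs["content"]) if key && attrs["content"]
  end
end

# Hypothetical helper: builds a preview, preferring Twitter Card values and
# falling back to the Open Graph equivalents.
def card_preview(html)
  tags = meta_tags(html)
  %w[title description image].each_with_object({}) do |field, preview|
    preview[field.to_sym] = tags["twitter:#{field}"] || tags["og:#{field}"]
  end
end
```

A field that is declared in neither vocabulary comes back as `nil`, which is the point where falling back to the plain `<title>` or the first image would make sense.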
Btw, I wrote a quick gem for parsing Twitter Cards in Ruby. I haven’t done the same for Open Graph/Facebook, but it’s almost exactly the same code just with different attribute names.