Integrating Rails with the Scrapy Python Web Scraper

Does anybody have any experience integrating a Rails app with a web scraper? It seems like the Python Scrapy project is far more advanced than anything in Ruby, and I already have some legacy code for it, so it would make sense to keep it. The Rails app runs on Heroku, and the Scrapy app can run on AWS. So here are my questions:

  1. Does it make sense to have the Scrapy app directly modify the Heroku Rails database? I can imagine using a REST API or maybe RabbitMQ to send the data, but what would be the gain?

  2. Assuming that the Scrapy app will directly access the Heroku Rails database, would it make sense to let Rails handle all DB migrations and, if so, to keep the Scrapy Python code in the same git repository, say under a directory called python/scrapy? It’s a tiny amount of code.

Does my proposed “architecture” make sense?
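
For comparison, here is a rough sketch of what the REST alternative from question 1 might look like on the Scrapy side. It is only an illustration under assumptions: the endpoint URL, the token, the `scraped_item` JSON wrapper, and the names `build_request` and `RailsApiPipeline` are all hypothetical and would have to match whatever the Rails app actually exposes. Scrapy item pipelines are plain classes with a `process_item` method, so nothing here needs a Scrapy import.

```python
import json
import urllib.request

# Hypothetical endpoint and token; neither comes from a real app.
API_URL = "https://example-rails-app.herokuapp.com/api/scraped_items"
API_TOKEN = "change-me"


def build_request(item, url=API_URL, token=API_TOKEN):
    """Build a JSON POST request for one scraped item (no network I/O)."""
    # dict(item) works for both plain dicts and Scrapy Item objects.
    body = json.dumps({"scraped_item": dict(item)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


class RailsApiPipeline:
    """Item pipeline that hands each item to the Rails app over HTTP."""

    def process_item(self, item, spider):
        # Rails validates and saves the record through ActiveRecord,
        # so the scraper never touches the schema directly.
        # urlopen blocks and raises on HTTP errors; fine for a sketch,
        # but a real crawl would want retries and non-blocking I/O.
        with urllib.request.urlopen(build_request(item), timeout=10) as resp:
            spider.logger.info("Rails responded with HTTP %s", resp.status)
        return item
```

The pipeline would be switched on through Scrapy's `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.RailsApiPipeline": 300}`, where the module path is again a placeholder. The trade-off against direct database writes is an extra HTTP hop per item in exchange for the Rails models staying the only code that writes to the tables.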

I would attempt to do all of this in Ruby if you’re going to have a Rails/Heroku/Postgres setup. Having a non-ActiveRecord piece of software modifying an ActiveRecord-built database is asking for trouble.

Here’s a good blog post about building a quick scraper using Nokogiri: http://ruby.elevatedintel.com/blog/screen-scraping-with-a-saw-a-nokogiri-tutorial-with-examples/

I’ve done this a bunch with Nokogiri (and its predecessor, Hpricot), and it gets the job done well.

This project is a little rusty, but we’ve used it to scrape legacy sites when transitioning to a new CMS, making sure we’ve caught every page. It spiders the site for you, and its foundation is Nokogiri: