Parallelizing 3rd Party API Calls in Long-Running Jobs

I’m working on a Rails 5 application that has long-running jobs (several hours) where I call various 3rd party APIs and store the results into a Postgres database. I’m trying to minimize the time it takes for this process to complete, and would like to parallelize it. Two questions:

  1. What are the best ways to avoid blocking I/O in a Rails application? I’ve had some success using thread pools from the concurrent-ruby gem, but I still run into situations where I run out of threads, and the solution feels hacky from a Rails perspective because it requires wrapping lots of logic in Rails.application.executor.wrap. I’ve also looked into EventMachine and Phusion Passenger, which claim to provide “evented” server characteristics, but I’m not certain about their downsides. Finally, I’ve heard that blocking I/O is just a way of life in Ruby, and that I should instead spin up more production instances and distribute the work between them to compensate for the time lost to blocking I/O.

  2. What are some canonical ways to split long-running jobs across multiple machines, and where can I learn more about DevOps in general? I’m finding that learning web app development is becoming very easy, but learning DevOps is extremely difficult. For context, I’m using Azure.
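For what it’s worth, here is a minimal sketch of the thread-pool approach from (1) using only the Ruby standard library; concurrent-ruby’s `Concurrent::FixedThreadPool` gives you the same idea with more safety features, and inside a Rails app you would still wrap the work each thread does in `Rails.application.executor.wrap`. The block passed in stands in for a real API call:

```ruby
# A bounded worker pool for I/O-heavy calls. Capping the thread count avoids
# "running out of threads" because only `workers` threads ever exist.
def parallel_map(items, workers: 8)
  queue = Queue.new
  items.each_with_index { |item, i| queue << [item, i] }
  results = Array.new(items.size)
  threads = Array.new(workers) do
    Thread.new do
      loop do
        item, i = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break # queue drained, this worker is done
        end
        # MRI releases the GVL during blocking I/O, so threads genuinely
        # overlap while waiting on the network.
        results[i] = yield(item)
      end
    end
  end
  threads.each(&:join)
  results
end

# Usage (fetch_url is hypothetical):
#   bodies = parallel_map(urls, workers: 10) { |url| fetch_url(url) }
```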

Thanks!


Generally, I’ve found it easy to off-load the actual “separate processes” or “threading” work to a job processing gem like:

delayed_job (https://github.com/collectiveidea/delayed_job), a database-backed asynchronous priority queue extracted from Shopify, or Sidekiq (https://github.com/mperham/sidekiq), which provides simple, efficient background processing for Ruby.

Then, all you have to do is break your large long-running job into a bunch of little jobs that can run in parallel.
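The usual shape of this is a fan-out: one parent job slices the work and enqueues a small job per slice, and workers on any number of machines pull those jobs off the shared queue. A rough sketch of the pattern (all class names are illustrative; in a real app each class would `include Sidekiq::Worker` and `perform_async` would serialize the arguments into Redis, so here an inline stub stands in to keep the sketch self-contained):

```ruby
# Stub that runs jobs inline; real Sidekiq would enqueue to Redis instead.
module InlineEnqueue
  def perform_async(*args)
    new.perform(*args)
  end
end

class FetchBatchJob
  extend InlineEnqueue

  def perform(ids)
    # Stand-in for calling the 3rd party API per id and saving a row.
    ids.map { |id| id * 10 }
  end
end

class ParentJob
  extend InlineEnqueue

  def perform(all_ids, batch_size)
    # One small, independently retryable job per batch. Any worker process,
    # on any machine, can pick these up in parallel.
    all_ids.each_slice(batch_size).map { |batch| FetchBatchJob.perform_async(batch) }
  end
end
```

Keeping each small job idempotent (safe to run twice) matters here, because queue backends retry failed jobs, and with hours-long work you will eventually see retries.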