Web App plus Analytics Engine: 2 databases needed?

Situation: Web App Plus Analytics

Let’s suppose a system requires 2 database because there 2 components written by different teams:

  1. “Rails”, a web app, which displays output data from part 2.
  2. “Analytics”, a processing engine written in say Python, which has huge daily import data volumes and lots of number crunching, and will probably run on EC2 instances.

However, there are some tables/data shared between the systems. These are relatively modest in size.

What’s the best way to manage this situation of some data is shared.

My guesses are:

  1. 2 databases is definitely required, as opposed to one giant database, even though a number of tables are in common.
  2. Communication between the two systems should be via messaging, such as RabbitMQ.
  3. There be no sharing of databases across systems. See http://forum.thoughtbot.com/t/api-or-shared-database/1675
  4. Synchronization of the data of the common tables will be via messaging.

I’m just wondering if hand built messaging for synchronization of the shared data is overkill? It seems like there should be something simpler as this problem feels rather generic.

Questions:

  1. Use Heroku to manage the non-rails database?
  2. Assuming 2 separate databases
  3. Any way to manage schema migrations used by both systems?
  4. Any automated way to handle synchronization of some tables?

Why do you need two databases? Why no use something non relational that is built to handle that amount of data?

Any recommendations alternative DBs for handling large amounts of financial markets data?

Do you need strict consistency or is eventual good enough? It also depends if you’re mostly doing reads or writes.

Good general solutions are HBase, Cassandra or Riak. You have a pretty comprehensive list of solutions here.

It seems like a relatively common solution is to use synchronization or replication of a Postgres database, with the options for Postgres listed in this wiki article.

I’m leaning toward this solution:

  1. Rails has one Postgres DB, call it Rails DB.
  2. OLAP team can replicate/synchronize either certain tables or the whole Rails DB.
  3. OLAP team has own DB for whatever they need.
  4. Results of OLAP are sent back to Rails App via Rabbit MQ.
  5. Rails team only needs to communicate any changes to the tables used by the OLAP team.
  6. Rails team can publish messages for the OLAP team when certain data changes.

If the data is truly bidirectional then prepare for a world of pain. If each system needs only read-only access to data unique to the other, consider scheduled imports, API access, or database links with appropriate read only permissions (the specifics of which depend on your DB). In the case of your rails app, in can actually be configured to connect to several different databases.

I’d try very hard to avoid a situation where the data can be updated in multiple databases and needs to be synced between them. There are domains where this is necessary, but that complexity is best avoided if you can help it.

I’m going to keep the Rails DB pristine. I won’t know if the other team directly accesses the DB, synchronizes it, etc. The other team will be required to send Rails messages via RabbitMQ.

I’ll publish messages for the other team, as they require it.

There definitely are some common tables of information, and making one app is a far worst proposition, especially given the other team wants to use Python.

The Thoughtbot analytics video recommends having a separate analytics database. See: https://thoughtbot.com/upcase/videos/4-common-techniques-a-b-testing-funnels-cohort-analysis