Responsibly Handling a migration of data from PHP app to Rails App (PostgreSQL)

I have recently rebuilt an app that was being run on a MySQL powered PHP CMS. It had a lot of production data, both DB records and files & images associated with many of those records.

I have gotten a SQL dump and pulled out the requisite info into CSV files, and I have downloaded all the files & images to my local drive.

Ideally, what I would like to do is to import those records in the CSV files into my local dev PostgreSQL install and have any files/images associated with that record uploaded to my S3 bucket via Carrierwave (BTW…all of this is setup in a vanilla way, i.e. I can create a record and upload files & images to S3 just fine).

The CSV files have tens of thousands of records.

What is the best approach to import these records locally (keeping created_at and updated_at records accurate - i.e. per the CSV and not per the datetime when it was imported)?

Also, once I have implemented this locally and I have all of this “old data” imported and S3 files uploaded to my dev bucket, how do I approach this for Heroku? Right now, I am running the free package, but I can upgrade with no issues. Should I upgrade to a fullblown PostgreSQL service? Or will the normal $9/mo PGSQL upgrade work?

If you can’t tackle the entire thing, then maybe just giving me guidance on how I can break this down into more manageable steps because it feels a bit overwhelming right now.

Thanks!

I would do the data import with Postgres directly rather than trying to run the data through a Ruby script.

The PG docs on the COPY command are here:

PGAdmin also has a pretty decent import utility if you want to get some GUI sugar on top of the import process:

If your data set is really big, scripting it through Ruby may be time and/or memory constrained.

HTH.

That sounds good for my local copy. What happens with Heroku? How do I then import these stuff to Heroku?

Also, if I don’t use a Ruby script, how do I override the automatic setting of “created_at” and “updated_at” or is that an AR feature and not a PG thing?

@marcamillion, you can connect to your Heroku database the same way you would connect to a local database. The Postgres page on Heroku shows you the connection parameters you’d need to connect via a GUI client, webapp, or command line utility.

If you are going to move these records up to a Heroku instance then my recommendation to try to use PG and to avoid Ruby goes double. ActiveRecord has a lot of memory overhead for each record, and I have run into problems with import scripts for tens of thousands of records running out of memory on Heroku even with attempts to batch the imports into groups, etc.

1 Like

I’d suggest taking a look at http://pgloader.io/ I’ve used it in the past and it’s worked well.

3 Likes

@jondkinney, that tool looks awesome. I like the reject logs it stores.

Thanks for the responses guys. @jondkinney I concur with @geoffharcourt. That tool looks very interesting!

So @geoffharcourt for my Heroku DB, would you suggest I fork my live DB and then import these to that…and do it that way? What’s the best way to do this with zero-downtime? It would be awesome if I could do this on a fork of my production db, and then once I am happy with it, just point the production app to the forked DB and use that as production…but I am not sure how possible/feasible that is with Heroku.

Any thoughts?

Testing against a fork of your production DB (maybe your staging DB?) is a great way to check your work, but if you do the import against a fork and your users are updating other data during the import, swapping in your fork for the original is going to wipe out those changes.

Note that if you’re using the Dev plan at Heroku (the $9/mo plan), you can’t do forks. Forks are reserved for production-tier instances.

Yeh I am using the dev plan, and I am aware re: changes. The only writes we have in the DB are user registrations and creations of posts which moderators do. So while I am importing, I can freeze all writes with no issue.

By production-tier instances you mean I need to upgrade from the free (or 2-dynos) or do I need to get the standalone postgres product that they have? I know they have this - Fully Managed Database as a Service - PostgreSQL | Heroku and the regular tiers for their dynos.

Which did you mean?

You don’t need to have multiple dynos to do the fork, just a Standard-0 or higher DB: Heroku Postgres - Add-ons - Heroku Elements

Ok Thanks!