Often when talking about optimizing SQL in web applications we focus on
application-layer concerns like N+1 queries, but occasionally
we'll need to dive down into the database layer itself to optimize our
queries. In this Weekly Iteration, thoughtbot CTO Joe Ferris takes us on a tour
of how to work with Postgres to understand and optimize the structure of our
database and queries, cleaning up the last mile of database performance
concerns.
Thanks for the awesome video! I was just wondering: what is a good/safe way to get some production data out of a production database if we don't want a full backup of all the data?
Hey Kira, glad you enjoyed the video. Personally I tend to pull the full Upcase production DB regularly using Parity (I sum up this approach in the Parity section of the Heroku Weekly Iteration). If this is an option, I highly recommend it, as it gives you a real picture of the data. At a minimum, perhaps you could back up production → staging and then tinker there (after some local experimentation).
If that is not an option, you might consider building a script to generate the data. We have an example of this in the dev:prime task in Upcase, which uses FactoryGirl's methods to aid in building structured data (with a few helper methods).
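For illustration, here's a minimal sketch of the shape such a task might take. This isn't the actual Upcase task; the :user and :post factories and their attributes are hypothetical, so swap in your own:

```ruby
# lib/tasks/dev.rake: a hypothetical data-priming task.
# Assumes FactoryGirl factories named :user and :post are defined.
namespace :dev do
  desc "Seed the development database with structured sample data"
  task prime: :environment do
    5.times do |i|
      user = FactoryGirl.create(:user, email: "user#{i}@example.com")
      FactoryGirl.create_list(:post, 3, user: user)
    end
  end
end
```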
I've also heard of folks applying an automated anonymization script to work around compliance / privacy concerns. With this, you'd work from a copy of the production data set, but scramble any identifying data, for example replacing names with "Jane M Doe". I don't have any solid examples of this that I can point you to, but wanted to point it out as a third option.
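As a rough sketch of the idea (the User model and its column names are assumptions here; adapt them to your schema, and only ever run something like this against a copy of production):

```ruby
# Hypothetical anonymization pass. Run only against a COPY of production data.
# update_columns skips validations and callbacks, which suits a bulk scramble.
User.find_each do |user|
  user.update_columns(
    name:  "Jane M Doe",
    email: "user-#{user.id}@example.com"
  )
end
```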
I would add another database only when there's a production issue that can't be reasonably fixed in your primary database and that you're sure will be fixed by introducing a new tool.
The cost of having two databases is higher than is generally appreciated:
Transactions are taken for granted in day-to-day development, but ACID guarantees no longer exist when you're sharing data between two different stores (see the sketch after this list).
Taking backups, snapshots, and local copies of data is much more difficult with multiple databases.
If you're using a secondary database as a cache, you now need to solve one of computer science's hardest problems: cache invalidation!
If you're using a schemaless database, you now need to enforce, optimize, and document your schema by yourself, without aid from the database.
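To make the transaction point concrete, here's a hypothetical Rails-flavored sketch of how two stores can drift apart (the Order model, the $redis client, and the key name are all made up for illustration):

```ruby
# Writes to Postgres are covered by the transaction; writes to a second
# store are not, so a rollback leaves the two stores inconsistent.
ActiveRecord::Base.transaction do
  Order.create!(total: 100)
  $redis.incr("orders:count")   # takes effect immediately, outside the transaction
  raise ActiveRecord::Rollback  # the Order row is rolled back...
end
# ...but the Redis counter stays incremented: the two stores now disagree.
```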
I would suggest this litmus test: if you're not sure whether or not to introduce another database, stick with Postgres.