Optimizing SQL Queries in Postgres

Often when talking about optimizing SQL in web applications we focus on application-layer concerns like N+1 queries, but occasionally we'll need to dive down into the database layer itself to optimize our queries. In this Weekly Iteration, thoughtbot CTO Joe Ferris takes us on a tour of how to work with Postgres to understand and optimize the structure of our database and queries, cleaning up the last mile of database interaction performance concerns.

This is a companion discussion topic for the original entry at https://thoughtbot.com/upcase/videos/optimizing-sql-queries-in-postgres

Thanks for the awesome video! I was just wondering what is a good/safe way to get some production data out of a production database, if we don’t want a full backup of all the data?

Hey Kira, glad you enjoyed the video. Personally I tend to pull the full Upcase production DB regularly using Parity (I sum up this approach in the Parity section of the Heroku Weekly Iteration). If that's an option for you, I highly recommend it, as it gives a real picture of production behavior. At a minimum, perhaps you could back up production → staging and then tinker there (after some local experimentation).

If that is not an option, you might consider building a script to generate the data. We have an example of this in the dev:prime task in Upcase, which uses FactoryGirl’s methods to aid in building structured data (with a few helper methods).
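To make the "generate your own data" idea concrete, here's a minimal plain-Ruby sketch. This is not the actual Upcase `dev:prime` task — the helper names and the `User` shape are made up for illustration — but it shows the pattern: small factory-style helpers that build a structured spread of records (new users, mid-progress users, power users) so queries against the seeded data exercise realistic distributions.

```ruby
# Hypothetical sketch of a dev:prime-style seeding script. In a real Rails
# app these helpers would wrap FactoryGirl/factory_bot and persist records.
require "securerandom"

# Stand-in for a persisted record; in a Rails app this would be an
# ActiveRecord model created via a factory.
User = Struct.new(:name, :email, :trails_completed)

def create_user(name:, trails_completed: 0)
  # Unique-ish email per run so repeated seeding doesn't collide.
  email = "#{name.downcase.gsub(' ', '.')}-#{SecureRandom.hex(2)}@example.com"
  User.new(name, email, trails_completed)
end

# Build a deliberate spread of data rather than N identical rows.
def prime_users
  [
    create_user(name: "New Newcomer"),
    create_user(name: "Mid Member", trails_completed: 3),
    create_user(name: "Power Pat", trails_completed: 12),
  ]
end

puts prime_users.map(&:email)
```

The key design point is the same as in `dev:prime`: a handful of helpers with sensible defaults, called a few times with varying arguments, beats dumping thousands of uniform rows.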

I’ve also heard of folks applying an automated anonymization script to work around compliance / privacy concerns. With this, you’d work from a copy of the production data set, but scramble any identifying data, for example replacing names with “Jane M Doe”. I don’t have any solid examples of this that I can point you to, but wanted to mention it as a third option.
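As a rough sketch of that third option, here's what a row-level anonymizer might look like in plain Ruby. The row shape and field names are invented for illustration — a real script would run over your actual tables (often as SQL `UPDATE`s) — but the idea carries over: scrub identifying fields while keeping non-sensitive ones, and derive stable pseudonyms from the id so foreign-key relationships and per-user query patterns survive.

```ruby
# Hypothetical anonymization pass: copy production rows, then scramble
# anything identifying before handing the data to developers.
require "digest"

def anonymize(row)
  # Hash the id to get a stable, non-reversible per-user suffix, so the
  # same user anonymizes the same way every time.
  suffix = Digest::SHA256.hexdigest(row[:id].to_s)[0, 8]
  row.merge(
    name:  "Jane M Doe",
    email: "user-#{suffix}@example.com"
  )
end

rows = [
  { id: 1, name: "Real Person",   email: "real@person.com",  plan: "pro"  },
  { id: 2, name: "Another Human", email: "another@human.net", plan: "free" },
]

puts rows.map { |r| anonymize(r) }
```

Note that non-identifying columns (`plan` here) pass through untouched, which is what keeps the anonymized copy useful for query optimization work.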


@christoomey @jferris
Baller episode, thanks for putting this together.

@christoomey, @jferris - When would you want to add a NoSQL db to the mix?

I would add another database only when there’s a production issue that can’t be reasonably fixed in your primary database and that you’re sure will be fixed by introducing a new tool.

The cost of having two databases is higher than is generally appreciated:

  • Transactions are taken for granted in day-to-day development, but ACID guarantees no longer exist when you’re sharing data between two different stores.
  • Taking backups, snapshots, and local copies of data is much more difficult with multiple databases.
  • If you’re using a secondary database as a cache, now you need to solve one of computer science’s hardest problems! (Cache invalidation)
  • If you’re using a schemaless database, you now need to enforce, optimize, and document your schema by yourself, without aid from the database.

I would suggest this litmus test: if you’re not sure whether or not to introduce another database, stick with Postgres.