Hey Kira, glad you enjoyed the video. Personally, I tend to pull the full Upcase production DB regularly using Parity (I sum up this approach in the Parity section of the Heroku Weekly Iteration). If this is an option, I highly recommend it, since it gives you a real picture of your data. At a minimum, perhaps you could back up production, restore it to staging, and then tinker there (after some local experimentation).
If that is not an option, you might consider building a script to generate the data. We have an example of this in the dev:prime task in Upcase, which uses FactoryGirl’s methods to aid in building structured data (with a few helper methods).
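To make the idea concrete, here is a minimal plain-Ruby sketch of a data-generation script. This is not the Upcase dev:prime task and it does not use FactoryGirl; the `User` and `Subscription` structs, the plan names, and `build_sample_data` are all hypothetical stand-ins for whatever models and factories your app has.

```ruby
# Hypothetical records; a real app would use its own models or factories.
User = Struct.new(:id, :name, :email, :plan)
Subscription = Struct.new(:user_id, :plan, :active)

# Hypothetical plan names, for illustration only.
PLANS = %w[free pro team].freeze

# Builds related users and subscriptions. The seed makes runs repeatable,
# so everyone on the team gets the same data set.
def build_sample_data(user_count, seed: 42)
  rng = Random.new(seed)

  users = (1..user_count).map do |id|
    User.new(id, "User #{id}", "user#{id}@example.com", PLANS.sample(random: rng))
  end

  # Only paying users get a subscription; roughly 80% of those are active.
  subscriptions = users.reject { |user| user.plan == "free" }.map do |user|
    Subscription.new(user.id, user.plan, rng.rand < 0.8)
  end

  { users: users, subscriptions: subscriptions }
end

data = build_sample_data(50)
```

In a Rails app you would likely wrap something like this in a rake task and create real records instead of structs.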
I’ve also heard of folks applying an automated anonymization script to work around compliance / privacy concerns. With this, you’d work from a copy of the production data set, but scramble any identifying data, for example replacing names with “Jane M Doe”. I don’t have any solid examples of this that I can point you to, but wanted to mention it as a third option.
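Since I don’t have a real example handy, here is a rough plain-Ruby sketch of what the scrambling step might look like. The row shape (`id`, `name`, `email`, `plan`) and the `anonymize` helper are assumptions for illustration, not something taken from an existing tool.

```ruby
# Returns a copy of the row with identifying fields replaced by placeholders.
# The id and non-identifying columns are kept so associations and the overall
# shape of the data stay intact.
def anonymize(row)
  row.merge(
    name: "Jane M Doe",
    # Derive the email from the id so it stays unique per row.
    email: "user-#{row.fetch(:id)}@example.com"
  )
end

# Hypothetical rows standing in for a copy of production data.
rows = [
  { id: 1, name: "Real Person", email: "real@company.com", plan: "pro" },
  { id: 2, name: "Other Person", email: "other@company.com", plan: "free" }
]

scrubbed = rows.map { |row| anonymize(row) }
```

In practice you would run something like this against the restored copy before anyone works with it, and you would need to audit every table for identifying columns, not only the obvious ones.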