← Back to Upcase

Reimagining analytical data for aggregations

(Pratik Bothra) #1

Hi All

I have seen a lot of examples here which go through advanced queries using active record, but can you share cases where you had to resort to alternatives outside of good old postgres?

I felt even the cases shared here were relatively simple -: https://thoughtbot.com/upcase/videos/advanced-querying-aggregations

For example -: We are a data aggregator and show a combination of data of an advertising system + multiple analytic platform. So our data is currently stored in multiple tables and such large joins at runtime are extremely expensive. A quick example would be - spend, clicks and impression data is stored in a separate table, and revenue data is stored based on the analytics platform.

So a typical user query is show campaigns where spend < 1000, and revenue > 1000 between 10th Feb and 20th Feb, or something like show revenue/spend > 1 and this could almost be any combination of ratio he wants. Summing across two very large tables (50 million rows +) , joining stuff is something that postgres doesn’t really shine in - even with introduction of jsonb etc in 9.5.

Currently, I’m working on alternate solution where we store data at hourly level in postgres, and we routinely write it to mongodb where we aggregate at daily level. I have seen a latency drop of 40s, to around 5s with reimagining how my data looks like.

Please note: I’m not dissing postgres. Its so easy to mess up your entire data in things like mongo, its almost unbelievable. What in fact, I’m driving at, is I haven’t seen really great support on this site for anything nosql. Expect postgres, memcache, redis, does thoughbot take on such exercises where they had to use such alternate solutions? What’s been your experience like?

I was recently talking to someone else, and he said my solution may not even last the year, and you would need to potential refactor to kafka and storm later on.

Have you had to use things like Kafka for message queues, storm etc for distributed processing? Have you had experiences with Mongo/Couchbases/Cassandra etc, and when would you recommend them?