

We here at Mode Analytics have been Amazon Redshift users for about 4 years. Redshift enables and optimizes complex analytical SQL queries, all while being linearly scalable and fully managed within our existing AWS ecosystem. In our early searches for a data warehouse, these factors made choosing Redshift a no-brainer. Redshift has satisfied the majority of our analytical needs for the past few years, but recently we began to notice a looming issue. Certain data sources stored in our Redshift cluster were growing at an unsustainable rate, and we were consistently running out of storage resources.

Running Off a Horizontal Cliff

After a brief investigation, we determined that one specific dataset was the root of our problem. The dataset in question stores all event-level data for our application. This type of dataset is a common culprit among quickly growing startups. We store relevant event-level information, such as the event name, the user performing the event, and the URL on which the event took place, for just about every event that takes place in the Mode app. As our user base grew, the volume of this data began growing exponentially. By the start of 2017, it had already grown to over 10 billion rows.

In most cases, the solution to this problem would be trivial: simply add machines to our cluster to accommodate the growing volume of data. We hit an inflection point, however, where the volume of data was growing at such a rate that scaling horizontally by adding machines to our Redshift cluster was no longer technically or financially sustainable. To add insult to injury, a majority of the event data being stored was not even being queried often. It simply didn't make sense to linearly scale our Redshift cluster to accommodate an exponentially growing, but seldom-utilized, dataset.
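
To make the mismatch between linear scaling and exponential growth in this post concrete, here is a minimal sketch. Only the starting figure of 10 billion rows comes from the post. The per-node capacity, the yearly growth factor, and the function names are hypothetical values chosen purely for illustration, not Mode's actual numbers.

```python
import math

START_ROWS = 10_000_000_000    # from the post: over 10 billion rows at the start of 2017
ROWS_PER_NODE = 2_000_000_000  # hypothetical per-node capacity, for illustration only
YEARLY_GROWTH = 2.0            # hypothetical growth factor: the dataset doubles each year


def nodes_needed(years_from_start: int) -> int:
    """Nodes required to hold the dataset after the given number of years."""
    rows = START_ROWS * YEARLY_GROWTH ** years_from_start
    return math.ceil(rows / ROWS_PER_NODE)


def projection(years: int) -> list[tuple[int, int]]:
    """(year offset, nodes needed) pairs from year 0 through `years`."""
    return [(year, nodes_needed(year)) for year in range(years + 1)]


if __name__ == "__main__":
    for year, nodes in projection(4):
        print(f"year +{year}: {nodes} nodes")
```

Under these assumed numbers, the node count (and with it the bill) doubles every year, even though most of the stored rows are rarely queried.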
