Scaling a Data Platform for Growth

American food delivery service DoorDash has millions of consumers and relies on ML models that work with billions of records. So how does it scale its data platform to meet growing demand and delight customers?

A blog post by Sudhir Tonse, Director of Engineering at DoorDash, outlined some of the challenges that large and growing data-driven organizations are likely to face. The 5,000-word article stands out for its level of detail.

We summarize four key pointers below.

Collecting data for analytics

Though it is common for organizations to start off tapping their transactional databases directly for their analytical needs, this approach will not scale in the long term, cautioned Tonse.

Data should hence be extracted from the various transactional databases and loaded into a data warehouse system. The ideal data analytics stack should also be put together by examining the OLTP solutions currently deployed and the attributes of the data in use.
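As a rough illustration of the extract-and-load step, the sketch below copies new rows from a transactional database into a separate warehouse table, tracking a high-water mark so that each run picks up only what was added since the last one. It is a minimal sketch and not DoorDash's implementation: the in-memory SQLite databases stand in for real systems, and the table names, columns, and the extract_and_load function are hypothetical.

```python
import sqlite3


def extract_and_load(oltp, warehouse, last_id):
    """Copy orders with id > last_id into the warehouse; return the new watermark."""
    rows = oltp.execute(
        "SELECT id, customer, total FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO orders_fact (id, customer, total) VALUES (?, ?, ?)", rows
    )
    warehouse.commit()
    # If nothing new was found, the watermark stays where it was.
    return rows[-1][0] if rows else last_id


# Hypothetical transactional (OLTP) database with two orders.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, "ann", 12.5), (2, "bob", 30.0)]
)
oltp.commit()

# Hypothetical warehouse, kept separate so analytics never query the OLTP store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders_fact (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)

watermark = extract_and_load(oltp, warehouse, 0)
```

Analytical queries then run against orders_fact in the warehouse, leaving the transactional database free to serve the application.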

Prioritizing the right things

Tonse says defining the data charter of the organization and its responsibilities is paramount to establishing a clear vision. The result is the ability to prioritize requests based on shared goals – and to better support the art of saying “no”.

Unless it proactively anticipates the needs of internal customers, the data team will be perpetually scrambling and succumbing to the pressure of one-off solutions that culminate in “technical debt”.

Scaling beyond petabytes

With billions of messages processed per day, scalability is an ever-present challenge. DoorDash currently relies on Postgres, Apache Cassandra, ElasticSearch, and “a few other” data storage systems. Transactional data are transferred into the data warehouse for analytics using CDC (Change Data Capture) pipelines, though Tonse admits that it is difficult to find a scalable generalized CDC solution that meets every need – and this contributes to the overall complexity of building the pipelines.
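To make the idea of change data capture concrete, the sketch below replays a stream of change events against a replica table so that it converges on the state of the source. It is a simplified illustration only: the event format (op, key, row) and the function names are assumptions, and a production pipeline would read events from a database log or message queue and not from a Python list.

```python
def apply_change(table, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        # Treat both as an upsert so replaying an event twice is harmless.
        table[key] = event["row"]
    elif op == "delete":
        table.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(events):
    """Build a replica by applying events in the order they were captured."""
    table = {}
    for event in events:
        apply_change(table, event)
    return table


# Hypothetical change events captured from an orders table.
events = [
    {"op": "insert", "key": 1, "row": {"status": "placed"}},
    {"op": "insert", "key": 2, "row": {"status": "placed"}},
    {"op": "update", "key": 1, "row": {"status": "delivered"}},
    {"op": "delete", "key": 2},
]

replica = replay(events)
```

Even in this toy form, the pipeline depends on event ordering and on every source emitting changes in a common shape, which hints at why a generalized solution is hard to build.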

Keeping a lid on costs

Storage, compute, and licensing costs add up and must be managed. DoorDash relies on a combination of public cloud, vendor solutions, and solutions built in-house. AWS is used for EC2 compute resources and other requirements, while multiple data vendors help to address other data needs. Tonse is not against building a solution in-house, provided it is cost-effective and can introduce efficiencies.

Finally, Tonse notes that the consistency of data and the quality of output matter more than availability. This starts with the ability to detect and catch data quality problems as early as possible, and it necessitates investing in monitoring tools to catch errors – backfilling large data processing pipelines can be expensive.
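One way to catch problems early is to validate each batch before it enters the pipeline. The sketch below is a minimal example under assumed rules: the field names, the three checks (missing values, negative totals, duplicate IDs), and the find_problems function are hypothetical and do not come from the original post.

```python
def find_problems(rows, required=("order_id", "total")):
    """Return (row_index, description) pairs for rows that fail basic checks."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                problems.append((index, f"missing {field}"))
        total = row.get("total")
        if total is not None and total < 0:
            problems.append((index, "negative total"))
        order_id = row.get("order_id")
        if order_id is not None:
            if order_id in seen_ids:
                problems.append((index, "duplicate order_id"))
            seen_ids.add(order_id)
    return problems


# Hypothetical batch containing three bad rows.
batch = [
    {"order_id": 1, "total": 12.5},
    {"order_id": 1, "total": 8.0},
    {"order_id": 2, "total": -3.0},
    {"order_id": None, "total": 4.0},
]

problems = find_problems(batch)
```

A batch that returns a non-empty list can be held back and flagged before it is loaded, which is far cheaper than backfilling downstream tables later.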
