The 6 Dimensions of Database Speed and Scalability


Nearly every database and query engine on the market claims to be the fastest. If everyone is the fastest, how do you tell which database is the right match for you? While it's extremely rare for any given database to do it all, there is some truth in the claim that every analytics engine is fast and scalable in a particular way. Let's examine the dimensions of speed and scalability in analytical databases.

  1. Data Repository Scalability

Although this may seem like a basic metric, knowing how many terabytes one can load into the repository and still query successfully is important. While most analytics platforms can manage petabytes of data, a few still prescribe a hard limit on how much they can load. For example, Microsoft and Oracle publish limits on hardware (beyond which the database will no longer load data) and database size (beyond which the database shouldn't grow without risking performance). If you grow past these limits, you may have to refactor the database or find other means to store big data.

In contrast, some query engines are built to look at big blocks of data, and the limits on the data pool are far larger than one will ever reach. Such a data lake style of analytics leaves data where it is and, therefore, never requires analysts to load the data or force it to conform to the rigors of a database. Modern databases offer the best of both worlds: they can function as a database for performance and as a query engine for asking questions of large data pools without loading them into the database. This kind of flexibility is always welcome, especially if one is trying to build an architecture that factors in growth and change.

  2. CRUD Operations Speed

Some databases describe scalability in terms of CRUD (create, read, update and delete) operations. Think about a highly dynamic data stream where the application needs to run INSERT, UPDATE and DELETE queries thousands of times per second. Scalability for these databases is about building data infrastructure for real-time applications that require low latency and high throughput. When people talk about scalability, this is often the first metric mentioned.

Historically, databases that handled CRUD operations well were considered distinct from databases that were good at long, complex queries on big data. When designing a database, you can choose whether it is good at fetching and putting away data with the utmost urgency or at taking a couple of extra milliseconds to index the data and place it in the correct shard.

The latter approach makes CRUD slower, but complex queries can find the data more quickly, which means faster analytics. Today, most databases have strategies for both high-speed CRUD and analytics.
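To get a feel for this dimension, a simple throughput probe is often enough. The sketch below assumes a DB-API 2.0 driver (psycopg2 is shown, but any driver works) and a hypothetical crud_probe table; swap in your own connection details and a workload that mirrors your application.

```python
# Minimal CRUD throughput probe. The DSN and the crud_probe table are
# hypothetical; psycopg2 is shown, but any DB-API 2.0 driver works.
import time
import psycopg2

ROWS = 10_000

conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS crud_probe (id INT PRIMARY KEY, payload TEXT)")

start = time.perf_counter()
for i in range(ROWS):
    cur.execute("INSERT INTO crud_probe VALUES (%s, %s)", (i, "x" * 100))
    cur.execute("UPDATE crud_probe SET payload = %s WHERE id = %s", ("y" * 100, i))
    cur.execute("DELETE FROM crud_probe WHERE id = %s", (i,))
conn.commit()
elapsed = time.perf_counter() - start

# Three statements per iteration, so report the combined rate.
print(f"{3 * ROWS / elapsed:,.0f} CRUD operations per second")
```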

  3. Data Loading Speed

This may not be discussed often, but data loading speed (LOAD or COPY operations) is a major factor in database performance. There are major differences among databases in this area. Some databases can use multiple nodes to copy data simultaneously, while others rely on a single, very slow loader. Data loading is a distinct metric from CRUD operations, as some data analytics engines won't perform well at both.

If your architecture is built on the cloud, data loading speed matters even more because copying, loading or moving large amounts of data in the cloud is not cheap. There's a cost benefit to having ephemeral workloads that spin up databases, load data, run reports and spin it all back down to save costs. This simply can't happen if your database is a lazy dog for data loading. When benchmarking during a database selection process, account for data loading in your scalability assessment.
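A quick way to see the gap is to time the database's native bulk loader against row-by-row inserts. This sketch assumes a PostgreSQL-style COPY via psycopg2, a hypothetical three-column events table and an events.csv file; the syntax differs by engine, but the comparison carries over.

```python
# Bulk COPY versus row-by-row INSERTs. Table, file and DSN are placeholders.
import csv
import time
import psycopg2

conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()

# Bulk path: let the database's native loader stream the whole file.
start = time.perf_counter()
with open("events.csv") as f:
    cur.copy_expert("COPY events FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()
print(f"COPY:   {time.perf_counter() - start:.1f}s")

# Row-by-row path for comparison; on most engines this is dramatically slower.
with open("events.csv") as f:
    rows = list(csv.reader(f))[1:]  # skip the header row
start = time.perf_counter()
cur.executemany("INSERT INTO events VALUES (%s, %s, %s)", rows)
conn.commit()
print(f"INSERT: {time.perf_counter() - start:.1f}s")
```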

  4. Single Query Speed

When it comes to performance, you will likely see benchmarking reports heavily optimized for specific conditions. Every solution runs simple queries fast. However, it's the complex queries where everything suddenly falls apart. In the real world, workloads fluctuate all the time, which is why it is advisable to test databases using complex queries against a large dataset from the live system (preferably with caching turned off, because caching can produce unrealistic results).
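A minimal sketch of that kind of test follows, assuming hypothetical sales and products tables and a psycopg2 connection; how you disable result caching is engine-specific and not shown here. The cold first run is the number to watch.

```python
# Time a representative complex query a few times; the cold first run is the
# one to watch. Tables, query and DSN are placeholders.
import time
import psycopg2

COMPLEX_QUERY = """
    SELECT region, product, SUM(amount) AS total
    FROM sales
    JOIN products USING (product_id)
    WHERE sale_date >= DATE '2023-01-01'
    GROUP BY region, product
"""

conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()

for run in range(3):
    start = time.perf_counter()
    cur.execute(COMPLEX_QUERY)
    cur.fetchall()  # force the full result set to come back
    print(f"run {run + 1}: {time.perf_counter() - start:.2f}s")
```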

Some databases offer query acceleration capabilities (analyzing the data to produce the fastest query output) as well as other query optimization features such as predicate pushdown (pushing a query down to where the data lives) and partition pruning (improving query performance by excluding irrelevant partitions and running the query against a smaller data set). When conducting evaluations, businesses should consider the availability of these features since they can come in handy in overcoming performance limitations.
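To check whether partition pruning actually kicks in, inspect the plan of a query that filters on the partition key. The sketch below assumes a sales table partitioned by sale_date and PostgreSQL-flavored EXPLAIN output; other engines expose the same idea with different syntax.

```python
# Does partition pruning kick in? Assumes a sales table partitioned by
# sale_date; EXPLAIN output format varies by engine.
import psycopg2

conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()

cur.execute("""
    EXPLAIN
    SELECT SUM(amount)
    FROM sales
    WHERE sale_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31'
""")

# With a filter on the partition key, the plan should touch only the
# January 2024 partitions instead of scanning every one.
for (line,) in cur.fetchall():
    print(line)
```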

  5. Concurrent Queries Scalability

Speed is more than how fast something can do a single task. There are also concurrency (running multiple jobs, regardless of their order, within a given time interval) and parallelism (running multiple jobs at the exact same instant), both of which need to be kept in mind when testing the speed of database technology.

If a database has workload management features, parallel and concurrent processing of multiple tasks can offer vast improvements in query performance. However, some databases impose a limit on simultaneous queries, which has a downstream impact on speed and performance.

Some databases offer parallelism for multi-threaded processes, but for single-threaded processes they offer no ability to manage multiple queries, which can be a significant disadvantage. When testing a new solution, give it simultaneous queries and experiment with multiple threads, as in the sketch below.
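One simple way to do that is to fire the same query from many threads at once and watch how latency degrades as concurrency climbs. This is only a sketch: the pool size, DSN and query are placeholders, and each worker opens its own connection.

```python
# Concurrency probe: run the same query from many threads simultaneously.
# Each worker opens its own connection because DB-API connections are not
# generally safe to share across threads.
import time
from concurrent.futures import ThreadPoolExecutor
import psycopg2

QUERY = "SELECT COUNT(*) FROM sales WHERE amount > 100"  # placeholder query
CONCURRENCY = 16

def run_once(_):
    conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
    cur = conn.cursor()
    start = time.perf_counter()
    cur.execute(QUERY)
    cur.fetchall()
    conn.close()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(run_once, range(CONCURRENCY)))

print(f"worst latency with {CONCURRENCY} concurrent queries: {max(latencies):.2f}s")
```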

  6. Constrained Hardware Scalability

A database creates a query plan for any given query. That plan describes how the database will access the data to deliver results. Sometimes the plan is efficient and produces fast results; sometimes the planner doesn't quite get it right, bogging down the entire system and backing up the queries behind it. You may be surprised to learn that query planning optimizations aren't part of every modern database. Instead of focusing on nice, efficient queries, some platforms recommend adding more nodes, which translates to more cost. This won't work if there's no more budget to give.

A better option is to look at the time-consuming queries and develop better query plans, new ways to organize the data, improved ways to hold costly aggregates (averages and counts) and other methods for speeding up slow queries without adding more nodes. Be on the lookout for solutions that have these tuning knobs.
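A rough illustration of that tuning loop, assuming a psycopg2 connection and a hypothetical sales table: first ask the planner how it intends to run a slow query, then hold the costly aggregate in a precomputed structure so reports stop recomputing it (materialized-view syntax varies by engine).

```python
# Inspect the plan of a slow query, then precompute its costly aggregate.
# The sales table and region_avg view are illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=bench user=bench")  # placeholder DSN
cur = conn.cursor()

# Step 1: ask the planner how it intends to execute the slow query.
cur.execute("EXPLAIN SELECT region, AVG(amount) FROM sales GROUP BY region")
for (line,) in cur.fetchall():
    print(line)

# Step 2: hold the costly aggregate so reports read a small, precomputed table.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS region_avg AS
    SELECT region, AVG(amount) AS avg_amount, COUNT(*) AS n
    FROM sales
    GROUP BY region
""")
conn.commit()
```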

It's a shame that about 85% of analytics programs fail, not because of the technology but because of how the technology is executed. Carefully assessing the above dimensions of speed and scalability will help organizations build an analytics foundation that not only delivers high performance but is also resilient to rapid growth and sudden changes.

Steven Sarsfield, a director at Vertica, wrote this article. He also produced a popular data governance blog, articles on Medium.com and a book titled "The Data Governance Imperative."

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.