Reengineering Data Engineering for GenAI

Image credit: iStockphoto/kentoh

Kinetica never knew it would be in this position.

When the real-time database company started, generative AI (GenAI) was still a niche, if curious, science project. The company instead focused on easing the workload of data engineers and helping smaller companies bridge their data engineering talent gaps.

“We really acted as the speed layer for a lakehouse or data lake environment, specializing in dealing with real-time analytics across a variety of disciplines,” Nima Negahban, chief executive officer and co-founder of Kinetica, says proudly.

Its product is a vectorized database engine built to run on many-core devices such as GPUs. The company saw a need and went for it.

It was no mean feat; Kinetica started from scratch. "We thought about the algorithms and the data structures, how to leverage many-core devices and being able to continuously insert into an ever-growing dataset and then do complex ad hoc query," says Negahban.

Then OpenAI set GenAI loose on the world. Consumers and business users saw potential; data engineers groaned; Kinetica was left with a distinct advantage.

The data engineering conundrum

If you think data scientists are in demand, consider the data engineer. After all, in classical analytics, a data scientist only starts work once the data engineer has delivered the data in the correct format.

This is a very static image of data analytics. But it also meant that many analytics projects were pre-planned.

"99% of [analytics] is pre-planned, pre-canned dashboards and reports. You write queries that leverage specific schemas. Then, you've got tons of data engineering that goes on to make sure of things like performance," says Negahban.

As the number of data pipelines (processes that ingest raw data from various sources and deliver it to a data lake or lakehouse) grew, so did the workload on data engineers, who were few in number. And when analytics performance sputtered, the data scientists pointed their fingers at the data engineers.

So, Kinetica designed a vectorized database engine that eased the data engineering workload. For smaller data science teams at smaller enterprises, such engines helped to speed up analytics processing.

Then GenAI ratcheted up the pressure on data engineers, making it "untenable" for them to build pipelines for a growing number of queries. That's when Kinetica saw a significant opportunity and seized it.

"It leveraged all of our different analytic disciplines on the fly. So, we think we're really well fit for the generative AI where we can do real-time vector search," says Negahban.

Generating pipelines on the fly

Kinetica did more than position its database for GenAI data engineering. It also integrated a ChatGPT-like interface into its vector database, natively handling queries in natural language and converting them to SQL on the fly.

According to Negahban, this allows analytics users to run chain-of-thought workflows with GenAI in real time, much as today's ChatGPT users refine their requests over several prompts. It also lets users break down complex queries into smaller ones in real time.

He notes how one company used Kinetica's ChatGPT frontend to let analytics users rapidly run complex queries in natural language without SQL coding knowledge.

Meanwhile, the vectorized engine reduces the need to do a lot of data engineering. "You can ingest the data in its natural schema. You don't have to do roll-ups or normalization. And then leverage Kinetica as a vectorized engine to kind of do brute force computation on the fly," explains Negahban.

This simplifies data engineering "because you can continuously land the data in Kinetica and do ad hoc queries. You don't have to do so much pre-planning of data engineering pipelines where you're essentially doing some of the work of the database for the database."
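The idea of landing data as-is and answering unplanned questions at query time can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in; it is not Kinetica's API, and the table, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw events, landed as-is with no roll-up or normalization step.
RAW_EVENTS = [
    ("2024-01-01T10:00:00", "sensor-a", "north", 21.5),
    ("2024-01-01T10:00:05", "sensor-b", "south", 19.0),
    ("2024-01-01T10:00:10", "sensor-a", "north", 22.5),
    ("2024-01-01T10:00:15", "sensor-c", "south", 18.0),
]


def land_events(conn, events):
    """Append raw events to a single wide table in their original shape."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(ts TEXT, device TEXT, region TEXT, reading REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)


def ad_hoc_average_by(conn, column):
    """Answer an unplanned question by scanning the raw rows at query time."""
    # The column name is interpolated into SQL, so restrict it to known names.
    if column not in {"device", "region"}:
        raise ValueError(f"unsupported grouping column: {column}")
    rows = conn.execute(
        f"SELECT {column}, AVG(reading) FROM events "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
land_events(conn, RAW_EVENTS)

# Two different questions, neither backed by a pre-built aggregate table.
by_region = ad_hoc_average_by(conn, "region")
by_device = ad_hoc_average_by(conn, "device")
```

Neither grouping needed its own pipeline or summary table. The trade-off is that every query scans the raw rows, which is the brute-force work Negahban says a vectorized engine is meant to absorb.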

Another advantage is the way Kinetica is designed to make optimal use of GPU clusters. "Most databases can't, but we can uniquely take advantage of it. And that kind of goes back to having a vectorized engine that you know can really harness the power of many-core devices," Negahban claims.

Kinetica does offer a CPU-only flavor. However, with GPU prices on a tear, the company’s ability to optimize how it processes the data set can help reduce AI-related processing budgets and allow more companies to start their GenAI journeys.

Kinetica follows the current trend of using tiered storage for processing. The vectorized database engine stores data in fixed-size blocks (vectors) that can be processed in parallel, using GPUs when available.
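As a rough illustration of block-wise processing, the sketch below cuts one column into fixed-size blocks, runs the same small kernel over each block, and combines the partial results. It is a conceptual toy in plain Python, not Kinetica's implementation: the block size, function names, and data are assumptions, and Python threads merely stand in for the parallel lanes of a many-core device.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # hypothetical block length; real engines size blocks for the hardware


def split_into_blocks(column, block_size=BLOCK_SIZE):
    """Cut one column of values into fixed-size blocks (the last may be shorter)."""
    return [column[i:i + block_size] for i in range(0, len(column), block_size)]


def block_sum_over_threshold(block, threshold):
    """Per-block kernel: sum the values in this block that exceed the threshold."""
    return sum(value for value in block if value > threshold)


def parallel_filtered_sum(column, threshold, workers=4):
    """Apply the kernel to every block concurrently, then combine partial sums."""
    blocks = split_into_blocks(column)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda block: block_sum_over_threshold(block, threshold), blocks)
        )
    return sum(partials)


readings = [3, 9, 1, 7, 5, 8, 2, 6, 4, 10]
total = parallel_filtered_sum(readings, threshold=5)  # sums 9, 7, 8, 6 and 10
```

Because each block is independent, the same kernel can be handed to whichever processor is available, which is what makes the layout a natural fit for a GPU used as a coprocessor.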

“We are not fitting everything in your data set into a RAM or VRAM of the GPU. We are using it much more as a coprocessor to accelerate the operation,” observes Negahban.

Of course, Kinetica has its challenges. As a streaming database, it is designed to receive new data as it is generated with no queuing before loading. But unlike conventional analytic databases that use batches or micro-batches, it is not fully compliant with atomicity, consistency, isolation and durability (ACID) principles.

Yet, Negahban believes companies will use Kinetica as they do not want data engineering to be a bottleneck.

Changing the DataOps focus

So, where does this leave the data engineer?

Creating data pipelines for every possibility will not be "tenable" in a GenAI world. So, Kinetica's features reduce the stress on "classical" data engineering workloads.

“You're going to have large language model (LLM) agents or user interactions that are being driven by LLM copilots that are generating queries that no one planned for. And, they're going to expect performance there. So, trying to build a pipeline for each one of these (queries) is really not a sustainable approach,” says Negahban.

Negahban still thinks we will need data engineers, but not for the tasks that occupy most of their time today. One central area of focus is vector embeddings. These represent different data types as points in a multidimensional space, with similar data points clustered together. Such embeddings allow machines to capture meanings and relationships, which are vital in GenAI models.
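A toy example may help make "similar points sit close together" concrete. The three-dimensional vectors below are made up for illustration (real embeddings come from a model and typically have hundreds of dimensions), and cosine similarity is just one common way to measure closeness.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: two finance-related terms and one unrelated term.
EMBEDDINGS = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "thunderstorm": [0.0, 0.2, 0.9],
}


def most_similar(term, embeddings=EMBEDDINGS):
    """Return the other term whose embedding is closest to the given term's."""
    query = embeddings[term]
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in embeddings.items()
        if name != term
    ]
    return max(scored, key=lambda pair: pair[1])[0]


closest = most_similar("invoice")
```

With these made-up vectors, "invoice" lands next to "receipt" and far from "thunderstorm", which is the kind of relationship an embedding is meant to encode.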

Customers already use such features. Negahban notes how one customer used real-time vector search and hybrid vector search with vector embeddings generated by their models flowing into the Kinetica vectorized engine.
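Hybrid vector search generally means combining an ordinary attribute filter with a nearest-neighbor ranking over embeddings. The sketch below shows that combination in plain Python under stated assumptions: the catalog, field names, and two-dimensional vectors are hypothetical, the ranking is brute force, and none of it reflects Kinetica's actual interface.

```python
import math  # math.dist requires Python 3.8 or later

# Hypothetical records: each carries a structured attribute and an embedding.
CATALOG = [
    {"id": "doc-1", "region": "apac", "embedding": [0.1, 0.9]},
    {"id": "doc-2", "region": "emea", "embedding": [0.2, 0.8]},
    {"id": "doc-3", "region": "apac", "embedding": [0.9, 0.1]},
    {"id": "doc-4", "region": "apac", "embedding": [0.4, 0.6]},
]


def hybrid_search(records, query_vector, region, k=2):
    """Filter on a structured attribute, then rank survivors by vector distance."""
    # Step 1: relational-style predicate on an ordinary column.
    candidates = [r for r in records if r["region"] == region]
    # Step 2: brute-force nearest-neighbor ranking by Euclidean distance.
    ranked = sorted(
        candidates, key=lambda r: math.dist(r["embedding"], query_vector)
    )
    return [r["id"] for r in ranked[:k]]


top = hybrid_search(CATALOG, query_vector=[0.2, 0.8], region="apac")
```

Note that doc-2 matches the query vector exactly but is excluded by the region filter, so the result depends on both the structured predicate and the embedding distance.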

Negahban expects the number of use cases to increase "exponentially" as GenAI becomes mainstream in the SME and enterprise space. And he sees Kinetica playing a central role in streamlining and simplifying data engineering for real-time analytics.

“And we’ve only just started,” he concludes.

Winston Thomas is the editor-in-chief of CDOTrends. He's a singularity believer, a blockchain enthusiast, and believes we already live in a metaverse. You can reach him at [email protected].
