The Future of SQL Query Engine

Author
Lori Lu
Technical Evangelist, Kyligence
Jun. 12, 2022
 

Well, let’s have an OPEN and HONEST discussion about the Status Quo and the Future of SQL Query Engines for Big Data.
 

In September 2021, Matt Turck published a (long!) post, Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape, and provided a macro view of the MAD ecosystem for 2021. When I first saw the insanely packed ecosystem map from Matt’s article, I immediately felt the PAINs and STRUGGLEs from CIOs and Sales Teams. CIOs might feel deeply frustrated when thinking through the pros and cons of each option. Sales team leaders might be cursing a lot, thinking hard about how to survive the never-ending sales cycles as more competitors enter this market. However, this is just the beginning of how innovative the big data ecosystem could become. As data warehouses and lakehouses penetrate every organization on the planet, without a doubt, this landscape will become even more bloated.

 

Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape
 

As we analyze and keep track of the top players on the list over the years, we find each vendor has its unique value and market fit. For example, Dremio and Databricks fully embrace the next-gen, no-copy lakehouse architecture with loads of technical innovations and breakthroughs inside their offerings; the ClickHouse community is booming and ready to scale in the global market; Presto/Starburst stands out for running interactive federated queries; Apache Druid is favored for its lightning-fast analytical query performance and real-time capabilities; PingCAP offers a groundbreaking solution that handles mixed transactional and analytical workloads; Kyligence/Kylin modernizes and reinvents OLAP multi-dimensional cubes for cloud applications. Some niche startups, such as Firebolt, have become more visible recently, claiming sub-second query response times. Overall, all the disruptors are creating new value and breaking the status quo for the analytics community. They are working hard to move the needle for customers across industries.

 

What does this phenomenon indicate?

 

My answer sounds a bit negative: none of them can fulfill the diverse needs and evolving use cases across industries, from real-time analytics, OLTP, and OLAP to a hybrid of several.

 

What’s worse?

 

In today’s market, each vendor has developed its own unique value proposition by focusing on solving a specific challenge arising from a specific use case for a defined segment of buyers. It is almost impossible for a single company to build a perfect all-in-one product and take market share from its competitors. We have to admit the fact that:

 

There is no one-size-fits-all solution!

 

Thus, every company must purchase a different query engine for each specific use case. Looking into the future, this trend will continue, whether it’s for avoiding vendor lock-in or filling gaps that mainstream cloud vendors are unwilling to fill. Each company will end up adopting more than one data analytics product, and unavoidably, a new form of data silo will be created, with data held in different systems.

 

Data Silos?

 

No, we definitely don’t want to bring that back while we are reinventing the wheel in so many businesses for a better future.

 

There is something wrong with our market.

Why does this dilemma occur? Fundamentally, we believe there is a mindset flaw in all the players involved in the war of the modern data stack. Each player is thinking from their perspective instead of putting their customers first. “Your win is not equal to customers’ success.”

Customers don’t want data silos created by each query engine they buy. But in reality, customers also understand one size does NOT fit all!

So, how do we get out of this dilemma?

 

Here is our proposal:

 

All vendors need to collaboratively re-imagine and re-engineer the next generation of SQL query engines for the benefit of all parties:

 

A Unified Query Entry Point on Top of Decentralized Query Engines/Data Sources

 

For end data consumers, this middle layer creates a single entry point for accessing data silos transparently;

For tech vendors, it lets them play to their strengths and focus on solving their well-defined problems;

For buyers/companies, it delivers the best of all vendors without the worry of integration work.

On top of that, we need to add more value to this layer for our customers: this middle layer should be super performant, scalable, and LOW-COST.

This is our belief regarding what the future should look like:

The future SQL query engine should provide a unified query entry point on top of decentralized data sources and support high-concurrency, low-latency, real-time data access at LOW COST.

 

Here is our attempt:

 

Let me walk you through the underlying logic of how Kyligence designs its query engine to match future needs.

 

The next generation of SQL query engines should be well re-engineered to be super performant and scalable on top of decentralized data sources.
 

Performance and Cost

 

The Exponential Growth of Data is independent of Cost & Query Performance
 

First of all, we firmly believe that performance and cost are the two factors customers care most about. Therefore, we took the multi-dimensional database concept and built a modernized, distributed multi-dimensional database that fits a data lake of any shape. Dimensional cubes provide two major benefits.

First, performance gains and high concurrency: query results are precomputed (in other words, the heavy computation completes offline) and ready to serve downstream data consumers. At query run time, compute power is used mainly for retrieving query results and sending them back to consumers. This is why the Kyligence engine can handle a large volume of concurrent queries without sacrificing performance.

Second, cost cuts: precomputed query results, aka indexes, are reused as much as possible and can be refreshed by segment and partition, so the same computation is not paid for again and again. Over the long run, this saves customers a significant amount of money.
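To make the precomputation idea concrete, here is a minimal sketch in Python. It is not Kyligence's implementation; the fact rows, the `build_cube` and `query` helpers, and the choice of SUM as the only measure are all assumptions made for illustration. The offline step aggregates the measure for every combination of dimensions, and the query step is reduced to a lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales)
ROWS = [
    ("EU", "laptop", 2021, 100),
    ("EU", "phone", 2021, 50),
    ("US", "laptop", 2021, 70),
    ("US", "laptop", 2022, 30),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions):
    """Offline step: pre-aggregate SUM(sales) for every subset of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            positions = [dimensions.index(d) for d in dims]
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[p] for p in positions)
                totals[key] += row[-1]  # last column is the measure
            cube[dims] = dict(totals)
    return cube


def query(cube, **filters):
    """Query-time step: a dictionary lookup, with no scan of the raw rows."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(ROWS, DIMENSIONS)
eu_sales = query(cube, region="EU")
us_laptop_sales = query(cube, region="US", product="laptop")
total_sales = query(cube)
print(eu_sales, us_laptop_sales, total_sales)
```

The trade-off is visible even at this scale: the cube stores one aggregate table per dimension subset, so storage and build time grow with the number of dimensions, while each query costs roughly the same regardless of how many raw rows exist.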

For more technical details, please read this blog.

 

A Unified Query Entry Point on Top of Decentralized Data Sources

 
 

A modernized OLAP cube sits between data applications/consumers and decentralized data sources as a unified query entry point. It serves as a thin layer that lets users connect to different data sources without learning the specifics of each one.

Kyligence can query across data sources, including HDFS, Hive, RDBMSs, and cloud storage. This is not the same as the concept of federated queries.

For example, some customers create a separate project for each data source in the Kyligence platform. End-users from different business units can then access the data models built on top of each source directly through BI tools. Kyligence also makes it easier for the DevOps team to control data access in one place.
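The project-per-source pattern with access control in one place can be sketched roughly as follows. This is an illustration only, not the Kyligence API: the `UnifiedEntryPoint` class, its `register`, `grant`, and `query` methods, and the in-memory SQLite databases standing in for real data sources are all hypothetical.

```python
import sqlite3


class UnifiedEntryPoint:
    """Hypothetical single entry point that routes SQL to per-project sources."""

    def __init__(self):
        self._sources = {}    # project name -> database connection
        self._grants = set()  # (user, project) pairs allowed to query

    def register(self, project, connection):
        self._sources[project] = connection

    def grant(self, user, project):
        self._grants.add((user, project))

    def query(self, user, project, sql, params=()):
        # Access is checked here once, rather than in every data source.
        if (user, project) not in self._grants:
            raise PermissionError(f"{user} may not query {project}")
        return self._sources[project].execute(sql, params).fetchall()


def make_source(table, rows):
    """Stand-in for a real data source: an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (name TEXT, amount INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn


entry = UnifiedEntryPoint()
entry.register("sales", make_source("orders", [("EU", 100), ("US", 70)]))
entry.register("finance", make_source("costs", [("EU", 40), ("US", 25)]))
entry.grant("alice", "sales")
entry.grant("bob", "finance")

sales_total = entry.query("alice", "sales", "SELECT SUM(amount) FROM orders")
finance_eu = entry.query(
    "bob", "finance", "SELECT amount FROM costs WHERE name = ?", ("EU",)
)
print(sales_total, finance_eu)
```

Consumers only ever talk to `entry`, so adding or swapping a source changes one `register` call instead of every client.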

 
 

Furthermore, the Kyligence AI-Augmented Engine can detect commonly issued queries and automate index building to boost query performance and avoid wasting compute power on processing the same queries over and over again.
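One simple way to picture this kind of detection is a frequency count over the query history. The sketch below is an assumption-laden illustration, not how the Kyligence engine works internally: the query log format (one tuple of group-by dimensions per query), the `recommend_indexes` helper, and the `min_hits` threshold are all hypothetical.

```python
from collections import Counter


def recommend_indexes(query_log, min_hits=2):
    """Return dimension sets queried at least `min_hits` times, most frequent first."""
    # frozenset makes ("region", "year") and ("year", "region") the same pattern.
    counts = Counter(frozenset(dims) for dims in query_log)
    return [
        tuple(sorted(dims))
        for dims, hits in counts.most_common()
        if hits >= min_hits
    ]


# Hypothetical query history: the group-by dimensions of each past query.
QUERY_LOG = [
    ("region",),
    ("region", "year"),
    ("year", "region"),
    ("region",),
    ("product",),
    ("region",),
]

recommended = recommend_indexes(QUERY_LOG)
print(recommended)
```

Only the patterns that recur are worth precomputing; the one-off `("product",)` query stays below the threshold and would be answered without a dedicated index.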

 

Real-time Streaming Data

 

This feature is currently in beta. With it in place, users can easily combine streaming and batch data in one data model, with no coding required.

 

Use Case #1 — Data Governance in Lakehouse Era

 
 

Multi-dimensional Model, a tidy box of wide tables, eliminates duplications and minimizes cost/query with smart acceleration.

Another benefit of using Kyligence's modernized OLAP cube technology is that it helps you manage, eliminate, and reuse ETL pipelines. That may sound abstract, so allow me to put it in context:

First, you can think of a Kyligence OLAP cube as a box of flat tables, aka indexes. Now, a simple use case will illustrate how it works:

In 2021, one of Kyligence's customers faced the huge challenge of managing a flat table explosion. The issue arose because their internal teams were not used to reusing flat tables created by data pipelines that other teams owned. After adopting Kyligence as a data management tool, all teams started collaborating and creating shared cubes within the Kyligence platform. Kyligence cubes then automatically generate “flat tables” for all teams and intelligently manage the reuse and lifecycles of those “flat tables”. This is part of their solution for reducing 1000k flat tables to a reasonable number.

For more on this issue, read Stop the Madness! 1000k data warehouse tables got created out of 6k source tables in 2.5 years.
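The reuse-and-lifecycle idea can be illustrated with a small shared registry. This is a sketch under stated assumptions rather than a description of Kyligence internals: the `FlatTableRegistry` class, its `acquire` and `release` methods, and the rule that a flat table is identified by its set of columns are all hypothetical.

```python
class FlatTableRegistry:
    """Hypothetical registry that shares one flat table per distinct column set."""

    def __init__(self):
        self._tables = {}  # frozenset of columns -> (table name, teams using it)
        self._counter = 0

    def acquire(self, team, columns):
        """Reuse an existing flat table with the same columns, or create one."""
        key = frozenset(columns)
        if key not in self._tables:
            self._counter += 1
            self._tables[key] = (f"flat_{self._counter}", set())
        name, teams = self._tables[key]
        teams.add(team)
        return name

    def release(self, team, columns):
        """Drop the table once no team uses it any longer."""
        key = frozenset(columns)
        _, teams = self._tables[key]
        teams.discard(team)
        if not teams:
            del self._tables[key]

    def table_count(self):
        return len(self._tables)


registry = FlatTableRegistry()
marketing_table = registry.acquire("marketing", ["region", "sales"])
finance_table = registry.acquire("finance", ["sales", "region"])
ops_table = registry.acquire("ops", ["product", "sales"])
count_before = registry.table_count()

registry.release("ops", ["product", "sales"])
count_after = registry.table_count()
print(marketing_table, finance_table, ops_table, count_before, count_after)
```

Three requests produce two tables because marketing and finance ask for the same columns, and the table that nobody uses any more is retired instead of lingering in the warehouse.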

 

Use Case #2 — Data Mesh in Practice

 

Domain-oriented decentralized data ownership and architecture
 

If you understand the Data Mesh concept, you might find a great match between Kyligence and the idea of “data infrastructure as a central, shared service platform” that Data Mesh calls for.

Kyligence Governed Data Marts matches Data Domain from Data Mesh;

Kyligence Unified Query Entry Point on Top of Decentralized Data Sources matches Decentralized Data Ownership and Architecture from Data Mesh;

… , etc.

I often work with businesses that like to organize their data into domain-oriented projects and cubes. These businesses then use Kyligence Platform as a shared data infrastructure for all business people across teams. Let’s discuss this topic in detail in upcoming blogs.

   


 