Meet Your AI Copilot fot Data Learn More
Your AI Copilot for Data
Kyligence Zen Kyligence Zen
Kyligence Enterprise Kyligence Enterprise
Metrics Platform
OLAP Platform
Customers
Definitive Guide to Decision Intelligence
Recommended
Resources
Apache Kylin
About
Partners
In the Big Data field, Apache Kylin and Apache Druid (incubating) are two commonly adopted extreme OLAP Big Data engines, both of which enable fast querying on huge datasets. In the enterprises that heavily rely on big data analytics, they often run both for different use cases.
During the Apache Kylin Meetup in August 2018, the Meituan team shared their Kylin on Druid (KoD) solution. Why did they develop this hybrid system? What’s the rationale behind it? This article will answer these questions and help you to understand the differences and the pros and cons of each OLAP engine.
Apache Kylin is an open source distributed big data analytics engine. It constructs data models on top of huge datasets, builds pre-calculated OLAP cubes to support multi-dimensional analysis, and provides a SQL query interface and multi-dimensional analysis on top of Hadoop, with general ODBC, JDBC, and RESTful API interfaces. Apache Kylin’s unique pre-calculation ability enables it to handle extremely large datasets with sub-second query response times.
Graphic 1 Kylin Architecture
Druid was created in 2012. It’s an open source distributed data store. Its core design combines the concept of analytical databases, time-series databases, and search systems, and it can support data collection and analytics on fairly large datasets. Druid uses an Apache V2 license and is an Apache incubator project.
Apache Druid Architecture
From the perspective of deployment architectures, Druid’s processes mostly fall into 3 categories based on their roles.
The Historical node is in charge of loading segments (committed immutable data) and receiving queries on historical data.
Middle Manager is in charge of data ingestion and commit segments. Each task is done by a separate JVM.
Peon is in charge of completing a single task, which is managed and monitored by the Middle Manager.
Broker receives query requests, determines on which segment the data resides, and distributes sub-queries and merges query results.
Coordinator monitors Historical nodes, dispatches segments and monitor workload.
Overlord monitors Middle Manager, dispatches tasks to Middle Manager, and assists releasing of segments.
External Dependency
At the same time, Druid has 3 replaceable external dependencies.
Druid uses Deep storage to transfer data files between nodes.
Metadata Storage stores the metadata about segment positions and task output.
Druid uses Zookeeper (ZK) to ensure consistency of the cluster status.
Graphic 2 Druid Architecture
Data Source and Segment
Druid stores data in Data Source. Data Source is equivalent to Table in RDBMS. Data Source is divided into multiple Chunks based on timestamps, and data within the same time range will be organized into the same Chunk. Each Chunk consists of multiple Segments, and each Segment is a physical data file and an atomic storage unit. Due to performance consideration, size of a Segment file is recommended to be 500MB.
Graphic 3 Data Source & Segment
As Druid has features of both OLAP and Time-series database, its schema includes 3 types of columns: Timestamp, Dimension and Measures (Metrics). Timestamp column can be used for trimming Segments, and Dimension/Measures are similar to Kylin’s.
Graphic 4 Druid Schema
Druid’s Advantage
Meituan deployed into production an offline OLAP platform with Apache Kylin as its core component in 2015. Since then the platform has served almost all business lines with fast growing data volume and query executions, and the stress on the cluster has increased accordingly. Throughout the time, the tech team in Meituan keeps exploring better solutions for some of Kylin’s challenges. The major one is Apache HBase, the storage that Kylin relies on.
Kylin stores its data in HBase by converting the Dimensions and Measures into HBase Keys and Values, respectively. As HBase doesn’t support secondary index and only has one RowKey index, Kylin’s Dimension values will be combined into a fixed sequence to store as RowKey. In this way, filtering on a Dimension in the front of the sequence will perform better than those at the back (note: Kyligence Enterprise has no this limitation, see [1]). Here’s an example:
In the testing environment, there are two almost identical Cubes (Cube1 and Cube2). They both have the same data source and the same Dimensions/Measures. The only difference is the order of the Dimensions in the RowKey: Cube1 puts P_LINEORDER.LO_CUSTKEY at the first while Cube2 the last.
Graphic 5 Cube1 RowKey Sequence
Graphic 6 Cube2 RowKey Sequence
Now let’s query each Cube with the same SQL and compare the response time.
select S_SUPPKEY, C_CUSTKEY, sum(LO_EXTENDEDPRICE) as m1 from P_LINEORDER left join SUPPLIER on P_LINEORDER.LO_SUPPKEY = SUPPLIER.S_SUPPKEY left join CUSTOMER on P_LINEORDER.LO_CUSTKEY = CUSTOMER.C_CUSTKEY WHERE (LO_ORDERKEY > 1799905 and LO_ORDERKEY < 1799915) or (LO_ORDERKEY > 1999905 and LO_ORDERKEY < 1999935) GROUP BY S_SUPPKEY, C_CUSTKEY;
Below shows the time consumed and data scanned:
Graphic 7 Cube1 Query Log
Graphic 8 Cube1 Query Log
The test result shows that the same SQL query can perform 200x time differently. The primary reason for the difference is that the query has to scan a larger range of data in the HTable of Cube2.
In addition, Kylin’s multiple Dimension values are stored in a Key’s corresponding Value, so when querying on a single Dimension, unnecessary Dimensions are read as well, generating unnecessary IO.
To summarize, the limitation of HBase impacts the user’s experience, especially among business groups.
Kylin’s query performance and user experience can be greatly improved with pure columnar storage and multiple indexes on Dimensions. As analyzed above, Druid happens to meet the requirements of columnar + multi-index. So the Meituan Kylin team decided to try replacing HBase with Druid.
Why not just use Druid then? Meituan’s engineers shared their thoughts:
From the experiences of utilizing Druid and Kylin, we recognize the management and operational challenges in using Druid as the OLAP engine:
Therefore, it appears to be a promising OLAP solution to combine Druid’s excellent columnar storage with Kylin’s usability, compatibility, and completeness. Druid has columnar storage, inverted index, better filtering performance than HBase, native OLAP features, and good secondary aggregation capabilities. Meituan’s tech team decided to try replacing HBase with Druid as the storage for Kylin.
At v1.5, Apache Kylin introduced plugable architecture and de-coupled computing and storage components, which makes the replacement of HBase possible. Here is a brief introduction to the main design concept of Kylin on Druid based on Meituan engineer Kaisen Kang’s design doc. (Graphics 9 and 10 are from reference[1], and text are from reference[1] and [3])
Process for Building OLAP Cubes
Graphic 9 Process of Building Cube
Process of Querying Cube
Graphic 10 Process of Querying an OLAP Cube
Schema Mapping
In this article, we first analyzed features and pros/cons of both Kylin and Druid, and the reasons for unstable performance of HBase in Kylin in some cases. Then we searched solutions and found the feasible option of using Druid as the Kylin storage engine. At last, we illustrated the Kylin-on-Druid architecture and the processes developed by Meituan.
Stay tuned for our next article about how to use Kylin on Druid, how it performs, and how it can be improved.
[1]: Kyligence Enterprise has replaced HBase with KyStorage as the underlying storage engine; KyStorage is a patented fast columnar storage. With KyStorage, Kyligence Enterprise saves 50% of storage space compared to other SQL on Hadoop engines while increasing query speed by 15 times in average.
Discover all the facts about Apache Kylin and see how its extreme OLAP technology compares to the Augmented Analytics capabilities of Kyligence. Learn more about the augmented OLAP approaches of Apache Kylin and Kyligence on our Kylin vs. Kyligence comparison page.
The driving force behind Meituan’s success is not simply a robust analytics system, but the OLAP engine that system is built upon - Apache Kylin.
Cloud Analytics News will share the important news on Apache Kylin, Kyligence Cloud, and related technologies. In this edition, we cover Apache Kylin 4.X beta, the launch of Kyligence Cloud 4, Pivot to Snowflake, and more.
UnionPay was able to consolidate the 1,200 Cognos cubes into 2 Kyligence cubes and a single ETL process. Besides extending the life of the analytics executed against this data, there was a massive improvement in operational efficiency.
A peek behind the curtain of the world's leading open source big data analytics project, Apache Kylin.
An introduction to Apache Kylin's new storage and compute architecture, Apache Parquet. This article introduces Kylin's query principles, Parquet storage, and accurate duplicate removal
Already have an account? Click here to login
You'll get
A complete product experience
A guided demo of the whole process, from data import, modeling to analysis, by our data experts.
Q&A session with industry experts
Our data experts will answer your questions about customized solutions.
Please fill in your contact information.We'll get back to you in 1-2 business days.
Industrial Scenario Demostration
Scenarios in Finance, Retail, Manufacturing industries, which best meet your business requirements.
Consulting From Experts
Talk to Senior Technical Experts, and help you quickly adopt AI applications.