
Why Did Meituan Develop Apache Kylin on Druid?

Feb. 28, 2022

In the Big Data field, Apache Kylin and Apache Druid (incubating) are two widely adopted extreme OLAP engines, both of which enable fast queries on huge datasets. Enterprises that rely heavily on big data analytics often run both for different use cases. During the Apache Kylin Meetup in August 2018, the Meituan team shared their Kylin on Druid (KoD) solution. Why did they develop this hybrid system? What is the rationale behind it? This article answers these questions and helps you understand the differences between the two OLAP engines and the pros and cons of each.

Meituan is a Chinese shopping platform for local consumer products and retail services, including entertainment, dining, delivery, and travel.

Introduction to Apache Kylin

Apache Kylin is an open source distributed big data analytics engine. It constructs data models on top of huge datasets, builds pre-calculated OLAP cubes to support multi-dimensional analysis, and provides a SQL query interface on top of Hadoop through standard ODBC, JDBC, and RESTful API interfaces. Apache Kylin’s unique pre-calculation ability enables it to handle extremely large datasets with sub-second query response times.
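The idea of pre-calculation can be illustrated with a minimal sketch. This is not Kylin's implementation; the sample rows, the dimension names, and the `build_cube` helper are all hypothetical. It only shows why a query becomes a lookup once every combination of dimensions (each combination is one cuboid) has been aggregated ahead of time.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; in a real deployment these would come from a Hive table.
ROWS = [
    {"city": "Beijing", "category": "dining", "amount": 30},
    {"city": "Beijing", "category": "travel", "amount": 120},
    {"city": "Shanghai", "category": "dining", "amount": 45},
    {"city": "Beijing", "category": "dining", "amount": 25},
]
DIMENSIONS = ("city", "category")


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                # The group-by key is the row's values for this cuboid's dimensions.
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


CUBE = build_cube(ROWS, DIMENSIONS, "amount")

# "SELECT city, SUM(amount) ... GROUP BY city" is now a dictionary lookup.
sales_by_city = CUBE[("city",)]
grand_total = CUBE[()][()]
```

The build step scans the data once per cuboid, which is expensive, but it runs offline. At query time the cost no longer depends on the number of raw rows, which is the trade-off behind the sub-second response times described above.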


Apache Kylin’s Extreme OLAP Analytics Advantage

1. Mature, Hadoop-based computing engines (MapReduce and Spark) that provide strong pre-calculation capability on very large datasets and can be deployed out of the box on any mainstream Hadoop platform.
2. ANSI SQL support, which lets users analyze data with SQL directly.
3. Sub-second, low-latency query response times.
4. Common OLAP Star/Snowflake Schema data modeling.
5. A rich OLAP function set including Sum, Count Distinct, Top N, Percentile, etc.
6. Intelligent trimming of cuboids, which reduces the consumption of storage and computing power.
7. Support for both batch loading of very large historical datasets and micro-batch loading of data streams.

Introduction to Apache Druid (incubating)

Druid was created in 2012. It is an open source distributed data store whose core design combines concepts from analytical databases, time-series databases, and search systems, and it supports data collection and analytics on fairly large datasets. Druid is released under the Apache License 2.0 and is an Apache Incubator project.

Apache Druid Architecture

From the perspective of deployment architectures, Druid’s processes mostly fall into 3 categories based on their roles.

Data Node (Slave node for data ingestion and calculation)

The Historical node is in charge of loading segments (committed immutable data) and receiving queries on historical data.

The Middle Manager is in charge of ingesting data and committing segments. Each task runs in a separate JVM.

Peon is in charge of completing a single task, which is managed and monitored by the Middle Manager.

Query Node

The Broker receives query requests, determines which segments hold the requested data, distributes sub-queries, and merges the query results.
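The Broker's scatter-gather role can be sketched as follows. This is a simplified illustration rather than Druid's code: the `Segment` class, the node names, the integer timestamps, and the `broker_query` function are assumptions made for the example, and the "sub-query" is just a local function call instead of a network request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int    # inclusive start of the segment's time interval
    end: int      # exclusive end of the segment's time interval
    node: str     # hypothetical name of the Historical node serving it
    rows: tuple   # (timestamp, value) pairs stored in the segment


def overlaps(segment, start, end):
    """True if the segment's interval intersects the query interval [start, end)."""
    return segment.start < end and start < segment.end


def historical_sum(segments, start, end):
    """Stand-in for a sub-query executed on one Historical node."""
    return sum(
        value
        for seg in segments
        for ts, value in seg.rows
        if start <= ts < end
    )


def broker_query(segments, start, end):
    # 1. Determine which segments hold data for the query interval.
    relevant = [s for s in segments if overlaps(s, start, end)]
    # 2. Group them by node and issue one sub-query per node.
    by_node = {}
    for seg in relevant:
        by_node.setdefault(seg.node, []).append(seg)
    partials = {n: historical_sum(segs, start, end) for n, segs in by_node.items()}
    # 3. Merge the partial results into the final answer.
    return sum(partials.values()), sorted(by_node)


SEGMENTS = [
    Segment(0, 10, "historical-1", ((1, 5), (9, 7))),
    Segment(10, 20, "historical-2", ((10, 3), (15, 4))),
    Segment(20, 30, "historical-1", ((25, 100),)),
]

total, nodes_queried = broker_query(SEGMENTS, 5, 16)
```

Because segments are partitioned by time, the third segment is never touched for this query interval, so pruning by interval keeps the fan-out small.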

Master Node (Task Coordinator and Cluster Manager)

The Coordinator monitors the Historical nodes and their workload, and dispatches segments to them.
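As a rough illustration of segment dispatching, the sketch below places each segment on whichever Historical node currently holds the least data. This greedy least-loaded rule is an assumption chosen for brevity and is not Druid's actual balancing strategy; the segment names, sizes, and the `assign_segments` function are hypothetical.

```python
import heapq


def assign_segments(segment_sizes, historicals):
    """Greedily place each segment on the currently least-loaded node.

    segment_sizes: mapping of segment name to size (arbitrary units).
    historicals:   iterable of node names.
    """
    # The heap holds (current_load, node_name), so the lightest node pops first.
    heap = [(0, name) for name in sorted(historicals)]
    heapq.heapify(heap)
    assignment = {}
    # Placing the largest segments first gives a more even final distribution.
    for segment, size in sorted(segment_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[segment] = node
        heapq.heappush(heap, (load + size, node))
    return assignment


placement = assign_segments(
    {"seg-a": 50, "seg-b": 30, "seg-c": 20},
    ["h1", "h2"],
)
```

In this example both nodes end up with 50 units of data: `seg-a` alone on one node, and `seg-b` plus `seg-c` on the other.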

The Overlord monitors the Middle Managers, dispatches tasks to them, and assists with releasing segments.

External Dependency

Druid also has three replaceable external dependencies.

Deep Storage (distributed storage)

Druid uses Deep storage to transfer data files between nodes.

Metadata Storage

Metadata Storage stores the metadata about segment positions and task output.

ZooKeeper (cluster management and task coordination)

Druid uses ZooKeeper (ZK) to keep the cluster status consistent.
