Apache Kylin – Yet Another Hadoop Query Engine?

Li Kang|01 - 08 - 2019

Most people haven’t heard of Apache Kylin, the open source Apache project, and when they first hear about it, some are inclined to ask: is it yet another Big Data query engine? This is a fair question, but the answer is absolutely not. In this article, we’ll take a look at what Apache Kylin actually is.


What is Apache Kylin: An OLAP Engine on a Big Data Platform

According to its official web page, Apache Kylin™ is “an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark supporting extremely large datasets.” Although Kylin accepts ANSI SQL queries, its real power lies in how it handles analytics workloads, so it is more appropriate to call Kylin an analytics engine or, to be more specific, an OLAP engine on a big data platform.


The key to Kylin’s method is pre-calculation. For example, to answer the question ‘how many game consoles were sold in Washington state in December of 2018?’, a typical query engine will query the sales table and aggregate the results based on sales date, store location, and product category, using GROUP BY and WHERE clauses.
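To make the query-time approach concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a typical query engine. The sales table, its column names, and the rows are assumptions made up for illustration; they are not from any real dataset.

```python
import sqlite3

# Hypothetical sales table; the schema and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, state TEXT, category TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("2018-12-03", "Washington", "Game Console", 2),
        ("2018-12-17", "Washington", "Game Console", 1),
        ("2018-12-20", "Washington", "Headset", 4),
        ("2018-11-28", "Washington", "Game Console", 5),
        ("2018-12-09", "Colorado", "Game Console", 3),
    ],
)

# The engine scans the raw rows, filters them, and aggregates at query time.
query = """
    SELECT state, category, SUM(units) AS units_sold
    FROM sales
    WHERE state = 'Washington'
      AND category = 'Game Console'
      AND sale_date >= '2018-12-01' AND sale_date < '2019-01-01'
    GROUP BY state, category
"""
result = conn.execute(query).fetchall()
print(result)
```

Every such question repeats the scan and the aggregation, so the cost grows with the size of the sales table.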


Kylin, before it serves any query, pre-calculates aggregated sales volumes in terms of region, date, product category, and different combinations of these attributes, and saves the results in its datastore. When Kylin receives the query above, it looks up the pre-calculated values using the combined key ‘Game Console + Washington + December 2018’ to retrieve the value.
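The idea of pre-calculation can be sketched in a few lines of Python. This is a toy illustration only, not Kylin’s actual implementation: the dimension names, the rows, and the helper functions build_cube and lookup are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("category", "state", "month")

# Hypothetical raw sales rows: (category, state, month, units).
ROWS = [
    ("Game Console", "Washington", "2018-12", 2),
    ("Game Console", "Washington", "2018-12", 1),
    ("Headset", "Washington", "2018-12", 4),
    ("Game Console", "Colorado", "2018-12", 3),
]


def build_cube(rows):
    """Pre-aggregate units for every combination of dimensions."""
    cube = defaultdict(int)
    for *values, units in rows:
        record = dict(zip(DIMENSIONS, values))
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, record[d]) for d in dims)
                cube[key] += units
    return dict(cube)


def lookup(cube, **filters):
    """Answer a query with a single key lookup, without scanning raw rows."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0)


cube = build_cube(ROWS)
print(lookup(cube, category="Game Console", state="Washington", month="2018-12"))
print(lookup(cube, state="Washington"))
```

The aggregation work happens once, in build_cube. Each later question costs a single dictionary lookup, whichever combination of attributes it filters on.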


With Kylin, the analytics process includes three steps: 

  1. Identify a Star/Snowflake schema on Hadoop 
  2. Build cubes from the identified tables 
  3. Query with ANSI SQL and get results in sub-second time via ODBC, JDBC, or the RESTful API
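As a rough sketch of step 3 over the RESTful API, the snippet below prepares a request for Kylin’s query endpoint (POST /kylin/api/query) using only the Python standard library. The host, project name, table name, and credentials are placeholders chosen for illustration, and the request is built but deliberately not sent.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password):
    """Prepare (but do not send) a POST request to Kylin's query endpoint."""
    payload = {"sql": sql, "project": project, "limit": 100}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_query_request(
    base_url="http://localhost:7070",  # placeholder host and port
    project="sales_project",           # hypothetical project name
    sql="SELECT state, SUM(units) FROM sales GROUP BY state",  # hypothetical table
    user="ADMIN",                      # placeholder credentials
    password="KYLIN",
)
print(request.get_method(), request.full_url)

# Against a running Kylin instance, the request could be sent with:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```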


Apache Kylin Diagram


Apache Kylin trades the work of pre-calculating these cubes, and the space needed to store them, for the best possible query performance. Once cubes are built, future questions about sales, such as ‘How many headsets were sold in the state of Colorado in Q1 of 2017?’, can be answered by a simple lookup.


A year from now, when the cube has been updated with aggregated sales transactions for 2019, questions about sales figures from Q4 of 2019 can be answered just as easily. Pre-calculation (steps 1 and 2 above) happens only once, in the cube building phase. New transactions can be added to the cube through Incremental Build. Once the cube is built, all future queries can be answered by looking up the cube.
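One way to picture an incremental build is as a new, separately aggregated segment added alongside the existing ones. The sketch below is a simplified model under that assumption. The segment names, the data, and the functions incremental_build and total_units are hypothetical and do not reflect Kylin’s internal API.

```python
from collections import Counter

# Each segment holds pre-aggregated units for one time range (hypothetical data).
segments = {
    "2018": Counter(
        {("Headset", "Colorado"): 120, ("Game Console", "Washington"): 75}
    ),
}


def incremental_build(segments, name, new_rows):
    """Aggregate only the new rows into a new segment; old segments stay untouched."""
    segment = Counter()
    for category, state, units in new_rows:
        segment[(category, state)] += units
    segments[name] = segment


def total_units(segments, category, state):
    """Answer a query by looking up the same key in every segment."""
    return sum(segment[(category, state)] for segment in segments.values())


incremental_build(
    segments,
    "2019",
    [
        ("Headset", "Colorado", 30),
        ("Headset", "Colorado", 10),
        ("Game Console", "Washington", 5),
    ],
)
print(total_units(segments, "Headset", "Colorado"))
```

Only the new transactions are aggregated when the 2019 segment is added, so the cost of an update depends on the volume of new data and not on the full history.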


This consistent response time (normally sub-second in Kylin) for analytical queries, regardless of data volume and the number of users, is very hard to achieve in other query engines. This is why Apache Kylin is being used in production applications to support hundreds of thousands of users querying a dataset of billions of records with aggregations across tens of attributes.


OLAP Cubes for Big Data with Kylin and Kyligence 

If you have a Business Intelligence (BI) background, you’ve probably recognized that this is precisely how Multidimensional OLAP (MOLAP) engines work: by building these so-called cubes. For the many Hadoop data engineers who have never heard of cubes before, here is an introductory guide to OLAP cubes.


The concept of cubes has been around for quite a while and has been a key component in many BI products, but traditional OLAP engines struggle to handle the data volumes typically found in today’s data lakes. Kylin was designed from the ground up to leverage big data technologies and build OLAP analytics on petabyte-scale datasets.


Apache Kylin is deployed on the edge nodes of your Hadoop cluster. In Kylin’s graphical user interface, you can identify the tables in the star schema, define data models for the cubes, and submit jobs, which can run on Spark, to the Hadoop cluster to build the cubes. The content of these cubes is stored in HBase (Kyligence, a commercial version based on Kylin, uses a different storage engine, which we will cover in future articles).


Queries are sent to Kylin nodes, which retrieve results from HBase tables. As mentioned before, if the data is in the cube, query response time is consistent at sub-second levels since the query operation is a simple lookup. To support more concurrent users, you can just add more query nodes.


Learn More About Apache Kylin and Kyligence

Apache Kylin started as an in-house analytics project at eBay in late 2013. In October 2014, eBay donated the source code to the Apache Software Foundation, and Kylin graduated as an Apache Top Level Project in November 2015. In March of 2016, the creators of the Apache Kylin project launched a commercial enterprise version of the product: Kyligence. Apache Kylin won the InfoWorld “Best Open Source Big Data Tool” award two years in a row, in 2015 and 2016.


Today, Kylin and its related commercial products, Kyligence Enterprise and Kyligence Cloud, are deployed by many large enterprises worldwide in mission-critical applications. Kylin is not another query engine. Instead, it supplements other query engines. Users can use Apache Kylin together with other query engines in their day-to-day work. Kyligence Enterprise can leverage other query engines to query detailed records.


You now have a brief understanding of what Apache Kylin and Kyligence are, but you likely still have a few burning questions. For example, how long does it take to build a cube? What happens if a desired attribute is not in the cube? What if I need to dig into the details at the transaction level? How often can I update the cubes?  


The good news is that these concerns, along with many others, are addressed by Apache Kylin and the Kyligence Enterprise and Kyligence Cloud products. In coming articles, we’ll investigate these solutions further and address more common questions. In the meantime, for more information about Apache Kylin and Kyligence, please visit http://kylin.apache.org/ and http://kyligence.io/
