Most people haven’t heard of Apache Kylin, the open source Apache project, and when they first hear about it, some are inclined to ask: is it yet another big data query engine? This is a fair question, but the answer is absolutely not. In this article, we’ll take a look at what Apache Kylin actually is.
What is Apache Kylin: An Extreme OLAP Engine on Big Data Platform
According to its official web page, Apache Kylin™ is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spark, supporting extremely large datasets. Although it accepts ANSI SQL queries, its real power lies in how it handles analytics workloads. It is therefore more appropriate to call Kylin an analytics engine or, to be more specific, an extreme OLAP engine on a big data platform.
The key to Kylin’s method is pre-calculation. For example, to answer the question ‘how many game consoles were sold in Washington state in December of 2018?’, a typical query engine will scan the sales table and aggregate the matching rows by sales date, store location, and product category, using GROUP BY and WHERE clauses.
Kylin, before it serves any query, pre-calculates aggregated sales volumes by region, date, product category, and the different combinations of these attributes, and saves the results in its datastore. When Kylin receives the query above, it looks up the pre-calculated value using the combined key ‘Game Console + Washington + December 2018’.
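To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is an illustration of the idea only, not Kylin's implementation: the sample rows, the dimension names, and the helper functions `scan_and_aggregate`, `build_cube`, and `lookup` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales rows: (category, state, month, units_sold)
SALES = [
    ("Game Console", "Washington", "2018-12", 3),
    ("Game Console", "Washington", "2018-12", 2),
    ("Game Console", "Colorado", "2018-12", 4),
    ("Headset", "Washington", "2018-12", 7),
    ("Headset", "Colorado", "2017-01", 1),
]
DIMENSIONS = ("category", "state", "month")


def scan_and_aggregate(rows, **filters):
    """Query-time approach: scan every row and sum the ones that match."""
    total = 0
    for row in rows:
        record = dict(zip(DIMENSIONS, row[:3]))
        if all(record[dim] == value for dim, value in filters.items()):
            total += row[3]
    return total


def build_cube(rows):
    """Build-time approach: pre-aggregate every combination of dimensions."""
    cube = defaultdict(int)
    for row in rows:
        record = dict(zip(DIMENSIONS, row[:3]))
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((dim, record[dim]) for dim in dims)
                cube[key] += row[3]
    return cube


def lookup(cube, **filters):
    """Answer a query with one dictionary lookup on the combined key."""
    key = tuple((dim, filters[dim]) for dim in DIMENSIONS if dim in filters)
    return cube.get(key, 0)


cube = build_cube(SALES)
question = {"category": "Game Console", "state": "Washington", "month": "2018-12"}
print(scan_and_aggregate(SALES, **question))  # scans all rows
print(lookup(cube, **question))               # single key lookup
```

Both calls return the same total. The difference is where the work happens: the scan touches every row at query time, while the lookup relies on aggregation done once at build time.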
With Kylin, the analytics process includes three steps:
- Identify a Star/Snowflake schema on Hadoop
- Build cubes from the identified tables
- Query with ANSI SQL and get results in under a second via ODBC, JDBC, or a RESTful API
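As a rough sketch of the third step, the snippet below prepares a SQL query for Kylin's REST interface using only the Python standard library. It assumes the query endpoint is `POST /kylin/api/query` with a JSON body containing `sql` and `project` and HTTP basic authentication, which is worth verifying against the REST API documentation for your Kylin version. The host, project name, table and column names, and credentials are hypothetical.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password):
    """Prepare (but do not send) an HTTP POST carrying a SQL query."""
    payload = json.dumps({"sql": sql, "project": project, "limit": 100})
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",  # assumed endpoint path
        data=payload.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical table and column names for the game console question.
sql = (
    "SELECT SUM(units_sold) FROM sales "
    "WHERE category = 'Game Console' AND state = 'Washington' "
    "AND sale_month = '2018-12'"
)
request = build_query_request(
    base_url="http://kylin.example.com:7070",  # hypothetical host
    project="sales_demo",                      # hypothetical project
    sql=sql,
    user="analyst",
    password="secret",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Kylin instance.
```

The request is only constructed here, so the sketch runs without a Kylin server. The same SQL could be issued through a JDBC or ODBC connection instead.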
Apache Kylin trades the up-front work of pre-calculating these OLAP cubes, and the space needed to store them, for the best possible query performance. Once cubes are built, future questions about sales, such as ‘How many headsets were sold in the state of Colorado in Q1 of 2017?’, can be answered by a simple lookup.
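The storage side of this trade-off is easy to quantify for a full cube: with n dimensions there are 2^n combinations of dimensions to pre-aggregate. The short sketch below enumerates them for the three attributes used in the example. The `cuboids` helper is hypothetical, and it counts every combination. A production engine may well prune combinations that are never queried.

```python
from itertools import combinations


def cuboids(dimensions):
    """List every dimension combination a full cube would pre-aggregate."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


dims = ("region", "date", "product_category")
for combo in cuboids(dims):
    print(combo)

# The count doubles with every added dimension.
for n in (3, 10, 20):
    print(n, "dimensions ->", 2 ** n, "combinations")
```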
A year from now, when the cube has been updated with aggregated sales transactions for 2019, questions about sales figures from Q4 of 2019 can be answered just as easily. Pre-calculation (steps 1 and 2 above) happens only once, in the cube building phase. New transactions can be added to the cube through Incremental Build. Once the cube is built, all future queries can be answered by looking up the cube.
This consistent response time (normally sub-second in Kylin) for analytical queries, regardless of data volume and the number of users, is very hard to achieve in other Hadoop query engines. This is why Apache Kylin is being used in production applications to support hundreds of thousands of users querying a dataset of billions of records with aggregations across tens of attributes.
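The idea behind an incremental build can be sketched in a few lines of Python: aggregate only the new batch of transactions, then fold those totals into the existing cube so older data is never re-read. This is a conceptual illustration under simplifying assumptions, not a description of how Kylin stores or merges its data. The sample rows and the `aggregate` and `incremental_build` helpers are hypothetical.

```python
from collections import Counter


def aggregate(transactions):
    """Pre-aggregate units sold per (category, state, quarter) key."""
    segment = Counter()
    for category, state, quarter, units in transactions:
        segment[(category, state, quarter)] += units
    return segment


def incremental_build(cube, new_transactions):
    """Fold a new batch into the cube without re-reading older data."""
    cube.update(aggregate(new_transactions))  # Counter.update adds counts
    return cube


# Hypothetical historical data, already built into the cube.
history = [
    ("Headset", "Colorado", "2017-Q1", 4),
    ("Headset", "Colorado", "2017-Q1", 1),
    ("Game Console", "Washington", "2018-Q4", 5),
]
cube = aggregate(history)

# A year later, only the new transactions are processed.
batch_2019 = [
    ("Headset", "Colorado", "2019-Q4", 2),
    ("Headset", "Colorado", "2019-Q4", 4),
]
incremental_build(cube, batch_2019)

print(cube[("Headset", "Colorado", "2017-Q1")])  # older answer still available
print(cube[("Headset", "Colorado", "2019-Q4")])  # new period answered by lookup
```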
OLAP Cubes for Big Data with Apache Kylin and Kyligence
If you have a Business Intelligence and analytics background, you have probably recognized that this is precisely how Multidimensional OLAP (MOLAP) engines work: by building these so-called cubes. For the many Hadoop data engineers who have never heard of cubes before, here is an introductory guide to OLAP cubes.
The concept of cubes has been around for quite a while and has been a key component in many business intelligence tools and products, but traditional OLAP engines struggle with handling the data volumes typically found in today’s data lakes. Kylin was designed from the ground up, leveraging big data OLAP technology to build OLAP analytics on petabyte scale datasets.
Apache Kylin is deployed on the edge nodes of your Hadoop cluster. In Kylin’s graphical user interface, you can identify the tables in the star schema, define data models for the cubes, and submit jobs to the Hadoop cluster, run with Spark or MapReduce, to build the cubes. The content of these cubes is stored in HBase (Kyligence, a commercial version based on Kylin, uses a different storage engine, which we will cover in future articles).
Queries are sent to Kylin nodes, which retrieve results from HBase tables. As mentioned before, if the data is in the cube, query response time is consistent at sub-second levels since the query operation is a simple lookup. To support more concurrent users, you can just add more query nodes.
Learn More About Apache Kylin and Kyligence
Apache Kylin started as an in-house analytics project at eBay in late 2013. In October 2014, eBay donated the source code to the Apache Software Foundation, and Kylin graduated as an Apache Top Level Project in November 2015. In March of 2016, the early contributors to the Apache Kylin project launched a commercial enterprise version of the product: Kyligence. Apache Kylin has won the InfoWorld “Best Open Source Big Data Tool” award two years in a row, for 2015 and 2016.
Today, Kylin and its related commercial products, Kyligence Enterprise and Kyligence Cloud, are deployed by many large enterprises worldwide on mission-critical applications. Kylin is not another query engine. Instead, it supplements other query engines. Users can use Apache Kylin together with other query engines in their day jobs. Kyligence Enterprise can leverage other query engines to query detailed records.
You now have a brief understanding of what Apache Kylin and Kyligence are, but you likely still have a few burning questions. For example, how long does it take to build a cube? What happens if a desired attribute is not in the cube? What if I need to dig into the details at the transaction level? How often can I update the cubes?
The good news is that these concerns, along with many others, are addressed by Apache Kylin and the Kyligence Enterprise and Kyligence Cloud products. In coming articles, we’ll investigate these solutions further and address more common questions. In the meantime, for more information about Apache Kylin, Kyligence, and the augmented OLAP analytics they provide, please visit http://kylin.apache.org/ and http://kyligence.io/.