How Apache Kylin Is Rapidly Changing the Way We Approach Big Data

Author
Samantha Berlant
Communications Manager, Kyligence Marketing
Apr. 03, 2020

Kaige Liu is an official Committer for the open source top-level Apache Kylin project. He is passionate about improving the project and actively helps the community. When it comes to OLAP on big data, as Kaige puts it, “Apache Kylin is definitely the best.” 

Kaige Liu

If you're ready to get started with Kylin, keep reading, and if you want to begin working with it right now, check out our Apache Kylin Quick Start Guide.

Jump to the following topics:

  1. General Information about Apache Kylin and the Community
  2. How Apache Kylin Works and Its Capabilities
  3. Who's Adopting Kylin and Relevant Use Cases
  4. Apache Kylin vs. Commercial Solutions
  5. What's New with Apache Kylin


Q&A with Apache Kylin Committer, Kaige Liu

What exactly is Apache Kylin? 

Apache Kylin is a tool for OLAP on big data. OLAP tools have been around for 20+ years and are a proven solution for a lot of companies when it comes to making BI decisions and performing big data analysis. What Apache Kylin does is apply this OLAP theory in the area of big data.

If you already use OLAP tools in your traditional technology and want to use the same with your big data, the only choice you have is Apache Kylin. We do have some competitors, but when it comes to OLAP, Apache Kylin is definitely the best one in the world that you can find.


Kylin Diagram
An Overview of Apache Kylin

When did you begin working on Apache Kylin? 

I joined the Apache Kylin community when I joined Kyligence in 2016. Before that, I knew Apache Kylin was a great project and it was the reason I chose to work at Kyligence. Previously, I was working on cloud computing and virtualization, so this was the first time I worked in big data.  


When did you first hear about the open source project? 

When I first learned about Apache Kylin, it was still in the incubation stage at eBay. At that time, I hadn’t participated in or contributed to the project. I just knew there was a project working on OLAP on big data and I found it pretty interesting. I read about it but never had a chance to take a deep look into it until I joined Kyligence and met the creators of the Apache Kylin project.  


How did you get started with Apache Kylin?  

The initial team members of Kyligence are the founders of Apache Kylin. They taught me how to contribute to the community and work on the project, and I’ve been working on this great project since then. 


Who runs Apache Kylin? 

Apache Kylin is a project under the Apache Software Foundation. It is run by a group we call the PMC (Project Management Committee). The PMC members manage the whole project. They have the authority to determine which direction the project should go and are responsible for maintaining the lifecycle of the project. 


Who else works on this open source project? 

Other than the PMC, there are a bunch of people who work on Apache Kylin called Committers. I have just been invited to become a Committer, which means I have the authority to merge and review other Contributors’ code and help them contribute to the community. 


Are there other ways for people to contribute to the community? 

If you’re not a Committer, you can still work on the project. We call participants at this level Contributors and anyone can become a Contributor. There are many ways to participate at this level. You can write a blog on the project to introduce it to others, answer questions in the community, help others with their projects and anything else that supports the project.  


Apache Kylin Community Logo

How is Apache Kylin structured as an organization?  

If you do anything to support the community, you are a Contributor. When the community thinks you’ve contributed a lot, they will invite you to become a Committer. If you work a lot on the project as a Committer, they will invite you to become a PMC member. Then you will have the authority to determine the future of the project. The PMC, Committer, Contributor structure is specific to the Apache Software Foundation.  


How do I start contributing to Apache Kylin? 

It is very easy to get started. Apache Kylin has a lot of information available on the website and the community is very supportive.


What stood out to you about Apache Kylin? Why did you choose this project? 

It’s how I found my direction in big data. Working on Apache Kylin helped me learn about other big data technologies because it contains a lot of common tools like Hadoop, HBase and Hive.

It gives you a whole picture of the big data space so you can see which part of the technology you are most interested in, and then dive into that part. Contributing to Apache Kylin gave me the opportunity to quickly become an expert in big data. 


What is the main takeaway people should understand about Apache Kylin? 

The first thing we always mention about Apache Kylin is the performance. Most of our users chose this platform because they ran into issues dealing with very large data volumes on their previous systems.

Apache Kylin was designed from the beginning to handle massive volumes of data. More and more companies are migrating their data from traditional technology to big data technology and seeking a solution that can handle their growing amounts of data. That’s where Apache Kylin comes in. 


Why should people care about Apache Kylin and your work? 

I think the quick answer is: People should care about Apache Kylin because this product can solve their real problems. The number and size of our users demonstrate this. More than 1,000 users have already adopted this project as their production solution, which means we can solve their production issues in their real-world scenarios. 


More About How Apache Kylin Works

How does Apache Kylin improve performance?  

Apache Kylin deals with data in much the same way as traditional OLAP technology. However, when you implement traditional technology in the traditional way, you cannot deal with a lot of data. You can only work on a single machine, you are limited by its memory and CPU, and you can’t get better performance as your data grows.  

To deal with this problem, Apache Kylin transitions the traditional technology to a distributed cluster and uses big data technology like Hadoop to provide better performance on larger data volumes. 

It also leverages pre-calculation to improve query performance and concurrency. Instead of scanning and calculating data on the fly, it stores aggregated data in cubes, which means it can achieve sub-second query latency even on trillions of rows of data. 
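
To make the pre-calculation idea concrete, here is a minimal sketch in Python. It is not Kylin's implementation: the rows, the dimension names and the `build_cube` and `query` helpers are all hypothetical, and a real cube is built by distributed jobs rather than in memory. The point is only that every aggregate is computed once, ahead of time, so a query becomes a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, year, sales).
ROWS = [
    ("APAC", "phone", 2019, 120),
    ("APAC", "laptop", 2019, 300),
    ("EMEA", "phone", 2019, 80),
    ("EMEA", "phone", 2020, 95),
    ("APAC", "phone", 2020, 150),
]
DIMENSIONS = ("region", "product", "year")


def build_cube(rows, dimensions):
    """Pre-aggregate SUM(sales) for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), size):
            totals = defaultdict(int)
            for row in rows:
                # Group by the values of the chosen dimensions only.
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]
            names = tuple(dimensions[i] for i in dims)
            cube[names] = dict(totals)
    return cube


def query(cube, **filters):
    """Answer SUM(sales) with a dictionary lookup, without scanning ROWS."""
    names = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in names)
    return cube[names].get(key, 0)


CUBE = build_cube(ROWS, DIMENSIONS)
print(query(CUBE, region="APAC"))               # 570
print(query(CUBE, product="phone", year=2020))  # 245
print(query(CUBE))                              # 745 (grand total)
```

The trade-off in this sketch is the usual one for pre-aggregation: build time and storage grow with the number of dimension combinations (here 2^3 = 8), in exchange for query cost that no longer depends on the number of raw rows.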


How much better is the performance on Apache Kylin? 

“Sub-second query response on massive datasets” is a good way to describe the performance Apache Kylin or Kyligence can provide. If you work in data analysis, you will know how hard it is to get a sub-second response. 


Who's Using Apache Kylin?

https://youtu.be/IVVIXYd2EIA
Use Cases for Apache Kylin in APAC

What are some interesting use cases of Apache Kylin? 

We currently have over 1,000 users worldwide who have already adopted Apache Kylin. There are many famous examples like Yahoo! Japan and Amazon, as well as a lot of companies in China such as Baidu (a search engine company like Google), Alibaba (the biggest e-commerce company in China) and Didi (which is like Uber). 


Who does Apache Kylin help most? 

I think Apache Kylin helps data engineers and data analysts the most in their work.


Can you give an example of how Apache Kylin makes work easier for data engineers? 

When the end-user wants some data, they give their requirements to the data engineer. The engineer has to understand the whole scenario to know which database and which tables to get the data from to fulfill the requirements, which does not come naturally because they are more familiar with code and data than the end-users’ logic and business terms.

Apache Kylin can make their job significantly easier. It frees the engineer from this workflow and allows them to focus their time and energy on more efficient, productive efforts.


What changes about this workflow when a data engineer has access to Apache Kylin? 

With Apache Kylin, data engineers don’t need to maintain all of their jobs to fetch the table because the table has already been defined in Apache Kylin. All the jobs to fetch the data and analyze the data have been automatically generated by Apache Kylin; they don’t need to maintain the MapReduce jobs or Spark jobs to fetch this data.

They can even expose the API to the end-user so the end-user can fetch or define the data they want directly, without having to explain to the engineer what kind of data they need.  

The engineer can then shift their focus from understanding the business scenarios and fetching the data to maintaining the framework that lets end-users fetch data by themselves. That’s what we call self-service data analysis, and it makes the data engineer’s job much easier and way more efficient. 
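
As a rough illustration of what exposing the API to an end-user can look like, the sketch below builds a SQL query request in Python. It assumes Kylin's REST query endpoint (`POST /kylin/api/query` with basic authentication and a JSON body containing `sql` and `project`); the host, credentials, project name and table are placeholders, and the `build_query_request` helper is hypothetical. Check the REST API documentation for your Kylin version before relying on any of these details.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a POST request carrying a SQL query."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    body = json.dumps({"sql": sql, "project": project, "limit": limit}).encode()
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",  # assumed endpoint path
        data=body,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


REQUEST = build_query_request(
    base_url="http://localhost:7070",   # placeholder host and port
    user="analyst",                     # placeholder credentials
    password="secret",
    project="sales_project",            # hypothetical project name
    sql="SELECT region, SUM(sales) FROM sales_fact GROUP BY region",
)

# Sending is left out on purpose; against a real server it would be:
#   with urllib.request.urlopen(REQUEST) as response:
#       result = json.load(response)
print(REQUEST.get_method(), REQUEST.full_url)
```

Because the analyst only writes SQL against tables that are already modeled, nothing in this request depends on the engineer maintaining a dedicated MapReduce or Spark job for it.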


You also mentioned data analysts. How does this new workflow benefit them? 

The same process that makes the data engineer’s work very hard in the traditional structure is also a struggle for the end-user, usually a BI analyst or a data analyst. If you are a data analyst, you want to fulfill your boss’ requirements. For example, your boss wants to see some numbers relating to the business and this request is usually very urgent.

You have to design the business logic to calculate this number before you can give it to the CEO. Decision-makers don’t want to, and can’t, wait for a week, or half a month, to get a result like this. Usually, this number needs to be given to them in less than one day so they can use this data to make a decision. If it takes too long, the data and decision become irrelevant. 


https://youtu.be/18CDJm9OVjQ
An Overview of Apache Kylin's Extreme OLAP Engine

What does Apache Kylin change for data analysts? 

The typical workflow for analysts requires that they spend a lot of time discussing the scenario with the data engineer who must then implement the code, maintain the job, and fetch the data to get the result. Often, this takes more than one week or sometimes two weeks. 

When you adopt Apache Kylin, you significantly reduce the time to gather the data and the time from requesting the data to an actionable business result. It usually takes one day, or a half-day, to get a result with Apache Kylin. That helps the analyst a lot because all of the requirements from your end-user, from your boss, and from your customers, can be fulfilled in a shorter time than they would have before. 


Open Source vs. Commercial

Are there downsides to going with the open source version instead of the commercial one? 

Open source projects are all about technological innovation. They can provide you with fancy new tools, but they also may not be secure or stable. 


What does it take for a business to run a successful production environment on Apache Kylin? 

When you are using an open source project for your production environment, you have to put a lot of effort into maintaining it and fixing bugs, or you have to maintain your own branch, which means keeping your own code separate from the main branch of the community.  


Apache Kylin Users
A Sample of Companies Leveraging Apache Kylin

How do most of the current users maintain their own branch? 

A lot of customers using Apache Kylin in their production environment have a strong development team. They have developers maintaining their own code. They fork this code from the community’s main branch, the master branch, and maintain their own branch.

They have to do this because companies have a lot of features and security requirements in their production environment that aren’t provided by any open source technology.   

Open source projects are only focused on advancing the features of the technology. Companies want more than performance, features and fancy new tools. What they want most is security while they use the tool to increase their business’ efficiency.  


Open source: 

  • Requires effort to maintain the production environment 
  • Requires a team of developers who can maintain your own branch with your own code  
  • Requires you to fix bugs yourself 
  • Requires you to implement your own security requirements  
  • Requires you to build your own custom features 
  • No guaranteed stability
  • No guaranteed security 
  • No support provided


What does Kyligence offer that Apache Kylin can’t? Why do some users find that it is in their best interest to go with the commercial version instead of the open source option? 


#1 Ease of Use 

The first reason is the amount of effort you want to put into the product. Kyligence provides a commercial version of Apache Kylin that is easy to use. Using the enterprise version of this product saves businesses a lot of effort, time and resources when it comes to maintaining their tools.

You may only need one person, or half of a person, to maintain this product instead of the team of developers you need to maintain the open source version. That will save you a lot of money.  

  • Easy to use 
  • Saves time, effort and resources that would have been spent maintaining the tool 


#2 Security 

The second reason users choose the enterprise version over the open source tool is security. Open source projects don’t guarantee a secure solution. They also don’t integrate with companies’ security systems, but the commercial version has to provide such features.

When a customer decides to buy this product, they evaluate our capabilities regarding their security. We have to show them we have already passed a lot of security testing and that Kyligence can integrate with common security frameworks in their market.  

  • Guarantees a secure solution 
  • Integrates with common security frameworks


#3 Service/Support 

The third reason companies choose Kyligence over Apache Kylin is service. You won’t receive service or support from an open source tool. 

  • Support teams are available to assist with bug fixes and product customization 


How do users solve their issues on Apache Kylin if there is no support available? 

If you have an issue while using Apache Kylin, you have two choices. The first option is that you fix this issue yourself. That’s why a lot of companies that use open source tools have a strong development team to fix any issues that come up.  

Another option you have is to go to the community. That’s what I did in the Apache Kylin community as a Contributor. I helped other users find answers to their questions, but this is not a guaranteed solution. I contribute to the community in my spare time; it’s not my full-time job and I am not able to take responsibility for fixing a user’s issues. 

With the commercial version, however, we have a designated customer success service team. You are guaranteed to have a stable production environment. If you run into some issues, we can provide you with 24/7 service. 


#4 Your Features Prioritized 

The fourth reason companies go with the commercial version is to have their features prioritized and their unique needs addressed.  

You can’t control the roadmap of the open source project. The PMC controls the roadmap and they won’t consider your business scenarios when they design it. They won’t discuss with your company what features or scenarios you want to meet in the next release. Open source projects only focus on advancing their technology.  

When you use the commercial version, you have a relationship and an agreement with Kyligence. This means you can get your features set as a high priority for a designated release.

For example, some of our customers give us annual feedback and send us their specific requirements. We make their requirements our first priority in the following releases of our product. We incorporate our customers’ requirements into our roadmap and give the client an ETA for when they can expect the new feature they requested to be live.

We are very clear about our roadmap, which provides a stability to our clients that is not available in the open source version. 


#5 Integration with Your Ecosystem 

Another thing for users to consider when choosing a solution is their ecosystem. The Apache Kylin community doesn’t really maintain an ecosystem because it doesn’t have the resources to focus on that. It only provides some integration with common BI tools and data sources, but in the commercial version we have a lot of partners and these partners help us build ecosystems for different users.

You need a solution that fits your organization regardless of the BI tools and data sources you use or whether you are on-premises or on the Cloud. If you use Azure, yep, Microsoft is our partner, and together we have built an optimal product that gives you the best of stability, performance and scalability.


What’s new with Apache Kylin?

Ready to get started? Download the latest version of Apache Kylin here: http://kylin.apache.org/docs/release_notes.html  

You can also refer to our Apache Kylin Quick Start Guide here: https://kyligence.io/resources/apache-kylin-quick-start-guide/


The latest improvements include: 

  • Segments of not only streaming cube but also batch cube need to show their status 
  • DiagnosisInfoCLI block forever at “beeline --version” 
  • Unclosed Hive session causes too many temp files 
  • Return error when execute “explain plan for SQL” to get the execution plan of SQL 
  • SegmentPruner add the checks for “OR” filtering 
  • SegmentPruner cannot prune a segment with “IN” or “OR” CompareTupleFilter 
  • Use HFileOutputFormat3 in all places to replace HFileOutputFormat2 
  • TopN Comparator may violate its general contract 
  • Build Server OOM 
  • Fix security issues reported by code analysis platform LGTM 


Latest bug fixes: 

  • Project schema update event causes error reload NEW DataModelDesc 
  • Exception in update metrics when the response is null 
  • Kylin parse SQL error 
  • Failed to load table metadata from JDBC data source 
  • kylin_streaming_model broke when changing kylin.source.hive.database-for-flat-table to non-default value 
  • Read function in NoCompressedColumnReader is wrong 
  • FechRunnner should skip the job to process other jobs instead of throwing an exception when the job section metadata is not found 
  • Fix the error “Cannot read property ‘index’ of null” on the visualization page 
  • When using server-side PreparedStatement cache, the query results do not match TopN scenario 
  • Instances displayed on the Query Node are inconsistent with Job Node 
  • Build cube throw NPE error when partition column is not set in JDBC Data Source 
  • Create a Real-time streaming cube but not define a partition column should throw an exception 
  • Project list cannot be correctly sorted by “Create Time” 
  • Param Value should be required when creating a cube and adding a new measure 

