Introducing Real-Time Streaming Data Analysis on a Unified Analytics Platform
This article outlines key highlights from Apache Kylin’s latest release, version 3.0. It was written in collaboration with Kaige Liu, a Committer in the Apache Kylin open source community and a Senior Solutions Architect at Kyligence, the company behind an enterprise-ready version of Kylin.
If you missed Kaige’s Apache Kylin 101 tutorial you can catch up on the details here.
Real-Time Streaming Data Analysis
[KYLIN-3654] - Kylin Real-Time Streaming
Version 3.0 of Apache Kylin introduced real-time streaming data analysis to the open source OLAP-on-big-data platform. Since version 1.6, Apache Kylin has supported near-real-time streaming data analysis by treating streaming data in the same way as batch data. With this method, Kylin was able to handle streaming data with minute-level delay. However, for most business scenarios, near-real-time isn’t good enough, which is why the Kylin community pushed to get this new feature implemented as soon as possible.
Real-time streaming data analysis was developed in 2018 at eBay, where Apache Kylin itself was incubated. eBay implemented and ran this real-time solution for more than a year. Once it was stable and backed by a solid performance benchmark, eBay contributed the feature to the open source community. Now that it is part of Apache Kylin, the real-time feature can prepare streaming data for analysis with millisecond-level delay.
Version 3.0 of Apache Kylin can now support sub-second-level OLAP analysis on historical and real-time data. Users can use the same OLAP platform constructed with Apache Kylin to analyze different scenarios across both batch and streaming data.
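To make "the same OLAP platform" concrete, here is a minimal sketch of how a client might prepare a SQL query for Kylin's REST query endpoint (`POST /kylin/api/query`). The host, project name, table, column names, and credentials below are placeholders invented for illustration. The request is only built, not sent.

```python
import base64
import json
import urllib.request


def build_kylin_query(base_url, project, sql, user, password, limit=1000):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # Kylin's REST API uses HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Hypothetical host, project, table, and credentials (assumptions, not from the article).
query_request = build_kylin_query(
    "http://kylin.example.com:7070",
    "demo_project",
    "SELECT minute_start, COUNT(*) AS txn_count "
    "FROM transactions GROUP BY minute_start",
    "ADMIN",
    "KYLIN",
)
# To run it against a reachable Kylin instance:
#   with urllib.request.urlopen(query_request) as response:
#       result = json.load(response)
```

The same request shape would apply whether the cube behind the table is built from batch data or from a streaming source, which is the point of querying through one interface.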
Application Scenarios for Real-Time Streaming Data Analysis
E-commerce stores need real-time analysis in order to protect their transactions. They need to detect malicious transactions and block them as they happen, before they can be completed. Without a way to analyze their streaming data, they cannot block such transactions, which leaves both the stores and their customers vulnerable.
IoV Industry – Rideshare Companies (e.g. Uber, Didi)
This feature can be leveraged to help keep customers safe. Companies can analyze the data collected from a car. If that car demonstrates abnormal behavior, such as taking a significantly different route than the planned route, that may indicate something has gone wrong. Now they can detect this abnormal behavior immediately and act on it to protect passengers and drivers.
IoT Industry – Device Security
Any company that operates a large number of devices can collect information from them. By analyzing this streaming data, the company can identify in advance which devices are likely to have problems and replace them before they cause issues or delays in production.
By analyzing user behavior together with a user’s historical data, businesses can better understand the characteristics of this user and their interests. They can provide precision marketing solutions and forecast likely behaviors. For example, apps like TikTok or YouTube analyze your interests and your history of watched videos together with the video you’re currently watching to recommend the videos you might find most interesting to view next.
A Unified Analysis Platform
In addition to the introduction of real-time streaming data analysis, Apache Kylin now provides a way to unify batch data analysis together with streaming data analysis. Normally, if you wanted to do this, you’d have to deploy two kinds of technology together – one to do the batch data analysis and another system to do the streaming analysis.
This setup causes a lot of inconvenience: you need to maintain two different systems with two different architectures, and end users have to combine the two result sets before they can analyze the final results.
With the latest version of Apache Kylin, you can now analyze historical data and real-time data together: batch processing handles the historical data analysis, and the real-time feature handles the streaming data analysis. These two different data types can now be analyzed in one system, with the same interface and the same tools.
You no longer need to maintain different technologies, and end users can query different kinds of data through the same interface. This significantly reduces maintenance costs and gives end users a better experience. It also improves data quality, which is typically the biggest weakness of streaming data.
When real-time data is streaming, you typically don’t have a chance to modify it. Apache Kylin offers a unified analysis platform that puts the batch and streaming data together. Kylin not only puts the streaming data into the analysis layer but it also writes a replica to the batch layer so the streaming data will be saved together with the historical data.
If something is wrong, you can then change the streaming data in the batch layer, or resubmit new data to the batch layer to replace the existing data. This gives you a chance to correct the raw streaming data, which improves both your real-time and overall data quality.
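The idea of serving corrected batch data alongside fresh streaming data can be illustrated with a small sketch. This is not Kylin's internal mechanism. It is a simplified model that assumes per-hour counts and a hypothetical `batch_watermark` marking the hour up to which the batch layer is authoritative.

```python
from collections import defaultdict


def unified_counts(batch_rows, streaming_rows, batch_watermark):
    """Combine per-hour counts from a batch layer and a streaming layer.

    Hours before `batch_watermark` are answered from batch rows only, so a
    correction resubmitted to the batch layer replaces what was streamed.
    Hours at or after the watermark are answered from streaming rows.
    """
    totals = defaultdict(int)
    for hour, count in batch_rows:
        if hour < batch_watermark:
            totals[hour] += count
    for hour, count in streaming_rows:
        if hour >= batch_watermark:
            totals[hour] += count
    return dict(sorted(totals.items()))


# Illustrative data: the streamed count for hour 9 was wrong (120) and has
# been corrected to 100 in the batch layer.
streaming = [(9, 120), (10, 95), (11, 40)]
batch = [(8, 300), (9, 100)]

merged = unified_counts(batch, streaming, batch_watermark=10)
print(merged)  # {8: 300, 9: 100, 10: 95, 11: 40}
```

In this model, a single query sees the corrected historical value and the latest streaming values without the end user merging two result sets by hand.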
Real-time data analysis in Apache Kylin does not currently support star or snowflake schemas the way batch analysis does. At the moment, it supports only a single fact table.
Job Node Automation
[KYLIN-3820] - Add a Curator-Based Scheduler
In the latest release, Apache Kylin provided a new service to discover job nodes. Before this feature was implemented, one weakness of cluster deployment was that the health of the job node wasn’t guaranteed, and it wasn’t simple to correct if one failed.
Apache Kylin contains different roles of nodes – one is the query node, which is responsible for answering the queries from the end user, and another is the job node, which is responsible for scheduling and managing the jobs.
In previous versions of Apache Kylin, if your job node failed, you didn’t have any way to make another node take over the role of the job node. You couldn’t continue to build jobs and would have to restart the job node or create a new job node.
With this new feature, you don’t need to do that anymore. Previously, you also had to write all of the nodes’ information in the configuration files. If you had 50 nodes in a cluster, you had to write the hostname of every node in the configuration file, which is extremely inconvenient.
A new job scheduler has been added that automatically discovers the Kylin nodes and selects a job node from among them. If a job node fails, the scheduler selects a new job node to take over job scheduling and manage the build jobs. If you have a large Apache Kylin cluster, this will significantly reduce your maintenance work.
With this new feature, all of the nodes are discovered automatically and register themselves with the system. If a node goes down, it is deregistered automatically, so you don’t need to manually register different nodes in the cluster. Maintainers of Apache Kylin clusters will be very happy about this new feature.
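The register, deregister, and take-over behavior described above can be sketched with an in-memory model. This is a simplification, not the scheduler's actual code: the `NodeRegistry` class and hostnames are hypothetical, and the sketch assumes a ZooKeeper-style election in which the longest-registered live node becomes the job scheduler.

```python
import itertools


class NodeRegistry:
    """In-memory stand-in for a ZooKeeper-style registry of live nodes."""

    def __init__(self):
        self._sequence = itertools.count()
        self._members = {}  # hostname -> sequence number assigned on registration

    def register(self, hostname):
        # Each node announces itself; no static host list is needed.
        self._members[hostname] = next(self._sequence)

    def deregister(self, hostname):
        # In a real deployment this would happen when the node's session ends.
        self._members.pop(hostname, None)

    def members(self):
        # Live nodes ordered by registration sequence.
        return sorted(self._members, key=self._members.get)

    def job_leader(self):
        # The longest-registered live node schedules jobs.
        live = self.members()
        return live[0] if live else None


registry = NodeRegistry()
for host in ("kylin-1", "kylin-2", "kylin-3"):
    registry.register(host)

first_leader = registry.job_leader()   # "kylin-1"
registry.deregister("kylin-1")         # simulate the job node failing
second_leader = registry.job_leader()  # "kylin-2" takes over scheduling
```

Because leadership is derived from the set of live nodes, no configuration file has to change when a node joins or fails.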
Integrate Apache Kylin with Apache Livy to Submit Spark Jobs
[KYLIN-3795] - Submit Spark jobs via Apache Livy
Apache Livy is another Apache Software Foundation project. It can be used with Apache Kylin to submit Spark jobs. Apache Kylin can build cubes with either MapReduce or Spark. In previous versions of Apache Kylin, if you wanted to use Spark to build jobs, you had to run a Spark driver on the same machine as the Apache Kylin job node.
That means that all of the jobs would run their process in the background on the same nodes, which created a heavy workload for the job nodes and made the cluster very unstable.
This new feature allows the administrator to configure Kylin to integrate with Apache Livy so it can use Livy to submit the Spark jobs. It will offload some of the workload of Apache Kylin’s job nodes and make the cluster more stable.
You can also use Apache Livy to monitor and manage Spark resources. This change enables administrators to maintain their clusters much more easily than in the previous versions of Apache Kylin.
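For readers unfamiliar with Livy, the sketch below shows the general shape of submitting and monitoring a Spark batch job through Livy's REST API (`POST /batches` and `GET /batches/{id}/state`). It does not reproduce what Kylin sends internally. The Livy host, jar path, class name, and arguments are hypothetical placeholders, and the requests are only built, not sent.

```python
import json
import urllib.request


def build_livy_batch(livy_url, jar_path, main_class, args, spark_conf=None):
    """Build (but do not send) a Livy batch submission request."""
    payload = {"file": jar_path, "className": main_class, "args": list(args)}
    if spark_conf:
        payload["conf"] = dict(spark_conf)
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_livy_state_check(livy_url, batch_id):
    """Build a request that asks Livy for the state of a submitted batch."""
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches/{batch_id}/state",
        method="GET",
    )


# Hypothetical endpoint, jar, class, and arguments (assumptions for illustration).
submit_request = build_livy_batch(
    "http://livy.example.com:8998",
    "hdfs:///apps/example/cube-build-job.jar",
    "com.example.CubeBuildJob",
    ["--cube", "sales_cube"],
    spark_conf={"spark.executor.memory": "4g"},
)
state_request = build_livy_state_check("http://livy.example.com:8998", 0)
# To run against a reachable Livy server:
#   with urllib.request.urlopen(submit_request) as response:
#       batch = json.load(response)  # includes the batch "id" and "state"
```

The Spark driver is started by Livy on the cluster rather than on the machine issuing the request, which is why this style of submission relieves the job node.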
Apache Kylin Cares About the Security of Its Community
Very shortly after version 3.0 was released, Apache Kylin put out minor release 3.0.1. The reason for this rapid follow-up was a security vulnerability reported by a user, which would have allowed an attacker to use Apache Kylin’s API to inject malicious SQL.
To protect users, a quick release was put out to fix this issue. As Kylin Committer Kaige Liu says, “We really take care of our users and want to protect their data.” You can find the vulnerability report on the Apache Kylin website.
This release also came with a security recommendation to the community regarding the recently reported Tomcat vulnerability. Apache Kylin now provides a solution to that vulnerability and encourages users to apply some specific configurations to avoid the risk.
Learn more about Apache Kylin with these resources:
- Apache Kylin 3.0 Release Notes: http://kylin.apache.org/docs/release_notes.html
- Real-Time OLAP Tutorial: http://kylin.apache.org/docs30/tutorial/realtime_olap.html
- Getting Started with Apache Kylin – Quick Start Guide: https://kyligence.io/resources/apache-kylin-quick-start-guide/
- Apache Kylin Q&A with Committer Kaige Liu: https://kyligence.io/blog/how-apache-kylin-is-rapidly-changing-the-way-we-approach-big-data/