In the big data era, every enterprise faces the growing demand and challenge of processing large volumes of data—workloads that traditional legacy systems can no longer satisfy. With the emergence of Machine Learning, Artificial Intelligence (AI), and Internet-of-Things (IoT) technology, it has become mission critical for businesses to accelerate their pace of discovering valuable insights from their massive and ever-growing datasets.
Large companies are therefore constantly searching for a solution, often turning to open source business intelligence technologies. This article introduces two open source Big Data technologies that, when combined, can meet these pressing business intelligence and analytics demands for large enterprises.
Modern organizations have had a long history of applying Online Analytical Processing (OLAP) on Big Data technologies to analyze data and uncover business insights. These insights help businesses make informed decisions and improve their services and products. With the emergence of the Hadoop ecosystem, OLAP technology has also embraced new capabilities in the Big Data era.
Apache Kylin is one such technology that directly addresses the challenge of conducting analytical workloads on massive datasets. It is already widely adopted by enterprises around the world. With powerful pre-calculation OLAP technology, it enables sub-second query latency over petabyte-scale datasets.
Apache Kylin is designed to consume data from Hadoop-based data sources as well as other relational database management systems (RDBMS). Analysts can query Apache Kylin with standard SQL through ODBC, JDBC, and a RESTful API, which allows the platform to integrate with third-party applications.
Figure 1: Apache Kylin Architecture
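As a rough illustration of the RESTful access path mentioned above, the sketch below builds (but does not send) a SQL query request using only the Python standard library. It assumes Kylin's query endpoint lives at /kylin/api/query and accepts HTTP Basic authentication. The host localhost:7070, the project name learn_kylin, and the ADMIN/KYLIN account are illustrative values, not something your deployment is guaranteed to use. The helper name build_kylin_query_request is hypothetical.

```python
import base64
import json
import urllib.request


def build_kylin_query_request(base_url, user, password, project, sql, limit=100):
    """Build (but do not send) a POST request for Kylin's REST query endpoint."""
    # Assumed endpoint path; verify it against the REST API docs of your Kylin version.
    url = base_url.rstrip("/") + "/kylin/api/query"
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), method="POST"
    )
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Content-Type", "application/json")
    return req


# Illustrative values only: host, port, project, and credentials are assumptions.
request = build_kylin_query_request(
    "http://localhost:7070", "ADMIN", "KYLIN", "learn_kylin",
    "select count(*) from kylin_sales",
)
# To actually run the query against a live server:
# urllib.request.urlopen(request).read()
```

Building the request separately from sending it keeps the example runnable without a Kylin server and makes the URL, headers, and JSON body easy to inspect.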
In a fast-paced and rapidly changing business environment, business users and analysts are expected to uncover insights at the speed of thought. Apache Kylin lets them meet this expectation, so they are no longer stuck waiting hours for a single query to return results.
Such a powerful data processing engine empowers the data scientists, engineers, and business analysts of any enterprise to find insights that inform critical business decisions. However, business decisions cannot be made without rich data visualization. To address this last-mile challenge of big data analytics, Apache Superset comes into the picture.
Apache Superset is a data exploration and visualization platform designed to be visual, intuitive, and interactive. A user can access data in the following two ways:
Users can immediately analyze and visualize their query results using Apache Superset's rich visualization and reporting features.
Figure 3: Apache Superset Visualization Interface
Both Apache Kylin and Apache Superset are built to provide fast and interactive analytics for their users. Combining these two open source projects makes that goal achievable on petabyte-scale datasets, thanks to the pre-calculated Kylin Cube.
The Kyligence Data Science team has recently open sourced kylinpy, a project that makes this combination possible. Kylinpy is a Python-based Apache Kylin client library. With this library installed, any application that uses SQLAlchemy, including Apache Superset, can query Apache Kylin. Below is a brief tutorial that shows how to integrate Apache Kylin and Apache Superset.
Prerequisite
1. Install Apache Kylin
Please refer to this installation tutorial.
2. Apache Kylin provides a script for creating a sample Cube. After you have successfully installed Kylin, run the script below from the installation directory to generate the sample project and Cube.
${KYLIN_HOME}/bin/sample.sh
3. When the script finishes running, log in to the Apache Kylin web UI with the default user ADMIN/KYLIN. On the System page, click “Reload Metadata” and you will see a sample project called “Learn Kylin.”
4. Select the sample cube “kylin_sales_cube”, click “Actions” -> “Build”, and pick an end date later than 2014-01-01 (to cover all 10,000 sample records).
5. Check the build progress in the “Monitor” tab until it reaches 100%.
6. Execute SQL in the “Insight” tab, for example:
select part_dt,
       sum(price) as total_selled,
       count(distinct seller_id) as sellers
from kylin_sales
group by part_dt
order by part_dt
This query will hit on the newly built Cube “Kylin_sales_cube”.
7. Next, install and initialize Apache Superset.
Refer to the instructions on the official Apache Superset website to install and initialize it.
8. Install kylinpy
$ pip install kylinpy
9. Verify your installation. If everything goes well, the Apache Superset daemon should be up and running.
$ superset runserver -d
Starting server with command:
gunicorn -w 2 --timeout 60 -b 0.0.0.0:8088 --limit-request-line 0 --limit-request-field_size 0 superset:app
[2018-01-03 15:54:03 +0800] [73673] [INFO] Starting gunicorn 19.7.1
[2018-01-03 15:54:03 +0800] [73673] [INFO] Listening at: http://0.0.0.0:8088 (73673)
[2018-01-03 15:54:03 +0800] [73673] [INFO] Using worker: sync
[2018-01-03 15:54:03 +0800] [73676] [INFO] Booting worker with pid: 73676
[2018-01-03 15:54:03 +0800] [73679] [INFO] Booting worker with pid: 73679
Now everything you need is installed and ready to go. Let’s try to create a data source in Apache Superset.
1. Open http://localhost:8088 in your web browser and log in with the credentials you set during the Apache Superset installation.
Figure 5: Apache Superset Login Page
2. Go to Source -> Datasource to configure a new data source.
The SQLAlchemy URI pattern is: kylin://<username>:<password>@<hostname>:<port>/<project name>
Check “Expose in SQL Lab” if you want to expose this data source in SQL Lab.
Click “Test Connection” to see if the URI is working properly.
Figure 6: Create a Kylin Data Source
Figure 7: Test Connection to Apache Kylin
If the connection is successful, all the tables from the “Learn_kylin” project will show up at the bottom of the connection page.
Figure 8: Tables will show up if connection is successful
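The SQLAlchemy URI pattern used in the data source form can be assembled programmatically, which helps when a password contains characters such as “@” or “/” that would otherwise break the URI. The sketch below is a minimal helper using only the Python standard library. The function name build_kylin_uri is hypothetical, and the values localhost, 7070, ADMIN/KYLIN, and learn_kylin are assumptions for illustration.

```python
from urllib.parse import quote


def build_kylin_uri(username, password, hostname, port, project):
    """Assemble kylin://<username>:<password>@<hostname>:<port>/<project name>."""
    # Percent-encode user-supplied parts so reserved characters
    # (e.g. "@" or "/") cannot be mistaken for URI delimiters.
    return "kylin://{}:{}@{}:{}/{}".format(
        quote(username, safe=""),
        quote(password, safe=""),
        hostname,
        port,
        quote(project, safe=""),
    )


# Illustrative values only; replace them with your own deployment's settings.
uri = build_kylin_uri("ADMIN", "KYLIN", "localhost", 7070, "learn_kylin")
print(uri)
```

The resulting string is what you would paste into the “SQLAlchemy URI” field. With kylinpy installed, the same string should also be usable with SQLAlchemy's create_engine outside of Superset.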
1. Go to Source -> Tables to add a new table, then type in a table name from the “Learn_kylin” project, for example “Kylin_sales”.
Figure 9 Add Kylin Table in Apache Superset
2. Click on the table you created. Now you are ready to analyze your data.
Figure 10 Query Single Table From Apache Kylin
Kylin's OLAP cube is usually based on a data model that joins multiple tables, so it is quite common to query multiple tables at the same time. In Apache Superset, you can use SQL Lab to join your data across tables by composing SQL queries. As an example, we will use a query that hits the sample cube “kylin_sales_cube”.
When you run your query in SQL Lab, the result will come from the data source, in this case, Apache Kylin.
Figure 11 Query Multiple Tables From Apache Kylin Using SQL Lab
When the OLAP query returns results, you may immediately visualize them by clicking on the “Visualize” button.
Figure 12 Define Your Query and Visualize It Immediately
You may copy the entire SQL below to experience how you can query a Kylin Cube in SQL Lab.
select YEAR_BEG_DT,
       MONTH_BEG_DT,
       WEEK_BEG_DT,
       META_CATEG_NAME,
       CATEG_LVL2_NAME,
       CATEG_LVL3_NAME,
       OPS_REGION,
       NAME as BUYER_COUNTRY_NAME,
       sum(PRICE) as GMV,
       sum(ACCOUNT_BUYER_LEVEL) ACCOUNT_BUYER_LEVEL,
       count(*) as CNT
from KYLIN_SALES
join KYLIN_CAL_DT
  on CAL_DT=PART_DT
join KYLIN_CATEGORY_GROUPINGS
  on SITE_ID=LSTG_SITE_ID
  and KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID=KYLIN_SALES.LEAF_CATEG_ID
join KYLIN_ACCOUNT
  on ACCOUNT_ID=BUYER_ID
join KYLIN_COUNTRY
  on ACCOUNT_COUNTRY=COUNTRY
group by YEAR_BEG_DT, MONTH_BEG_DT, WEEK_BEG_DT, META_CATEG_NAME, CATEG_LVL2_NAME, CATEG_LVL3_NAME, OPS_REGION, NAME
Most of the common reporting features are available in Apache Superset. Now let’s see how we can use those features to analyze data from Apache Kylin.
Sorting
You may sort by a measure regardless of how it is visualized.
You may specify a “Sort By” measure or sort the measure on the visualization after the query returns.
Filtering
There are multiple ways you may filter data from Kylin.
1. Date Filter
You may filter date and time dimensions with the calendar filter.
2. Dimension Filter
For other dimensions, you may filter with SQL conditions such as “in, not in, equal to, not equal to, greater than and equal to, smaller than and equal to, greater than, smaller than, like”.
3. Search Box
In some visualizations, it is also possible to further narrow down your result set with the “Search Box” after the query has returned from the data source.
4. Filtering the measure
Apache Superset allows you to write a “having clause” to filter the measure.
5. Filter Box
The filter box visualization allows you to create a drop-down style filter that can filter all slices on a dashboard dynamically.
As the screenshot below shows, if you filter the CATEG_LVL2_NAME dimension from the filter box, all the visualizations on this dashboard will be filtered based on your selection.
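To make the difference between dimension filters and measure filters concrete, the sketch below shows roughly how they map onto SQL: a dimension condition goes in the WHERE clause (applied before aggregation) and a measure condition goes in the HAVING clause (applied after aggregation). This is not Superset's actual query generator. The helper name build_filtered_query and the filter values are hypothetical, chosen only to match the sample kylin_sales table.

```python
def build_filtered_query(table, dimension, measure, where=None, having=None):
    """Compose an aggregate query with optional dimension and measure filters."""
    parts = [f"select {dimension}, {measure}", f"from {table}"]
    if where:
        # Dimension filter: restricts rows before they are aggregated.
        parts.append(f"where {where}")
    parts.append(f"group by {dimension}")
    if having:
        # Measure filter: restricts groups after aggregation.
        parts.append(f"having {having}")
    return "\n".join(parts)


# Illustrative filter values; adjust them to your own data.
sql = build_filtered_query(
    "kylin_sales",
    "part_dt",
    "sum(price) as gmv",
    where="part_dt >= date '2013-01-01'",
    having="sum(price) > 1000",
)
print(sql)
```

The Search Box, by contrast, works on the rows already returned to the browser, so it adds nothing to the SQL sent to Kylin.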
To speed up Top N queries, Apache Kylin provides an approximate Top N measure that pre-calculates the top records. In Apache Superset, you can combine the “Sort By” and “Row Limit” features so that your query takes advantage of the Top N pre-calculation in the OLAP Cube.
Apache Kylin users often need to deal with high cardinality dimensions. A visualization of a high cardinality dimension displays too many distinct values and takes a long time to render. For this case, Apache Superset provides a page length feature that limits the number of rows per page, which reduces the up-front rendering effort.
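A Top N query is typically one that groups by a dimension, orders by an aggregated measure in descending order, and limits the number of rows, which is what “Sort By” plus “Row Limit” amounts to. The sketch below generates such a query. It assumes the cube defines a Top N measure on PRICE grouped by SELLER_ID; whether a given query actually uses the pre-calculated measure depends on your cube design. The helper name top_n_query is hypothetical.

```python
def top_n_query(table, dimension, measure, n):
    """Build a 'group by, order by aggregate desc, limit n' query."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return (
        f"select {dimension}, sum({measure}) as total\n"
        f"from {table}\n"
        f"group by {dimension}\n"
        # Sorting on the aggregated measure corresponds to "Sort By" ...
        f"order by sum({measure}) desc\n"
        # ... and the limit corresponds to "Row Limit".
        f"limit {n}"
    )


# Assumed sample-cube names: top 10 sellers by total price.
sql = top_n_query("kylin_sales", "seller_id", "price", 10)
print(sql)
```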
Apache Superset provides a rich and extensive set of visualizations, from basic charts such as pie, bar, and line charts to advanced visualizations such as sunburst, heatmap, world map, and Sankey diagram.
Figure 21
Figure 22
Figure 23 World Map Visualization
Figure 24 Bubble Chart
Other functionalities
Apache Superset also supports exporting to CSV, sharing, and viewing the SQL query.
With the right technical synergy between open source projects, you can achieve results greater than the sum of their parts. Kylin's pre-calculation technology accelerates visualization performance, and the rich functionality of Apache Superset lets you make full use of the OLAP Cube's features. When you marry the two, you get accelerated interactive analytics.