KyBot Case Analysis

11 - 25 - 2016

We have met a case that an e-commerce company had run Apache Kylin over 21 nodes, they turned to KyBot and looking forward to a diagnose about their cluster health status and advices on Cube optimization. KyBot analyzed a five-day time span diagnostic pack, which including 98 Cubes as well as over 7000 Queries, ultermately identified the bottleneck of query process and optimized it.

Overall situation

98 Cube, its total size is 871GB and with a quite low expansion rate;
Over 86% queries are completed within 1 second, 98% queries are completed within 5 second. Besides, average query performance goes greatly;
The medium of building’s time comsumed is below 15mins, which is in the normal range;


SQL Execution Analysis

Select CATEGORY_LV1, sum(order_amt) as order_amt, sum(payment_amt) as payment_amt, sum(discount_amt) as discount_amt, sum(shipping_fee) as shipping_fee, sum(tax_amt) astax_amt, sum(coupon_amt) as coupon_amt, count(distinct CUSTOMER_ID) as uv, count(distinct SHIPPING_AGT_ID) as shipping_agt, count(distinct province_id) as province from t_sales_order WHERE PART_DT > ’20160901’ and PART_DT < ’20161001’ group by CATEGORY_LV1 order by CATEGORY_LV1

Cube Index’s Matching Degree Analysis

There are 8 dimensions displayed on the SQL executing process page, showing that 8 dimensions within a Cube has been involved in the SQL execution. Only the PART_DT is an effective filtering dimension and the CATEGORY_LV1 is a working aggregation group dimension, the rest dimensions are mandatory to participate execution yet not matchable for the case, causing a low match degree overall(25%). Uses are capable to cancel this mandatory setting or use joint instead.

The filtering dimension PART_DT is at the end of the index combination; considering that front dimensions have an ultra high cardinality that leading a costly filtering time comsuming and inefficient query, we sincerely suggest users update ranking of index dimensions.


SQL Execute Life Cycle Analysis

As the SQL executing life cycle analysis diagram shows, the blue part refers to SQL works on parallel scaning among multiple storage nodes, the green part refers to SQL executtions on query nodes. The green part is much longer, suggesting that executing query bottleneck is on query nodes. Due to the reason before, we advise users to reduce data post aggregation pressure or improve query nodes performance.


Optimizing Action

Add a Hierarchy aggregation group, which includes CATEGORY_LV1 and CATEGORY_LV2; and add a Joint aggregation group, which includes SELLER_ID and SHOP_NAME to cancle the defaulted Mondatory aggregation group.

Optimizing Result

Queries hit a more matchable Cuboid, leading to a great improvement on query efficiency, and query node’s running time has been shorten to 0.4 seconds.

Customer Feedback

According to KyBot’s analysis, correspondingly, we optimized Cube and increased internal storage of query nodes. Comparison tests show that the query efficiency is significantly improved. Next, we will continue to optimize Cube according to the further analysis.

Recent Post



作者陈高英 陈高英:银兴科技大数据工程师,目前主要负责Hadoop大数据平台的集群规划、咨询和实施工作,参与过多家行业知名客户的大数据平台规划设计与实施工作。 1.文档编写目的 我们在前面的文章简单介绍过Apache Kylin,请参考《如何在CDH中部署及使用Kylin》,文章中包含了如何在CDH上部署Kylin,以及创建cube,然后进行查询的两个demo例子。但对于CDH的生产系统,往往都会部署配置安全多租户,即Kerberos+Sentry,当CDH启用了Kerberos后,对于外部系统比如Kylin对接到CDH,我们就需要考虑安全认证的问题。本文主要描述如何在启用Kerberos的CDH集群中如何部署及使用Kylin。 •内容概述 1.下载Kylin 2.部署Kylin 3.Demo1 •测试环境 1.RedHat7.2 2.CM/CDH5.14.2 3.Apache Kylin2.3.1 4.集群启用Kerberos •前置条件 1.CDH集群正常运行 2.Hive,HBase服务运行正常 3.安装Kylin服务的节点已经部署Hive Gateway、HBase Gateway角色 2.下载Kylin 社区版kylin下载地址: ,本次测试使用apache-kylin-2.3.1 3.部署Kylin 这里测试只部署了一个节点,实际生产可以部署多个节点,kylin是无状态的,前端做负载均衡,kylin的负载均衡配置后续会有文章详细介绍。 1.上传kylin安装包至服务器 2.解压至/usr/local目录,并建软链接 [root@nn ~]# tar -zxvf apache-kylin-2.3.1-cdh57-bin.tar.gz -C /usr/local/ [root@nn ~]# cd /usr/local/ [root@nn local]# ln -s apache-kylin-2.3.1-bin/ kylin  3.Kylin环境配置 主要配置kylin的home目录及java环境变量,配置如下: [root@nn kylin]# vim /etc/profile export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera export KYLIN_HOME=/usr/local/kylin […]
Read More

Get Your Interactive Analytics Superpower, with Apache Kylin and Apache Superset (Incubating)

Get Your Interactive Analytics Superpower, with Apache Kylin and Apache Superset (Incubating)

Author: Yongjie Zhao, Joanna He Challenge of Big Data In the big data era, every enterprise faces the growing demand and challenge of processing large volumes of data—workloads that traditional legacy systems can no longer satisfy.
Read More

Integrating Kyligence Analytics Platform Plus 2.5 with Microsoft Azure HDInsight

Integrating Kyligence Analytics Platform Plus 2.5 with Microsoft Azure HDInsight

Introducing Kyligence Analytics Platform Plus Kyligence Analytics Platform (KAP), powered by Apache Kylin, is an enterprise OLAP engine on Hadoop.
Read More