Meet Your AI Copilot fot Data Learn More

Integrating Kyligence Enterprise Big Data Platform with Microsoft Azure HDInsight

Author
Kyligence
Jul. 03, 2018

Introducing Kyligence Extreme OLAP Analytics Platform Plus

Kyligence Analytics Platform (KAP), powered by Apache Kylin, is an extreme OLAP engine on Hadoop for enterprise. KAP enables sub-second SQL query latency on petabyte-scale datasets, provides high concurrency at web-scale, and empowers analysts to architect Big Data Analytics tools and techniques with industry-standard data warehouse and business intelligence and analytics methodology.

KAP introduces enterprise features including enhanced security, patented columnar storage, and out-of-box agile BI tool. KAP is a native OLAP on Hadoop solution that supports main Hadoop distributions from on-premises, in the cloud, and hybrid.

KAP reads source data from Apache Hive or Apache Kafka, builds data Cubes with Hadoop MapReduce or Apache Spark, and then persists the Cube in Apache HBase or KyStorage. The BI tools, dashboards, or web applications send SQL queries to KAP via JDBC/ODBC drivers or Restful API. KAP will translate it to Cube visiting and then return results in low latency.

KAP Plus, with the KyStorage as the storage engine and Spark as query engine, can work without HBase and can be 3x to 40x faster than Apache Kylin. KyStorage is an optimized columnar storage engine, which can directly be read/written from/into HDFS, Azure Blob Store and Azure Data Lake. With KAP Plus, user can gain not only better performance, but also scalability and flexibility.

Figure 1 shows the query performance of Apache Kylin (same core as KAP) with the Star Schema Benchmark dataset on 10 million, 20 million, and 40 million rows data. All the queries are finished in less than 1 second, and the performance is stable as the data grows.

Apache Kylin SSB Query Performance

Figure 1. SSB Query Performance

With the pre-calculation technology, Apache Kylin is more than 300 times faster than Apache Hive in the SSB scenario, as the Figure 2 shows. For more information, you can check https://github.com/Kyligence/ssb-kylin .

Query Latency of Apache Kylin vs. Apache Hive

Figure 2. Query Latency of Kylin vs Hive (Small is Better)

Now KAP Plus 2.5 has a lot of enhancements over the previous version, includes:
1) A newly designed Web GUI;
2) Model advisor for creating data model from SQL statements;
3) Cube optimizer for tune the Cube design;
4) Project and table level access control.

Azure HDInsight

Backed by a 99.9% SLA, Microsoft Azure HDInsight is the only fully-managed cloud Apache Hadoop offering that gives you optimized open-source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and Microsoft R Server. This architecture separates computing and storage, and since the data is persisted in Azure Blob Storage or Azure Data Lake, customers can easily start/stop and scale/shrink the cluster on demand without worrying about data loss. Azure Blob Storage and Azure Data Lake are highly available, highly scalable, high capable, HDFS compatible, low-cost storage services.

Kyligence Big Data Analytics Platform (KAP Plus) on HDInsight

With Azure HDInsight, you can easily deploy KAP in a few minutes. KAP can run on any node of the cluster, communicating with Hadoop services (Hive/YARN/MapReduce, etc.) via standard protocols. KAP automatically finds the services with the client configurations and then starts working. No additional installation/configuration is needed on Hadoop cluster. If you want to uninstall, a “kylin.sh stop” command shuts down KAP and releases all the resources from cluster.

HDInsight application platform is an easy way to distribute, discover, and install applications that you have built for the Hadoop ecosystem. HDInsight application platform automatically provisions an “Edge Node” for the application, and installs/configures the Hadoop clients on it. Developers only need to provide a shell script to install the application, and other processes are handled by Azure. This is a fancy feature that other Cloud-based Hadoop platform don’t have; This makes deploying KAP on HDInsight becomes very easy and efficient. Figure 3 shows the architecture of KAP with HDInsight.

Kyligence Big Data Analytics Platform With Azure HDInsight Architecture

Figure 3. KAP With Azure HDInsight Architecture

Both KAP and KyAnalyzer, our agile business intelligence tool, can be installed as HDInsight applications. Users can quickly get a Hadoop-based Data Warehouse + Business Intelligence tool in a couple of minutes.

Advanced users expect to optimize Hadoop and KAP to have better performance. With Apache Ambari as the centralized configuration management and monitoring platform, users can easily tune the HDInsight parameters and then sync them to all nodes. KAP reads the Hadoop configurations at runtime, so most of the Hadoop changes can take effective immediately without restarting KAP. For KAP/Kylin related configurations, users can login into the edge node with SSH, which is also easy to do.

How to Install KAP Plus

KAP Plus is now available on Microsoft Azure. Our customers have already started to use KAP Plus for their business. We, Kyligence Inc., are also using it in development, testing, training, live demo, prof of concept and other scenarios. To enable it, you just need a couple of steps:

1) New an HDInsight cluster whose type is “Hadoop” and OS is “Linux”. It can be an existing cluster or a new cluster.

Selecting

Figure 4. Select “Hadoop” as Cluster Type

2) If it is a new cluster, you need to switch to the “Custom” wizard, select “Kyligence Analytics Platform Plus 2.5” in the third step, as below:

Kyligence Big Data Analytics Platform Installation

Figure 5. Select KAP to Install

If it is an existing Hadoop cluster, you need to click the “Applications” icon, and then click “Add” button. In the application list, select “Kyligence Analytics Platform Plus” to install. Clicking “Purchase” button means you agree with the terms to install. It won’t cause additional charge for KAP; You only need to pay for the Azure resources that you’re using.

Kyligence Big Data Analytics Platform With HDInsight Applications

Figure 6. KAP in HDInsight Applications

3) After the cluster is created, you will get the KAP and KyAnalyzer URL from Azure Portal. Clicking them will lead you to the KAP web portal and KyAnalyzer web portal. You can also get the SSH endpoint of the Edge Node that running KAP, so that you can login the Linux command line remotely.

Kyligence Analytics Platform and Kyligence Analyzer URL

Figure 7. KAP and KyAnalyzer URL

Request Trial License and Login

On the first time visiting, KAP will ask for a license. If you don’t have a license, you can easily request a free trial license here. Click “Apply Evaluation License”, fill in your email address, organization name and person name, click “Submit”, it will download a trial license immediately and then you can go ahead. After the trial expires (two months), you can contact Kyligence to extend the trial or purchase a formal license.

Application of Trial License

Figure 8. Apply Trial License

The initial administrator username is “ADMIN” and the password is “KYLIN”. Enter it and then click “Submit” to login. At the first login, KAP will ask you to update the password to a stronger one. Please remember the new password for future login.

KyAnalyzer integrates with KAP for user authentication, so you just need update the user in KAP once, and then use the same in KyAnalyzer.

Play with Sample Cube

The installation will create a sample “kylin_sales_cube” (as well as sample tables in Apache Hive) in the “learn_kylin” project. In the left navigation, click “Studio” -> “Cube” then you will see the sample cube. It is in “DISABLED” status, you need build it before query. Click “Actions” -> “Build”, and then pick an end date like ‘2014-01-01’, KAP will start a build job.

Sample OLAP Cube Display

Figure 9. Sample OLAP Cube

You can monitor the build progress on KAP’s “Monitor” tab; After the build is finished (progress 100%), the Cube status is changed to “Ready”, that means you can query it with ANSI-SQL in the “Insight” page, e.g.:

select part_dt, sum(price) as total_selled, count(distinct seller_id) as sellers from kylin_sales group by part_dt order by part_dt;

KAP will return results quickly. You can run the same query in to compare the performance.

Running Kyligence Analytics Queries

Figure 10. Run Query in KAP

Now you know how to use KAP to accelerate your data analysis. But writing SQL is still troublesome for most users. You can use KyAnalyzer to analyze the data by drag-and-drop.

Import Data to KAP

KAP supports Apache Hive and Apache Kafka as the data source. Hive is for batch processing; Kafka is for streaming processing.

To use your batch data in KAP, you need to describe your files as a Hive tables as the first step. HDInsight supports using Azure Blob Store and Azure Data Lake as the storage for Hadoop, so you can easily manage and process the data on Cloud with high availability, long durability, and low-cost. Here is an example of uploading files to Azure Blob Store with command line:

export AZURE_STORAGE_ACCOUNT=<your storage account>
export AZURE_STORAGE_ACCESS_KEY=<your storage account access key>

# list all files in container
azure storage blob list <container>
# upload a file to container
azure storage blob upload <path of a local file> <container> <name in container>

Although Azure Blob Store is not a real file system, it can use “/” as the separator in the file name to simulate the folder structures. The following command will upload the local file “airline_2015_01.csv” to container “mycontainer”, and use “airline/2015/airline_2015_01.csv” as the remote path:

azure storage blob upload airline_2015_01.csv mycontainer airline/2015/airline_2015_01.csv
Once the files be uploaded to Azure Blob Store, you can use HiveQL to create the table. You can do this in the “Hive view” of HDInsight Ambari, or use the Hive command line on a HDInsight node. Below is an example of creating a partitioned Hive table with CSV files from Azure Blob Store:

hive> CREATE EXTERNAL TABLE airline_data (
Year int,
Quarter int,
Month int,
DayofMonth int,
DayOfWeek int,
FlightDate date,
...
)
PARTITIONED BY (Part_year STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
LOCATION 'wasb://@.blob.core.windows.net/airline'
TBLPROPERTIES('serialization.null.format'='','skip.header.line.count'='1');
Then you can add partitions like this:

hive> ALTER TABLE airline_data ADD PARTITION (Part_year = '2015') location 'wasb://@.blob.core.windows.net/airline/2015';

When the table created and has at least 1 partition, you can query it with SQL:
hive> select * from airline_data limit 100;

Now you have successfully created a Hive table with your data files on Azure cloud storage. You can login KAP web GUI, in the “Studio” -> “Datasource” page, click the “Hive” button to import table metadata into KAP:

Importing Apache Hive Tables With Kyligence

Figure 11. Import Hive Table

The import only synchronizes the table metadata, like column name. The source data is still in origin place, so this operation will be fast.

To use Apache Kafka as the data source, you need have a running Kafka cluster (v0.10+), which can be connected from the HDInsight cluster that running KAP. That means your Kafka brokers need have public network address, or they are in the same v-net with KAP. With this pre-requisite matched, you can follow the “Streaming Cube” tutorial on KAP manual to create the cube.

Create Model and Cube

With all necessary tables be imported, you can start to define your data model. The model is a basis for Cube, and it can be reused by multiple Cubes. In the “Studio” tab, click “+Model”, enter the name, and then drag and drop the tables to the model designer canvas.

On each table, click the “setting” icon and then mark it as “Fact table” or “Lookup table”. The fact table will be high lighted in blue. You can link two tables by drag one column from fact table to another to setup the FK/PK relationship.

KAP will automatically detect columns’ type and then determine it is a dimension, measure or none, with “D”, “M” or “-” as the prefix. If it is not accurate, you can click on the prefix to modify.

Creating Data Models With Kyligence

Figure 12. Create Data Model

On finish, click “Save” to save the data model. KAP will automatically to run a job to check and collect statistics of the model. You can track the job progress in the “Monitor” tab. The statistic will help KAP to understand model and give you advice when create Cube.

Now you can create a Cube. Cube is a data structure which has dimensions and measures. KAP supports tens of dimensions in a Cube, and you can define the most common measures like SUM, COUNT, MAX, MIN, DISTINCT COUNT and some others. Click the “+Cube”, a wizard will guide you to finish these step by step.

Creation Process for OLAP Cube

Figure 13. Create OLAP Cube

Build Cube

After the Cube be created, you need build data into it. In the Cube list, click “Action” -> “Build”. If your data model is partitioned, you need specify a date range for the source data; otherwise it will build on full data set.

Building OLAP Cubes With Kyligence Analytics Platform

Figure 14. Build OLAP Cube

When trigger a Cube build, a job is created and you can monitor the progress in the “Monitor” page. The build can take a couple of minutes to several hours, depends on your data size and cluster capacity. You can expand the job to see each steps. If a step is a MR job, it shows the link to job on Hadoop resource manager, where you can track the detail progress. Once the build is finished, the Cube status will be changed to “Ready” automatically, which means it can serve your queries.

Run Queries

Click the “Insight” tab, you can compose a SQL query and then click “Submit”; The query will be served with Cube; and you can preview the result in the web page.

Use KyAnalyzer for Agile BI

Login KyAnalyzer with the same user as KAP, in the ” Console” page, click “Sync Cubes from Kylin”, select the “kylin_sales_cube” to sync. The Cube will be imported to KyAnalyzer.

Sync OLAP Cubes to Kyligence Analyzer

Figure 15. Sync OLAP Cube to KyAnalyzer

Click the “New query” to open a new page, click the refresh button beside “Cubes” to see the new loaded model, select “kylin_sales” in the dropdown list. Now you see the measures and dimensions. Click one measure and one dimension, it will query the data and generate a data table.

Selection of Dimensions and Measures in Kyligence Analyzer Solution

Figure 16. Select Dimension and Measure in KyAnalyzer

Click the “Chart Mode” icon in the upper right corner, KyAnalyzer will generate diagram based on the table, as below shows. Isn’t it easy?

Chart Generation for Kyligence Analyzer Solution

Figure 17. Generate Chart in KyAnalyzer

Congratulations! You have built your first Cube and created the first chart with drag and drop. You can also integrate KAP with more visualization tools like Tableau, PowerBI/Excel.

Connect from Tableau, PowerBI

To connect from Tableau or PowerBI, you need install the Apache Kylin ODBC driver (only Windows supported). The detail guide can be found in https://docs.kyligence.io/v2.5/en/integration/

KAP on HDInsight works with HTTPS protocol on 443 port, so when configure the ODBC connection, please make sure using the correct URL and port. Figure 18 is a sample from Tableau.

Kyligence Analytics Platform Connection to Tableau

Figure 18. Connect From Tableau to KAP

Kafka on HDInsight Integration

Apache Kafka is a popular open-source stream-ingestion broker. It can handle large numbers of reads and writes per second from thousands of clients. KAP supports using Apache Kafka (v0.10+) as the data source. You can flow your data to Kafka topics continuously, and then build data into Cube with incremental build jobs. With this integration, it can decrease the data latency from days/hours to minutes level.

Kafka on HDInsight provides you with a managed, highly scalable, and highly available service in the Azure cloud. You can create a Kafka cluster from Azure portal quickly. To ensure the Kafka version in KAP node is compatible with the Kafka cluster, suggest you use the same HDInsight version. Otherwise, you need manually download and install the matched version of Kafka in KAP instance.

HDInsight does not allow direct connection to Kafka over the public internet. In order to connect with Kafka from KAP, you need make these two clusters share the same V-Net, or use VPN solution. The former should be the simplest way, you can use it for a new cluster.

By default, KAP doesn’t have Kafka client shipped; So you will get a “ClassNotFoundException” error when connect to a Kafka cluster. To fix this, you need SSH into the KAP node, manually specify KAFKA_HOME environment variable in KYLIN_HOME/bin/find-kafka-dependency.sh:

ssh sshuser@KAP.CLUSTERNAME-ssh.azurehdinsight.net
sudo vi /usr/local/kap/bin/setenv.sh
In the beginning position, add (remember to update the HDP version):

export KAFKA_HOME=/usr/hdp/2.5.5.3-2/kafka
Then you can verify whether it can find the dependent jar successfully, like this:

sshuser@ed10-kapdem:~$ /usr/local/kap/bin/find-kafka-dependency.sh -v
Turn on verbose mode.
Retrieving kafka dependency...
KAFKA_HOME is set to: /usr/hdp/2.5.5.3-2/kafka, use it to locate kafka dependencies.
kafka dependency: /usr/hdp/2.5.5.3-2/kafka/libs/kafka-clients-0.10.0.2.5.5.3-2.jar
If OK, then restart KAP service to take effective:

sudo systemctl restart kap
When KAP is restarted, go back to the “Studio” -> “Datasource” page, click “Kafka” button, enter the Kafka broker host and port info, it will connect to it, list the topics and then show sample messages, as Figure 19 shows.

Apche Kafka Integration With Kyligence Analytics Solution

Figure 19. Kafka Integration

For the next steps, just follow the “Streaming Cube” section in KAP manual.

Practices of KAP on HDInsight

Here are some practices that can help you in using this solution better.

1. Use dedicated virtual network for HDInsight

If not specified, Azure will use a default virtual network for each HDInsight cluster. The default virtual network is invisible in Azure portal; so once the cluster be created, you couldn’t add other instances to it nor customize its network firewall. In order to keep the flexibility for future, create a dedicated virtual network and then use it for the new HDInsight is recommended. The virtual network need be associated with a network security group, which need allow the connections for HDInsight required hosts and ports. You can check HDInsight documentation for the details.

2. Optimize Hadoop settings

The default Hadoop settings of HDInsight might not be the best for KAP, it may allocate much resources to HBase but leave less for MapReduce. Besides, as the Cubes are immutable in HBase, we can allocate more resource to block cache to gain better read performance. You can optimize these settings in the HDInsight dashboard (Ambari Web UI). Here are some recommended settings:

Adjustment of Hadoop Settings of HDInsight

After update these parameters, remember to sync the changes to cluster nodes and then restart the services to take effective.

3. Use multiple storage accounts

Using multiple storage accounts will better leverage the network bandwidth, so to improve the performance. You can use separated storage accounts for the source Hive tables.

4. Scale and shrink on demand

With data in Azure Blob Store or Azure Data Lake, you can easily scale and shrink the cluster without worry about data loss. For example, when you build a big Cube, you can scale out it to get more concurrencies; After the build is finished, you can shrink it back to save the cost.

The Spark executor for KyStorage has enabled dynamic resource allocation, you can manually adjust the max/min executors in KAP’s“kylin.properties”, for example:

kap.storage.columnar.spark-conf.spark.dynamicAllocation.enabled=true
kap.storage.columnar.spark-conf.spark.dynamicAllocation.maxExecutors=5
kap.storage.columnar.spark-conf.spark.dynamicAllocation.minExecutors=1
kap.storage.columnar.spark-conf.spark.dynamicAllocation.initialExecutors=1
Restart KAP service to make it effective:

sudo systemctl restart kap
5. Use an external SQL Server as KAP met